SlideShare une entreprise Scribd logo
1  sur  44
Authority Control for
Digital Collections

October
18, 2013

Jeremy Myntti
Head of Cataloging & Metadata Services
Nate Cothran
Vice President, Automation Services
Why Authority Control?

Vocabulary control
 Consistency
 Standardization
 Updating mechanism


2
Inconsistency
Gosiute Indians
Goshute Indians
Navajo Indians
Navaho Indians

Salt Lake
Salt Lake City
Salt Lake City (Utah)
Bishop, Dail Stapley
Bishop, Dale Stapely
Bishop, Dale Stapley

Beckwith, Frank A. (1876-1951)
Beckwith, Frank Asahel (18761951)
Beckwith, Frank A.
Beckwith, Frank A. (1876-1951)
Beckwith, Frank Asahel (1876-1951)
Beckwith, Frank Asahel, 1876-1951
Hansen, Henrie
Hansen, Henry
Hansen, Henry Daniel, 1896-1979

3
Partnership

4
Raw MARC
00491nam 22001571
450000100140000000300080001400500170
002200800410003904000120008010000310
009224500380012326000100016130000180
0171505012700189650001700316ovld0000
00001UtOrBLW20121220145806.0121220s1
991 xx
eng u aUtOrBLW1
aMercer, Leigh,d1893-1977.12aA man, a
plan, a canal -- Panama! c1991. a303 p.
;ccm. aDo geese see God? -- Murder for a
jar of red rum -- Some men interpret nine
memos - Never odd or even. 0aPalindromes.
5
Raw MARC
00491nam 22001571
450000100140000000300080001400500170
002200800410003904000120008010000310
009224500380012326000100016130000180
0171505012700189650001700316ovld0000
00001UtOrBLW20121220145806.0121220s1
991 xx
eng u aUtOrBLW1
aMercer, Leigh,d1893-1977.12aA man, a
plan, a canal -- Panama! c1991. a303 p.
;ccm. aDo geese see God? -- Murder for a
jar of red rum -- Some men interpret nine
memos -- Never odd or even.
0aPalindromes.
6
MARC Display
001

ovld0123456789

…
100 1
245 12
…
505

650

0

$a Mercer, Leigh, $d 1893-1977.
$a A man, a plan, a canal – Panama!
$a Do geese see God? – Murder for a jar of
red rum – Some men interpret nine memos –
Never odd or even.

$a Palindromes

7
MARC as a Container

Portable
Entrenched
Tools

8
MARC History
MARC

MARC

MARC

MARC

MARC

RDA

1960s

1970s

1980s

1990s

2000s

2010s

XML

XML

XML

MARC?

9
MARC Display
001

ovld0123456789

…
100 1
245 12
…
505

650

0

$a Mercer, Leigh, $d 1893-1977.
$a A man, a plan, a canal – Panama!
$a Do geese see God? – Murder for a jar of
red rum – Some men interpret nine memos –
Never odd or even.

$a Palindromes

10
MARC XML
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
<leader>00491nam 22001571 4500</leader>
<controlfield tag="001">ovld000000001</controlfield>
…
<datafield tag="100" ind1="1" ind2=" "><subfield code="a">Mercer, Leigh,</subfield>
<subfield code="d">1893-1977. </subfield></datafield>
<datafield tag="245" ind1="1" ind2=“2"><subfield code="a">A man, a plan, a canal -- Panama!
</subfield></datafield>
…
<datafield tag="505" ind1=" " ind2=" "><subfield code="a">Do geese see God? -- Murder for a
jar of red rum -- Some men interpret nine memos -- Never odd or even.</subfield></datafield>
<datafield tag="650" ind1=" " ind2="0"><subfield code="a">Palindromes.</subfield></datafield>
</record>

</collection>

11
MARC XML
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
<leader>00491nam 22001571 4500</leader>
<controlfield tag="001">ovld000000001</controlfield>
…
<datafield tag="100" ind1="1" ind2=" "><subfield code="a">Mercer, Leigh,</subfield>
<subfield code="d">1893-1977. </subfield></datafield>
<datafield tag="245" ind1="1" ind2=“2"><subfield code="a">A man, a plan, a canal -- Panama!
</subfield></datafield>
…
<datafield tag="505" ind1=" " ind2=" "><subfield code="a">Do geese see God? -- Murder for a
jar of red rum -- Some men interpret nine memos -- Never odd or even.</subfield></datafield>
<datafield tag="650" ind1=" " ind2="0"><subfield code="a">Palindromes.</subfield></datafield>
</record>

</collection>

12
XML Defined







XML stands for eXtensible Markup Language
XML is similar to HTML, but also different
XML was designed to carry data, not display data
XML tags are not pre-defined; you must define tags
XML is designed to be self-descriptive
XML is a W3C recommendation

13
XML Invented





There are no defined XML tags (title = <title>)
Tags in HTML are defined (<p> = paragraph)
Anyone can design their own XML schema
Common library-centric XML schemas include:
◦ MODS, MARC XML, EAD, Dublin Core, ONIX, XC
◦ METS Alto, TEI, BIBFRAME / RDF, CONTENTdm
◦ All of these either follow their own schema or borrow
elements from each other

14
XML Structure, Tree


Increases in complexity
◦ Root
◦ Branches
◦ Leaves

15
XML Structure, Family

Parents > Children
 Children : Siblings
 Children > Attributes
 Attributes : Values


16
MARC XML
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
<leader>00491nam 22001571 4500</leader>
<controlfield tag="001">ovld000000001</controlfield>
…
<datafield tag="100" ind1="1" ind2=" "><subfield code="a">Mercer, Leigh,</subfield>
<subfield code="d">1893-1977. </subfield></datafield>
<datafield tag="245" ind1="1" ind2=“2"><subfield code="a">A man, a plan, a canal -- Panama!
</subfield></datafield>
…
<datafield tag="505" ind1=" " ind2=" "><subfield code="a">Do geese see God? -- Murder for a
jar of red rum -- Some men interpret nine memos -- Never odd or even.</subfield></datafield>
<datafield tag="650" ind1=" " ind2="0"><subfield code="a">Palindromes.</subfield></datafield>
</record>

</collection>

17
MARC XML, Parent / Child
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="http://www.loc.gov/MARC21/slim">
<parent>
<child>00491nam 22001571 4500</child>
<child att="001">ovld000000001</child>
…
<child att="100" att="1" att=" "><child att="a">Mercer, Leigh,</child><child att="d">1893-1977.
</child></child>
<child att="245" att="1" att= “2"><child att="a">A man, a plan, a canal -- Panama!</child></child>
…
<child att="505" att=" " att=" "><child att="a">Do geese see God? -- Murder for a jar of red rum
-- Some men interpret nine memos -- Never odd or even.</child></child>
<child att="650" att=" " att="0"><child att="a">Palindromes.</child></child>
</parent>

</root>

18
Methodology


Extracting data from CONTENTdm
◦ Stop updates to collection and make it read-only
◦ Make copy of desc.all metadata file for backup
◦ Run desc.all file through AC processing
◦ Replace desc.all file on CONTENTdm server
◦ Run the full collection index
◦ Remove read-only status from collection
<creata>Ogburn, Joyce L.</creata>
<title>Acquiring minds want to know: digital scholarship</title>
<date>2003</date>
<descri>A new form of scholarship has emerged in recent years named "digital
scholarship.“</descri>
<type>text;</type>
<subjec>Born digital; Libraries; Electronic publishing</subjec>
<subjea>Libraries and electronic publishing; Scholarly electronic publishing
</subjea>
19
Normalized Headings
MARC
600 $a Smith, John, $d 1947 Apr. 16651 $a United States $x History $y 20th century.
600 $ SMITH, JOHN $ 1947 APR 16
651 $ UNITED STATES $ HISTORY $ 20TH CENTURY
XML
<subject>Smith, John, 1947 Apr. 16-;
United States--History--20th century;</subject>

<subject>SMITH, JOHN, 1947 APR 16;
UNITED STATES--HISTORY--20TH CENTURY;</subject>
20
MARC & XML, Normalized


MARC
◦ 651 $a United States $x History $y 20th century.
◦ 651 $ UNITED STATES $ HISTORY $ 20TH CENTURY



XML
◦

United States--History--20th century;

◦ 650 $a United States $x History $x 20th century
◦ 650 $ UNITED STATES $ HISTORY $ 20TH CENTURY

21
MARC & XML Examples


MARC
◦ 600 $a Smith, John, $d 1947 Apr. 16◦ 651 $a United States $x History $y 20th century.
◦ 650 $a Navajo Indians $x History.
◦ 650 $a Religious ceremonies.

◦ 650 $a Religion.


XML
◦ Smith, John, 1947 Apr. 16-; United States--History--20th century;
Navajo Indians--History; Religious ceremonies; Religion;

22
MARC & XML Conversions


MARC
◦ 600 $a Smith, John, $d 1947 Apr. 16-

◦ 651 $a United States $x History $y 20th century.
◦ 650 $a Navajo Indians $x History.
◦ 650 $a Religious ceremonies.
◦ 650 $a Religion.


XML
◦ 600 $a Smith, John, $d 1947 Apr. 16◦ 650 $a United States $x History $x 20th century
◦ 650 $a Navajo Indians $x History
◦ 650 $a Religious ceremonies
◦ 650 $a Religion
23
XML Authority Matching


XML
◦ 600 $a Smith, John, $d 1947 Apr. 16◦ 650 $a United States $x History $x 20th century
◦ 650 $a Navajo Indians $x History
◦ 650 $a Religious ceremonies
◦ 650 $a Religion



Authority
◦
◦
◦
◦
◦
◦
◦
◦
◦

150 $a Rites and ceremonies
450 $a Ceremonies
450 $a Cult
450 $a Cultus
450 $a Ecclesiastical rites and ceremonies
450 $a Religious ceremonies
450 $a Religious rites
450 $a Rites of passage
450 $a Traditions

24
XML Authority Control


XML
◦ <subject>Smith, John, 1947 Apr. 16-; United States-History--20th century; Navajo Indians--History; Rites
and ceremonies; Religion;</subject>



Variants
◦ <subject-keyword>Ceremonies; Cult; Cultus;
Ecclesiastical rites and ceremonies; Religious
ceremonies; Religious rites; Rites of passage;
Traditions;</subject-keyword>

25
MARC vs XML Tags


MARC
◦ 600, 610, 611, 630, 650, 651, 655



XML
◦ <subject>, <subjeca>, <subjec-1>, <subjecttitle>, <topics>, <entry>, <creata>, <s01>, <genre>, e
tc

26
Pre-CONTENTdm Processing
<records>
<record>
<title>…</title>
</record>
<record>
<title>…</title>
</record>
</records>


Temporary container elements are added

27
Post-CONTENTdm Processing

<title>…</title>

<title>…</title>



Temporary container elements are removed

28
CDM Before & After
after
title
subject
creator

after
title
creator
subjectkeyword

before
title
creator
subject

after
title
creator
subject
keyword

after
title
creator
subject
29
Other Issues
Bad OCR data
 <descr> or <transc> field elements
 Errant angle brackets: < or >
 CDATA workaround (pre-processing)


<transc>...
immunofluorescence at
sites of actin-membrane<
f'' " • - t . "U . * • ?vl /&gt;/>
. * .. O' ' .< .* V#•* * •• * r , .
* . y1 •*
...</transc>

• System tries to assign
< … > as discrete child
of transc

30
Other Issues
Bad OCR data
 <descr> or <transc> field elements
 Errant angle brackets: < or >
 CDATA workaround (pre-processing)


• CDATA blocks effectively
cancel out all data between
them

• CDATA = character data

<![CDATA[<transc>...
immunofluorescence at
sites of actin-membrane<
f'' " • - t . "U . * • ?vl /&gt;/>
. * .. O' ' .< .* V#•* * •• * r , .
* . y1 •*
...</transc>]]>
31
Statistics for USpace
Changed Records

Matched Records

Changed
35%

Subjects
33%

Records
6,308

Records
15,800

Matched Records

Names
28%

Records
22,306

32
Standardization: Missing semicolons



Smith, John, 1932- United States-History --20th century;



Smith, John, 1932-; United States-History --20th century;

33
Standardization: Subdivisions

United States--History--20th century;
 United States – History – 20th century;
 United States-History-20th century;
 United States-History--20th century;


34
Standardization: Dates
Improper
 1980-3;
 20130526;
 May 26, 2013;
Proper
 1980; 1981; 1982; 1983;
 2013-05-26;
35
Reports
Partially Matched Subject Headings
Unmatched Names
Split Subject Headings

Updated Subject Headings

36
Near Match Report
Levenshtein-Distance Algorithm

Near Matches

37
Future Plans: Ongoing AC via Linked Data

38
39
Future Plans: Ongoing AC via Linked Data

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<madsrdf:PersonalName
rdf:about="http://id.loc.gov/authorities/names/n88283447">
<rdf:type rdf:resource="http://www.loc.gov/mads/rdf/v1#Authority"/>
<madsrdf:authoritativeLabel xml:lang="en">Ogburn, Joyce L.
</madsrdf:authoritativeLabel>
...
</madsrdf:PersonalName>
</rdf:RDF>

40
Linked Data: Cloud Diagram

Linking Open Data cloud diagram,
by Richard Cyganiak and Anja Jentzsch.
http://lod-cloud.net/

41
Future Plans: Legacy Collections

42
Other XML File Structures
Dublin Core

ONIX

record>recordData>dc:creator

Product>Contributor>PersonNameInverted

record>recordData>dc:subject

Product>Contributor>PersonNameInverted>PersonDate>Date

record>recordData>dc:identifier

Product>Conference>ConferenceName
Product>Conference>ConferenceDate
Product>Conference>ConferencePlace

EAD
archdesc>did>repository>corpname
archdesc>did>repository>controlaccess>persname

Product>MainSubject>SubjectHeadingText

Product>PlaceAsSubject
Product>CorporateName

archdesc>did>repository>controlaccess>subject

archdesc>controlaccess>geogname

XC
xc:entity>xc:creator

eadheader>eadid

xc:entity>xc:subject

archdesc>did>unitid

xc:entity>xc:type

xc:entity>xc: id…

43
Questions
Jeremy Myntti | jeremy.myntti@utah.edu
Head of Cataloging & Metadata Services

Nate Cothran | nate@bslw.com
Vice President, Automation Services

44

Contenu connexe

Dernier

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Dernier (20)

(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

En vedette

PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Applitools
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at WorkGetSmarter
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...DevGAMM Conference
 

En vedette (20)

Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 
More than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike RoutesMore than Just Lines on a Map: Best Practices for U.S Bike Routes
More than Just Lines on a Map: Best Practices for U.S Bike Routes
 
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
Ride the Storm: Navigating Through Unstable Periods / Katerina Rudko (Belka G...
 

Authority Control for Digital Collections Metadata

  • 1. Authority Control for Digital Collections October 18, 2013 Jeremy Myntti Head of Cataloging & Metadata Services Nate Cothran Vice President, Automation Services
  • 2. Why Authority Control? Vocabulary control  Consistency  Standardization  Updating mechanism  2
  • 3. Inconsistency Gosiute Indians Goshute Indians Navajo Indians Navaho Indians Salt Lake Salt Lake City Salt Lake City (Utah) Bishop, Dail Stapley Bishop, Dale Stapely Bishop, Dale Stapley Beckwith, Frank A. (1876-1951) Beckwith, Frank Asahel (18761951) Beckwith, Frank A. Beckwith, Frank A. (1876-1951) Beckwith, Frank Asahel (1876-1951) Beckwith, Frank Asahel, 1876-1951 Hansen, Henrie Hansen, Henry Hansen, Henry Daniel, 1896-1979 3
  • 5. Raw MARC 00491nam 22001571 450000100140000000300080001400500170 002200800410003904000120008010000310 009224500380012326000100016130000180 0171505012700189650001700316ovld0000 00001UtOrBLW20121220145806.0121220s1 991 xx eng u aUtOrBLW1 aMercer, Leigh,d1893-1977.12aA man, a plan, a canal -- Panama! c1991. a303 p. ;ccm. aDo geese see God? -- Murder for a jar of red rum -- Some men interpret nine memos - Never odd or even. 0aPalindromes. 5
  • 6. Raw MARC 00491nam 22001571 450000100140000000300080001400500170 002200800410003904000120008010000310 009224500380012326000100016130000180 0171505012700189650001700316ovld0000 00001UtOrBLW20121220145806.0121220s1 991 xx eng u aUtOrBLW1 aMercer, Leigh,d1893-1977.12aA man, a plan, a canal -- Panama! c1991. a303 p. ;ccm. aDo geese see God? -- Murder for a jar of red rum -- Some men interpret nine memos -- Never odd or even. 0aPalindromes. 6
  • 7. MARC Display 001 ovld0123456789 … 100 1 245 12 … 505 650 0 $a Mercer, Leigh, $d 1893-1977. $a A man, a plan, a canal – Panama! $a Do geese see God? – Murder for a jar of red rum – Some men interpret nine memos – Never odd or even. $a Palindromes 7
  • 8. MARC as a Container Portable Entrenched Tools 8
  • 10. MARC Display 001 ovld0123456789 … 100 1 245 12 … 505 650 0 $a Mercer, Leigh, $d 1893-1977. $a A man, a plan, a canal – Panama! $a Do geese see God? – Murder for a jar of red rum – Some men interpret nine memos – Never odd or even. $a Palindromes 10
  • 11. MARC XML <?xml version="1.0" encoding="UTF-8"?> <collection xmlns="http://www.loc.gov/MARC21/slim"> <record> <leader>00491nam 22001571 4500</leader> <controlfield tag="001">ovld000000001</controlfield> … <datafield tag="100" ind1="1" ind2=" "><subfield code="a">Mercer, Leigh,</subfield> <subfield code="d">1893-1977. </subfield></datafield> <datafield tag="245" ind1="1" ind2=“2"><subfield code="a">A man, a plan, a canal -- Panama! </subfield></datafield> … <datafield tag="505" ind1=" " ind2=" "><subfield code="a">Do geese see God? -- Murder for a jar of red rum -- Some men interpret nine memos -- Never odd or even.</subfield></datafield> <datafield tag="650" ind1=" " ind2="0"><subfield code="a">Palindromes.</subfield></datafield> </record> </collection> 11
  • 12. MARC XML <?xml version="1.0" encoding="UTF-8"?> <collection xmlns="http://www.loc.gov/MARC21/slim"> <record> <leader>00491nam 22001571 4500</leader> <controlfield tag="001">ovld000000001</controlfield> … <datafield tag="100" ind1="1" ind2=" "><subfield code="a">Mercer, Leigh,</subfield> <subfield code="d">1893-1977. </subfield></datafield> <datafield tag="245" ind1="1" ind2=“2"><subfield code="a">A man, a plan, a canal -- Panama! </subfield></datafield> … <datafield tag="505" ind1=" " ind2=" "><subfield code="a">Do geese see God? -- Murder for a jar of red rum -- Some men interpret nine memos -- Never odd or even.</subfield></datafield> <datafield tag="650" ind1=" " ind2="0"><subfield code="a">Palindromes.</subfield></datafield> </record> </collection> 12
  • 13. XML Defined       XML stands for eXtensible Markup Language XML is similar to HTML, but also different XML was designed to carry data, not display data XML tags are not pre-defined; you must define tags XML is designed to be self-descriptive XML is a W3C recommendation 13
  • 14. XML Invented     There are no defined XML tags (title = <title>) Tags in HTML are defined (<p> = paragraph) Anyone can design their own XML schema Common library-centric XML schemas include: ◦ MODS, MARC XML, EAD, Dublin Core, ONIX, XC ◦ METS Alto, TEI, BIBFRAME / RDF, CONTENTdm ◦ All of these either follow their own schema or borrow elements from each other 14
  • 15. XML Structure, Tree  Increases in complexity ◦ Root ◦ Branches ◦ Leaves 15
  • 16. XML Structure, Family Parents > Children  Children : Siblings  Children > Attributes  Attributes : Values  16
  • 17. MARC XML <?xml version="1.0" encoding="UTF-8"?> <collection xmlns="http://www.loc.gov/MARC21/slim"> <record> <leader>00491nam 22001571 4500</leader> <controlfield tag="001">ovld000000001</controlfield> … <datafield tag="100" ind1="1" ind2=" "><subfield code="a">Mercer, Leigh,</subfield> <subfield code="d">1893-1977. </subfield></datafield> <datafield tag="245" ind1="1" ind2=“2"><subfield code="a">A man, a plan, a canal -- Panama! </subfield></datafield> … <datafield tag="505" ind1=" " ind2=" "><subfield code="a">Do geese see God? -- Murder for a jar of red rum -- Some men interpret nine memos -- Never odd or even.</subfield></datafield> <datafield tag="650" ind1=" " ind2="0"><subfield code="a">Palindromes.</subfield></datafield> </record> </collection> 17
  • 18. MARC XML, Parent / Child <?xml version="1.0" encoding="UTF-8"?> <root xmlns="http://www.loc.gov/MARC21/slim"> <parent> <child>00491nam 22001571 4500</child> <child att="001">ovld000000001</child> … <child att="100" att="1" att=" "><child att="a">Mercer, Leigh,</child><child att="d">1893-1977. </child></child> <child att="245" att="1" att= “2"><child att="a">A man, a plan, a canal -- Panama!</child></child> … <child att="505" att=" " att=" "><child att="a">Do geese see God? -- Murder for a jar of red rum -- Some men interpret nine memos -- Never odd or even.</child></child> <child att="650" att=" " att="0"><child att="a">Palindromes.</child></child> </parent> </root> 18
  • 19. Methodology  Extracting data from CONTENTdm ◦ Stop updates to collection and make it read-only ◦ Make copy of desc.all metadata file for backup ◦ Run desc.all file through AC processing ◦ Replace desc.all file on CONTENTdm server ◦ Run the full collection index ◦ Remove read-only status from collection <creata>Ogburn, Joyce L.</creata> <title>Acquiring minds want to know: digital scholarship</title> <date>2003</date> <descri>A new form of scholarship has emerged in recent years named "digital scholarship.“</descri> <type>text;</type> <subjec>Born digital; Libraries; Electronic publishing</subjec> <subjea>Libraries and electronic publishing; Scholarly electronic publishing </subjea> 19
  • 20. Normalized Headings MARC 600 $a Smith, John, $d 1947 Apr. 16651 $a United States $x History $y 20th century. 600 $ SMITH, JOHN $ 1947 APR 16 651 $ UNITED STATES $ HISTORY $ 20TH CENTURY XML <subject>Smith, John, 1947 Apr. 16-; United States--History--20th century;</subject> <subject>SMITH, JOHN, 1947 APR 16; UNITED STATES--HISTORY--20TH CENTURY;</subject> 20
  • 21. MARC & XML, Normalized  MARC ◦ 651 $a United States $x History $y 20th century. ◦ 651 $ UNITED STATES $ HISTORY $ 20TH CENTURY  XML ◦ United States--History--20th century; ◦ 650 $a United States $x History $x 20th century ◦ 650 $ UNITED STATES $ HISTORY $ 20TH CENTURY 21
  • 22. MARC & XML Examples  MARC ◦ 600 $a Smith, John, $d 1947 Apr. 16◦ 651 $a United States $x History $y 20th century. ◦ 650 $a Navajo Indians $x History. ◦ 650 $a Religious ceremonies. ◦ 650 $a Religion.  XML ◦ Smith, John, 1947 Apr. 16-; United States--History--20th century; Navajo Indians--History; Religious ceremonies; Religion; 22
  • 23. MARC & XML Conversions  MARC ◦ 600 $a Smith, John, $d 1947 Apr. 16- ◦ 651 $a United States $x History $y 20th century. ◦ 650 $a Navajo Indians $x History. ◦ 650 $a Religious ceremonies. ◦ 650 $a Religion.  XML ◦ 600 $a Smith, John, $d 1947 Apr. 16◦ 650 $a United States $x History $x 20th century ◦ 650 $a Navajo Indians $x History ◦ 650 $a Religious ceremonies ◦ 650 $a Religion 23
  • 24. XML Authority Matching  XML ◦ 600 $a Smith, John, $d 1947 Apr. 16◦ 650 $a United States $x History $x 20th century ◦ 650 $a Navajo Indians $x History ◦ 650 $a Religious ceremonies ◦ 650 $a Religion  Authority ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ 150 $a Rites and ceremonies 450 $a Ceremonies 450 $a Cult 450 $a Cultus 450 $a Ecclesiastical rites and ceremonies 450 $a Religious ceremonies 450 $a Religious rites 450 $a Rites of passage 450 $a Traditions 24
  • 25. XML Authority Control  XML ◦ <subject>Smith, John, 1947 Apr. 16-; United States-History--20th century; Navajo Indians--History; Rites and ceremonies; Religion;</subject>  Variants ◦ <subject-keyword>Ceremonies; Cult; Cultus; Ecclesiastical rites and ceremonies; Religious ceremonies; Religious rites; Rites of passage; Traditions;</subject-keyword> 25
  • 26. MARC vs XML Tags  MARC ◦ 600, 610, 611, 630, 650, 651, 655  XML ◦ <subject>, <subjeca>, <subjec-1>, <subjecttitle>, <topics>, <entry>, <creata>, <s01>, <genre>, e tc 26
  • 29. CDM Before & After after title subject creator after title creator subjectkeyword before title creator subject after title creator subject keyword after title creator subject 29
  • 30. Other Issues Bad OCR data  <descr> or <transc> field elements  Errant angle brackets: < or >  CDATA workaround (pre-processing)  <transc>... immunofluorescence at sites of actin-membrane< f'' " • - t . "U . * • ?vl /&gt;/> . * .. O' ' .< .* V#•* * •• * r , . * . y1 •* ...</transc> • System tries to assign < … > as discrete child of transc 30
  • 31. Other Issues Bad OCR data  <descr> or <transc> field elements  Errant angle brackets: < or >  CDATA workaround (pre-processing)  • CDATA blocks effectively cancel out all data between them • CDATA = character data <![CDATA[<transc>... immunofluorescence at sites of actin-membrane< f'' " • - t . "U . * • ?vl /&gt;/> . * .. O' ' .< .* V#•* * •• * r , . * . y1 •* ...</transc>]]> 31
  • 32. Statistics for USpace Changed Records Matched Records Changed 35% Subjects 33% Records 6,308 Records 15,800 Matched Records Names 28% Records 22,306 32
  • 33. Standardization: Missing semicolons  Smith, John, 1932- United States-History --20th century;  Smith, John, 1932-; United States-History --20th century; 33
  • 34. Standardization: Subdivisions United States--History--20th century;  United States – History – 20th century;  United States-History-20th century;  United States-History--20th century;  34
  • 35. Standardization: Dates Improper  1980-3;  20130526;  May 26, 2013; Proper  1980; 1981; 1982; 1983;  2013-05-26; 35
  • 36. Reports Partially Matched Subject Headings Unmatched Names Split Subject Headings Updated Subject Headings 36
  • 37. Near Match Report Levenshtein-Distance Algorithm Near Matches 37
  • 38. Future Plans: Ongoing AC via Linked Data 38
  • 39. 39
  • 40. Future Plans: Ongoing AC via Linked Data <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> <madsrdf:PersonalName rdf:about="http://id.loc.gov/authorities/names/n88283447"> <rdf:type rdf:resource="http://www.loc.gov/mads/rdf/v1#Authority"/> <madsrdf:authoritativeLabel xml:lang="en">Ogburn, Joyce L. </madsrdf:authoritativeLabel> ... </madsrdf:PersonalName> </rdf:RDF> 40
  • 41. Linked Data: Cloud Diagram Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/ 41
  • 42. Future Plans: Legacy Collections 42
  • 43. Other XML File Structures Dublin Core ONIX record>recordData>dc:creator Product>Contributor>PersonNameInverted record>recordData>dc:subject Product>Contributor>PersonNameInverted>PersonDate>Date record>recordData>dc:identifier Product>Conference>ConferenceName Product>Conference>ConferenceDate Product>Conference>ConferencePlace EAD archdesc>did>repository>corpname archdesc>did>repository>controlaccess>persname Product>MainSubject>SubjectHeadingText Product>PlaceAsSubject Product>CorporateName archdesc>did>repository>controlaccess>subject archdesc>controlaccess>geogname XC xc:entity>xc:creator eadheader>eadid xc:entity>xc:subject archdesc>did>unitid xc:entity>xc:type xc:entity>xc: id… 43
  • 44. Questions Jeremy Myntti | jeremy.myntti@utah.edu Head of Cataloging & Metadata Services Nate Cothran | nate@bslw.com Vice President, Automation Services 44

Notes de l'éditeur

  1. At Backstage, we are very conversant with MARC. We’ve been dealing with it for over 25 years. It’s comfortable for us.So when Jeremy first approached us about this project, it wasn’t without some trepidation &amp; anxiety on our side. Performing authority control on MARC files is one thing, but attempting to do so on non-MARC files is a different order of complexity.This whole time I’ve been talking, we have a MARC record up on the screen. Chances are we’re not any closer to figuring out what it means than when we first started. But sometimes it helps to take a step back and look at things in a different light. We can peel back the layers and notice patterns that didn’t exist previously.
  2. In this case, we change the color of the existing code a little. Now we can make out the fields and data a little more clearly. It’s this kind of mentality we used when we approached this project with Jeremy. We led with the idea of what if what was desired was not as difficult as we first imagined it?
  3. Even with MARC, this is how we’re really used to seeing things, isn’t it? This layout makes much more sense to us as most of our utilities or tools make use of this display.
  4. MARC itself has been a perfect container over the years. MARC is like an old friend where you can stop in, have a cup a coffee, and start up a conversation. Then, in the middle of that conversation, you can get up and leave for 10 years. When you come back, you can pick up that old conversation right where you left off.We can do this because MARC hasn’t changed during those years. It’s the same format we’re used to seeing. More importantly, our systems and tools know what to expect with MARC format because it follows the same conventions it always has.I can create a MARC file, zip it up, attach it to an email, or give you a thumb-drive and it’s ready for you right then and there. You can take it from me, manipulate it, then send it back over and we’re both happy.
  5. MARC had its start in the 1960s and continued from there. By contrast, XML started in 1996. MARC is useful for print materials, while XML is good for online materials. In particular, RDA is more suited for online environments.I have a question for MARC, which is, whether or not it will continue to be relevant down the road. MARC has its own limitations built in, unfortunately. Records are limited to finite sizes, while XML is expandable, malleable, dynamic.
  6. Okay, let’s go back to this example. This is how we’re used to seeing MARC. But let’s transform this into a typical MARC XML file.
  7. Like the code for raw MARC, this code for MARC XML can look a little busy. Without taking that initial step back, we might feel a little lost in trying to interpret what we’re seeing.
  8. But when we highlight elements and attributes in certain ways, we start to see MARC-related data that we are already familiar with.
  9. XML is similar to HTML in that both contain elements and tags. XML is meant to carry data while HTML is meant to display that data. If you open up your favorite web browser, then navigate to a website you like, when you look at the underlying structure or source of that website, it is in HTML.XML, on the other hand, is more likely used to carry the data to the HTML page. XML is also invented according to whoever is writing or maintaining it.
  10. At the same time, there is no core set of tags that must always be used when describing common elements. There are suggestions in place, of course, but ultimately it is up to the organization or the individual generating the XML code. You could start your own XML schema today and have people using it tomorrow, whereas with MARC you are constrained by what already exists.
  11. XML has been described as being similar to a tree. A tree has roots, branches, and leaves. Leaves represent more details that you might use when describing certain aspects of your work.
  12. XML has also been described as similar to a family structure. Families are generally composed of parents and children. Sometimes children have siblings. Children also tend to have certain attributes which distinguish them from other children within the same family.
  13. Okay, let’s go back to our previous example with MARC XML. Again, we have these different types of elements highlighted in certain ways. But now, let’s take our knowledge of XML structures and apply that to this slide.
  14. Now we’ve changed the element and attribute names to be obvious. We have a root element at the top and at the bottom. We have a parent element and within that, we have several child elements. Most of these child elements also contain certain attributes which tell us more about the tags in question.With this albeit simplified example, we start to see how working with XML files are not as difficult as we initially thought. We were encouraged by this line of thinking, but then we realized that we needed some example files from Jeremy before we started feeling like we were on the right track.
  15. Looking just at a typical MARC heading and comparing that with a typical XML heading, we see immediate similarities. When we normalize the MARC headings, we remove the punctuation and change the case. And with the XML headings, we roughly do the same procedure.Of course, there are still differences between the two types of headings. But normalization helps us to further see how we may be able to treat these headings, though they are in different formats and formatting.
  16. Next, we take one of these headings and attempt to treat it like a MARC heading. We do this because our system expects headings to start out in MARC. All of our reports and processing is centered around MARC.The normalized MARC version is above. With the XML version, since we don’t know off-hand what the individual subdivisions are, we force them all to $x. This is not technically correct in the case of the chronological subdivision. But, ultimately, you can see that when we normalize the heading, it doesn’t matter. Now, with a very minor difference, the two normalized headings are nearly identical.
  17. Let’s look at some more headings. We have the MARC equivalents up top and the XML versions down below.You’ll notice that the XML versions tend to have all of the headings smushed together into the same element. In theCONTENTdm files that we used from Jeremy, this was typical. So our first step was to extricate each individual heading from the group and treat them individually.
  18. Then we again attempt to convert them to a MARC type field. Again, there are minor differences between the fields after doing this conversion step.You’ll note that I keep highlighting “Religious ceremonies”. I’m doing that because “Religious ceremonies” is actually an out-of-date subject heading.
  19. So “Religious ceremonies” has an authority match against LC. In this case, it matches against a see-reference in the 450 field, which programmatically points back to the 1xx field. The 1xx field in the authority record represents the most current and valid form of the heading.
  20. So when we update the original XML heading (in the proper place), we replace “Religious ceremonies” with “Rites and ceremonies”.At the same time, we don’t want to lose access to any of the other older forms of headings that may exist for that authority match. But CONTENTdm doesn’t allow us to use an extra file and programming to do so would be beyond the scope of libraries using this service.What we proposed to do, then, is the crux of this entire service represented here in this slide. We update the original headings as needed, while also adding in these see-references in a separate subject-keyword index. Now, when users try to find this particular digital record they will be led to it regardless of whether they know to type in “Rites and ceremonies” or “Cults” or “Religious ceremonies” et cetera. Suddenly, your digital repositories open up access by orders of magnitude.
  21. With MARC format, we’re used to seeing specific tags which correspond to specific types of headings. With XML, it’s not uncommon for all of these types of elements to show up in the same XML file. So this service becomes much more of a collaborative process when working between Backstage and the Library. It could be that Jeremy wants his “subject” elements searched a different way or against a different database than some of his other elements.
  22. We ran across other issues while developing this service. One of the first was that CONTENTdm XML files tended to be missing container elements. Container elements help to offset individual records within an XML file. So what we did was temporarily add these container elements into the record.
  23. Then, after processing was complete, we made sure to remove those temporary elements. We needed to make sure that the post-processed file’s elements matched the pre-processed file’s elements.
  24. On this slide, you’ll notice we have several circles. Each circle represents a version of a file. In the center is the original file received from Jeremy. Only one of the surrounding circles is actually correct and can be loaded back into CONTENTdm without throwing up an error.On the left, we have title, creator, and subject-keyword. But you’ll notice that our original file does not have subject-keyword. So our first lesson is that element names cannot change between processing.Up top, we have title, subject, and creator. This is better as we still have the same elements, though the last two are in different order. Our second lesson is that elements cannot change order during processing.On the right, we have title, creator, subject, and keyword. So, no element name changes or order changes, but now we have keyword that we’ve added since we want to take advantage of those see-references. The third lesson is that we cannot introduce new elements when they did not exist previously.The bottom example matches the original exactly. The lesson here is that if we want to make use of new elements, we just have to make sure to add those in prior to exporting the XML file. CONTENTdm doesn’t care about empty elements just as long as they still exist when re-loading the processed files.
  25. Some of the other problems we encountered included bad OCR data. I should point out that this was not an issue with CONTENTdm per se; depending on the OCR program or data you’re trying to OCR, you’ll periodically run into garbage translations regardless.For our purposes, our system would throw up an error during processing when it encountered a stray angle bracket it wasn’t expecting to find. So it would treat them as new elements and that’s when the problems would start.
  26. In order to get around this issue, we offset the OCR data between CDATA blocks. This is both good &amp; bad. Good because it ignores all of the data between the CDATA blocks so the system can proceed as expected. Bad because if you have anything between those CDATA blocks that you want processed, that’s not possible. The hope is that your OCR data doesn’t need authority control performed on it at this point, but we admit that this is a workaround for now.
  27. At Backstage, we’re continually trying to push the envelope of what to report out to clients. Our philosophy is leaving no stone unturned when it comes to reports. This slide shows a few different reports.We report out unmatched names as well those where part of the heading actually matched. Headings that were changed completely are reported, along with headings that were split out into multiple sets of headings.
  28. This particular report is our near match report. We utilize a method called the Levenshtein-Distance algorithm, which compares two strings and determines the confidence value between those two strings. The closer the strings are distance-wise in characters, the higher the confidence value. The hope with this particular report is to give our clients more confidence in deciding what needs to be corrected in their headings … as well as what can be safely ignored. Based on feedback from other clients, this report is usually very helpful with Name headings in particular.
  29. Finally, I just wanted to add that the vast majority of our conversation up to this point has concerned itself with CONTENTdm XML files. We are anxious to also explore other non-MARC data formats, such as those listed here.
  30. Thank you for your time today.