This document discusses authority control for digital collections. It explains that authority control provides vocabulary control through consistency, standardization, and an updating mechanism. It gives examples of inconsistencies in subject headings and names that authority control would help resolve. The document also discusses how authority control involves partnerships between institutions to share authority files and processing. It provides examples of raw MARC records and how they appear in different display formats like MARC and MARC XML. It also gives a brief history of MARC and discusses how XML defines elements.
Authority Control for Digital Collections Metadata
1. Authority Control for Digital Collections
October 18, 2013
Jeremy Myntti
Head of Cataloging & Metadata Services
Nate Cothran
Vice President, Automation Services
3. Inconsistency
Gosiute Indians
Goshute Indians
Navajo Indians
Navaho Indians
Salt Lake
Salt Lake City
Salt Lake City (Utah)
Bishop, Dail Stapley
Bishop, Dale Stapely
Bishop, Dale Stapley
Beckwith, Frank A. (1876-1951)
Beckwith, Frank Asahel (18761951)
Beckwith, Frank A.
Beckwith, Frank A. (1876-1951)
Beckwith, Frank Asahel (1876-1951)
Beckwith, Frank Asahel, 1876-1951
Hansen, Henrie
Hansen, Henry
Hansen, Henry Daniel, 1896-1979
7. MARC Display
001     ovld0123456789
…
100 1   $a Mercer, Leigh, $d 1893-1977.
245 12  $a A man, a plan, a canal – Panama!
…
505     $a Do geese see God? – Murder for a jar of red rum – Some men interpret nine memos – Never odd or even.
650  0  $a Palindromes
8. MARC as a Container
Portable
Entrenched
Tools
10. MARC Display
001     ovld0123456789
…
100 1   $a Mercer, Leigh, $d 1893-1977.
245 12  $a A man, a plan, a canal – Panama!
…
505     $a Do geese see God? – Murder for a jar of red rum – Some men interpret nine memos – Never odd or even.
650  0  $a Palindromes
11. MARC XML
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
<leader>00491nam 22001571 4500</leader>
<controlfield tag="001">ovld000000001</controlfield>
…
<datafield tag="100" ind1="1" ind2=" "><subfield code="a">Mercer, Leigh,</subfield>
<subfield code="d">1893-1977. </subfield></datafield>
<datafield tag="245" ind1="1" ind2="2"><subfield code="a">A man, a plan, a canal -- Panama!
</subfield></datafield>
…
<datafield tag="505" ind1=" " ind2=" "><subfield code="a">Do geese see God? -- Murder for a
jar of red rum -- Some men interpret nine memos -- Never odd or even.</subfield></datafield>
<datafield tag="650" ind1=" " ind2="0"><subfield code="a">Palindromes.</subfield></datafield>
</record>
</collection>
12. MARC XML
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
<leader>00491nam 22001571 4500</leader>
<controlfield tag="001">ovld000000001</controlfield>
…
<datafield tag="100" ind1="1" ind2=" "><subfield code="a">Mercer, Leigh,</subfield>
<subfield code="d">1893-1977. </subfield></datafield>
<datafield tag="245" ind1="1" ind2="2"><subfield code="a">A man, a plan, a canal -- Panama!
</subfield></datafield>
…
<datafield tag="505" ind1=" " ind2=" "><subfield code="a">Do geese see God? -- Murder for a
jar of red rum -- Some men interpret nine memos -- Never odd or even.</subfield></datafield>
<datafield tag="650" ind1=" " ind2="0"><subfield code="a">Palindromes.</subfield></datafield>
</record>
</collection>
13. XML Defined
XML stands for eXtensible Markup Language
XML is similar to HTML, but also different
XML was designed to carry data, not display data
XML tags are not pre-defined; you must define tags
XML is designed to be self-descriptive
XML is a W3C recommendation
14. XML Invented
There are no defined XML tags (title = <title>)
Tags in HTML are defined (<p> = paragraph)
Anyone can design their own XML schema
Common library-centric XML schemas include:
◦ MODS, MARC XML, EAD, Dublin Core, ONIX, XC
◦ METS Alto, TEI, BIBFRAME / RDF, CONTENTdm
◦ All of these either follow their own schema or borrow
elements from each other
17. MARC XML
<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
<leader>00491nam 22001571 4500</leader>
<controlfield tag="001">ovld000000001</controlfield>
…
<datafield tag="100" ind1="1" ind2=" "><subfield code="a">Mercer, Leigh,</subfield>
<subfield code="d">1893-1977. </subfield></datafield>
<datafield tag="245" ind1="1" ind2="2"><subfield code="a">A man, a plan, a canal -- Panama!
</subfield></datafield>
…
<datafield tag="505" ind1=" " ind2=" "><subfield code="a">Do geese see God? -- Murder for a
jar of red rum -- Some men interpret nine memos -- Never odd or even.</subfield></datafield>
<datafield tag="650" ind1=" " ind2="0"><subfield code="a">Palindromes.</subfield></datafield>
</record>
</collection>
18. MARC XML, Parent / Child
<?xml version="1.0" encoding="UTF-8"?>
<root xmlns="http://www.loc.gov/MARC21/slim">
<parent>
<child>00491nam 22001571 4500</child>
<child att="001">ovld000000001</child>
…
<child att="100" att="1" att=" "><child att="a">Mercer, Leigh,</child><child att="d">1893-1977.
</child></child>
<child att="245" att="1" att="2"><child att="a">A man, a plan, a canal -- Panama!</child></child>
…
<child att="505" att=" " att=" "><child att="a">Do geese see God? -- Murder for a jar of red rum
-- Some men interpret nine memos -- Never odd or even.</child></child>
<child att="650" att=" " att="0"><child att="a">Palindromes.</child></child>
</parent>
</root>
19. Methodology
Extracting data from CONTENTdm
◦ Stop updates to collection and make it read-only
◦ Make copy of desc.all metadata file for backup
◦ Run desc.all file through AC processing
◦ Replace desc.all file on CONTENTdm server
◦ Run the full collection index
◦ Remove read-only status from collection
<creata>Ogburn, Joyce L.</creata>
<title>Acquiring minds want to know: digital scholarship</title>
<date>2003</date>
<descri>A new form of scholarship has emerged in recent years named "digital scholarship."</descri>
<type>text;</type>
<subjec>Born digital; Libraries; Electronic publishing</subjec>
<subjea>Libraries and electronic publishing; Scholarly electronic publishing
</subjea>
20. Normalized Headings
MARC
600 $a Smith, John, $d 1947 Apr. 16-
651 $a United States $x History $y 20th century.
600 $ SMITH, JOHN $ 1947 APR 16
651 $ UNITED STATES $ HISTORY $ 20TH CENTURY
XML
<subject>Smith, John, 1947 Apr. 16-;
United States--History--20th century;</subject>
<subject>SMITH, JOHN, 1947 APR 16;
UNITED STATES--HISTORY--20TH CENTURY;</subject>
21. MARC & XML, Normalized
MARC
◦ 651 $a United States $x History $y 20th century.
◦ 651 $ UNITED STATES $ HISTORY $ 20TH CENTURY
XML
◦ United States--History--20th century;
◦ 650 $a United States $x History $x 20th century
◦ 650 $ UNITED STATES $ HISTORY $ 20TH CENTURY
22. MARC & XML Examples
MARC
◦ 600 $a Smith, John, $d 1947 Apr. 16-
◦ 651 $a United States $x History $y 20th century.
◦ 650 $a Navajo Indians $x History.
◦ 650 $a Religious ceremonies.
◦ 650 $a Religion.
XML
◦ Smith, John, 1947 Apr. 16-; United States--History--20th century;
Navajo Indians--History; Religious ceremonies; Religion;
23. MARC & XML Conversions
MARC
◦ 600 $a Smith, John, $d 1947 Apr. 16-
◦ 651 $a United States $x History $y 20th century.
◦ 650 $a Navajo Indians $x History.
◦ 650 $a Religious ceremonies.
◦ 650 $a Religion.
XML
◦ 600 $a Smith, John, $d 1947 Apr. 16-
◦ 650 $a United States $x History $x 20th century
◦ 650 $a Navajo Indians $x History
◦ 650 $a Religious ceremonies
◦ 650 $a Religion
24. XML Authority Matching
XML
◦ 600 $a Smith, John, $d 1947 Apr. 16-
◦ 650 $a United States $x History $x 20th century
◦ 650 $a Navajo Indians $x History
◦ 650 $a Religious ceremonies
◦ 650 $a Religion
Authority
◦ 150 $a Rites and ceremonies
◦ 450 $a Ceremonies
◦ 450 $a Cult
◦ 450 $a Cultus
◦ 450 $a Ecclesiastical rites and ceremonies
◦ 450 $a Religious ceremonies
◦ 450 $a Religious rites
◦ 450 $a Rites of passage
◦ 450 $a Traditions
25. XML Authority Control
XML
◦ <subject>Smith, John, 1947 Apr. 16-; United States--History--20th century; Navajo Indians--History; Rites and ceremonies; Religion;</subject>
Variants
◦ <subject-keyword>Ceremonies; Cult; Cultus;
Ecclesiastical rites and ceremonies; Religious
ceremonies; Religious rites; Rites of passage;
Traditions;</subject-keyword>
26. MARC vs XML Tags
MARC
◦ 600, 610, 611, 630, 650, 651, 655
XML
◦ <subject>, <subjeca>, <subjec-1>, <subjecttitle>, <topics>, <entry>, <creata>, <s01>, <genre>, etc.
29. CDM Before & After
before: title, creator, subject
after: title, subject, creator
after: title, creator, subject-keyword
after: title, creator, subject, keyword
after: title, creator, subject
30. Other Issues
Bad OCR data
<descr> or <transc> field elements
Errant angle brackets: < or >
CDATA workaround (pre-processing)
<transc>...
immunofluorescence at
sites of actin-membrane<
f'' " • - t . "U . * • ?vl />/>
. * .. O' ' .< .* V#•* * •• * r , .
* . y1 •*
...</transc>
• System tries to assign < … > as discrete child of transc
31. Other Issues
Bad OCR data
<descr> or <transc> field elements
Errant angle brackets: < or >
CDATA workaround (pre-processing)
• CDATA blocks effectively cancel out all data between them
• CDATA = character data
<![CDATA[<transc>...
immunofluorescence at
sites of actin-membrane<
f'' " • - t . "U . * • ?vl />/>
. * .. O' ' .< .* V#•* * •• * r , .
* . y1 •*
...</transc>]]>
32. Statistics for USpace
Changed Records
Matched Records
Changed
35%
Subjects
33%
Records
6,308
Records
15,800
Matched Records
Names
28%
Records
22,306
44. Questions
Jeremy Myntti | jeremy.myntti@utah.edu
Head of Cataloging & Metadata Services
Nate Cothran | nate@bslw.com
Vice President, Automation Services
Editor's Notes
At Backstage, we are very conversant with MARC. We’ve been dealing with it for over 25 years. It’s comfortable for us. So when Jeremy first approached us about this project, it wasn’t without some trepidation & anxiety on our side. Performing authority control on MARC files is one thing, but attempting to do so on non-MARC files is a different order of complexity. This whole time I’ve been talking, we have a MARC record up on the screen. Chances are we’re not any closer to figuring out what it means than when we first started. But sometimes it helps to take a step back and look at things in a different light. We can peel back the layers and notice patterns we hadn’t seen previously.
In this case, we change the color of the existing code a little. Now we can make out the fields and data a little more clearly. It’s this kind of mentality we used when we approached this project with Jeremy. We led with a question: what if what was desired was not as difficult as we first imagined?
Even with MARC, this is how we’re really used to seeing things, isn’t it? This layout makes much more sense to us as most of our utilities or tools make use of this display.
MARC itself has been a perfect container over the years. MARC is like an old friend where you can stop in, have a cup of coffee, and start up a conversation. Then, in the middle of that conversation, you can get up and leave for 10 years. When you come back, you can pick up that old conversation right where you left off. We can do this because MARC hasn’t changed during those years. It’s the same format we’re used to seeing. More importantly, our systems and tools know what to expect with MARC format because it follows the same conventions it always has. I can create a MARC file, zip it up, attach it to an email, or give you a thumb-drive and it’s ready for you right then and there. You can take it from me, manipulate it, then send it back over and we’re both happy.
MARC had its start in the 1960s and continued from there. By contrast, XML started in 1996. MARC is useful for print materials, while XML is good for online materials. In particular, RDA is more suited for online environments. My question for MARC is whether it will continue to be relevant down the road. MARC has its own limitations built in, unfortunately. Records are limited to finite sizes, while XML is expandable, malleable, dynamic.
Okay, let’s go back to this example. This is how we’re used to seeing MARC. But let’s transform this into a typical MARC XML file.
Like the code for raw MARC, this code for MARC XML can look a little busy. Without taking that initial step back, we might feel a little lost in trying to interpret what we’re seeing.
But when we highlight elements and attributes in certain ways, we start to see MARC-related data that we are already familiar with.
XML is similar to HTML in that both contain elements and tags. XML is meant to carry data while HTML is meant to display that data. If you open up your favorite web browser, then navigate to a website you like, when you look at the underlying structure or source of that website, it is in HTML. XML, on the other hand, is more likely used to carry the data to the HTML page. XML tags are also invented by whoever is writing or maintaining the file.
At the same time, there is no core set of tags that must always be used when describing common elements. There are suggestions in place, of course, but ultimately it is up to the organization or the individual generating the XML code. You could start your own XML schema today and have people using it tomorrow, whereas with MARC you are constrained by what already exists.
XML has been described as being similar to a tree. A tree has roots, branches, and leaves. Leaves represent more details that you might use when describing certain aspects of your work.
XML has also been described as similar to a family structure. Families are generally composed of parents and children. Sometimes children have siblings. Children also tend to have certain attributes which distinguish them from other children within the same family.
Okay, let’s go back to our previous example with MARC XML. Again, we have these different types of elements highlighted in certain ways. But now, let’s take our knowledge of XML structures and apply that to this slide.
Now we’ve changed the element and attribute names to be obvious. We have a root element at the top and at the bottom. We have a parent element and within that, we have several child elements. Most of these child elements also contain certain attributes which tell us more about the tags in question. With this albeit simplified example, we start to see how working with XML files is not as difficult as we initially thought. We were encouraged by this line of thinking, but then we realized that we needed some example files from Jeremy before we started feeling like we were on the right track.
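The root / parent / child reading maps directly onto how a parser walks a MARC XML file. Here is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` and a trimmed copy of the record from the slide; the function name `list_fields` is hypothetical and only meant for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the slide's <collection> element.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Trimmed copy of the slide's record (the elided "…" fields are left out).
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
<record>
<leader>00491nam 22001571 4500</leader>
<controlfield tag="001">ovld000000001</controlfield>
<datafield tag="100" ind1="1" ind2=" "><subfield code="a">Mercer, Leigh,</subfield>
<subfield code="d">1893-1977.</subfield></datafield>
<datafield tag="650" ind1=" " ind2="0"><subfield code="a">Palindromes.</subfield></datafield>
</record>
</collection>"""


def list_fields(xml_text):
    """Return (tag, indicators, [(code, value), ...]) for every datafield."""
    root = ET.fromstring(xml_text)                           # root: <collection>
    fields = []
    for record in root.findall("marc:record", NS):           # parent: <record>
        for field in record.findall("marc:datafield", NS):   # children
            subfields = [
                (sf.get("code"), (sf.text or "").strip())    # attribute + text
                for sf in field.findall("marc:subfield", NS)
            ]
            indicators = field.get("ind1") + field.get("ind2")
            fields.append((field.get("tag"), indicators, subfields))
    return fields


fields = list_fields(SAMPLE)
for tag, indicators, subfields in fields:
    print(tag, repr(indicators), subfields)
```

The attributes (`tag`, `ind1`, `ind2`, `code`) carry exactly the information that the tag, indicator, and subfield-code columns carry in the familiar MARC display.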
Looking just at a typical MARC heading and comparing that with a typical XML heading, we see immediate similarities. When we normalize the MARC headings, we remove the punctuation and change the case. And with the XML headings, we roughly do the same procedure. Of course, there are still differences between the two types of headings. But normalization helps us to further see how we may be able to treat these headings, though they are in different formats and formatting.
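The normalization step described here (drop punctuation, change case) can be sketched in a few lines. This is a simplified illustration in Python, not Backstage's actual rule set: the function name `normalize_heading` is hypothetical, and unlike the slide it strips every comma rather than retaining some of them.

```python
import re


def normalize_heading(heading):
    """Uppercase a heading and strip punctuation, keeping '--' subdivision breaks."""
    cleaned = []
    for part in heading.split("--"):
        part = re.sub(r"[^\w\s]", " ", part)      # punctuation becomes a space
        part = re.sub(r"\s+", " ", part).strip()  # collapse runs of whitespace
        cleaned.append(part.upper())
    return "--".join(cleaned)


print(normalize_heading("United States--History--20th century"))
print(normalize_heading("Smith, John, 1947 Apr. 16-"))
```

Splitting on `--` before cleaning is a design choice that keeps the subdivision boundaries visible after the punctuation is gone.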
Next, we take one of these headings and attempt to treat it like a MARC heading. We do this because our system expects headings to start out in MARC. All of our reports and processing are centered around MARC. The normalized MARC version is above. With the XML version, since we don’t know off-hand what the individual subdivisions are, we force them all to $x. This is not technically correct in the case of the chronological subdivision. But, ultimately, you can see that when we normalize the heading, it doesn’t matter. Now, with a very minor difference, the two normalized headings are nearly identical.
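To make the "force everything to $x" idea concrete, here is a rough Python sketch. Both function names are hypothetical and the normalization is deliberately crude; it only shows why the wrong subfield code stops mattering once the codes are dropped.

```python
import re


def xml_heading_to_marc(heading, tag="650"):
    """Turn 'A--B--C' into a MARC-style string, forcing every subdivision to $x."""
    text = heading.strip().rstrip(";")
    parts = [p.strip() for p in text.split("--") if p.strip()]
    subfields = ["$a " + parts[0]] + ["$x " + p for p in parts[1:]]
    return tag + " " + " ".join(subfields)


def normalize_marc(field):
    """Drop the tag, subfield codes, and punctuation, then uppercase."""
    _tag, rest = field.split(" ", 1)
    rest = re.sub(r"\$[a-z0-9]", "$", rest)   # $a, $x, $y all become a bare $
    rest = re.sub(r"[^\w\s$]", "", rest)      # strip remaining punctuation
    rest = re.sub(r"\s+", " ", rest).strip()
    return rest.upper()


from_marc = "651 $a United States $x History $y 20th century."
from_xml = xml_heading_to_marc("United States--History--20th century;")

print(from_xml)
print(normalize_marc(from_marc))
print(normalize_marc(from_xml))
```

In this sketch the tag is discarded before comparison, so the 651 versus 650 difference that remains on the slide disappears as well; a real matcher might prefer to keep it.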
Let’s look at some more headings. We have the MARC equivalents up top and the XML versions down below. You’ll notice that the XML versions tend to have all of the headings smushed together into the same element. In the CONTENTdm files that we used from Jeremy, this was typical. So our first step was to extricate each individual heading from the group and treat them individually.
Then we again attempt to convert them to a MARC type field. Again, there are minor differences between the fields after doing this conversion step. You’ll note that I keep highlighting “Religious ceremonies”. I’m doing that because “Religious ceremonies” is actually an out-of-date subject heading.
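Pulling the individual headings out of one combined element is a simple split on the semicolon delimiter seen in the slides. A minimal Python sketch, with `split_headings` as a hypothetical name; it assumes no heading contains a semicolon of its own.

```python
def split_headings(element_text):
    """Split a semicolon-delimited field value into individual headings."""
    return [h.strip() for h in element_text.split(";") if h.strip()]


subject_text = (
    "Smith, John, 1947 Apr. 16-; United States--History--20th century; "
    "Navajo Indians--History; Religious ceremonies; Religion;"
)

headings = split_headings(subject_text)
for heading in headings:
    print(heading)
```

The trailing semicolon produces an empty final piece, which the `if h.strip()` filter discards.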
So “Religious ceremonies” has an authority match against LC. In this case, it matches against a see-reference in the 450 field, which programmatically points back to the 1xx field. The 1xx field in the authority record represents the most current and valid form of the heading.
So when we update the original XML heading (in the proper place), we replace “Religious ceremonies” with “Rites and ceremonies”. At the same time, we don’t want to lose access to any of the other older forms of headings that may exist for that authority match. But CONTENTdm doesn’t allow us to use an extra file, and programming to do so would be beyond the scope of libraries using this service. What we proposed to do, then, is the crux of this entire service represented here in this slide. We update the original headings as needed, while also adding in these see-references in a separate subject-keyword index. Now, when users try to find this particular digital record they will be led to it regardless of whether they know to type in “Rites and ceremonies” or “Cult” or “Religious ceremonies” et cetera. Suddenly, your digital repositories open up access by orders of magnitude.
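The update-plus-keyword idea can be sketched with a tiny in-memory authority file. This Python example is an illustration only: the `AUTHORITY` dictionary holds just the one 150/450 record from the slide, matching is a plain case-insensitive lookup, and `apply_authority` is a hypothetical name rather than part of any real service.

```python
# One authority record from the slide: 150 (authorized) -> 450s (see-references).
AUTHORITY = {
    "Rites and ceremonies": [
        "Ceremonies", "Cult", "Cultus", "Ecclesiastical rites and ceremonies",
        "Religious ceremonies", "Religious rites", "Rites of passage", "Traditions",
    ],
}

# Reverse index: see-reference -> authorized heading.
SEE_REFS = {
    variant.lower(): authorized
    for authorized, variants in AUTHORITY.items()
    for variant in variants
}


def apply_authority(headings):
    """Replace matched see-references and collect all variants as keywords."""
    updated, keywords = [], []
    for heading in headings:
        authorized = SEE_REFS.get(heading.lower())
        if authorized is None:
            updated.append(heading)          # no match: leave untouched
            continue
        updated.append(authorized)           # swap in the current form
        for variant in AUTHORITY[authorized]:
            if variant not in keywords:      # keep every older form searchable
                keywords.append(variant)
    return updated, keywords


updated, keywords = apply_authority([
    "Smith, John, 1947 Apr. 16-",
    "United States--History--20th century",
    "Navajo Indians--History",
    "Religious ceremonies",
    "Religion",
])
print("<subject>" + "; ".join(updated) + ";</subject>")
print("<subject-keyword>" + "; ".join(keywords) + ";</subject-keyword>")
```

Note that the replaced form itself ("Religious ceremonies") stays in the keyword list, so a search on the old heading still finds the record.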
With MARC format, we’re used to seeing specific tags which correspond to specific types of headings. With XML, it’s not uncommon for all of these types of elements to show up in the same XML file. So this service becomes much more of a collaborative process when working between Backstage and the Library. It could be that Jeremy wants his “subject” elements searched a different way or against a different database than some of his other elements.
We ran across other issues while developing this service. One of the first was that CONTENTdm XML files tended to be missing container elements. Container elements help to offset individual records within an XML file. So what we did was temporarily add these container elements into the record.
Then, after processing was complete, we made sure to remove those temporary elements. We needed to make sure that the post-processed file’s elements matched the pre-processed file’s elements.
On this slide, you’ll notice we have several circles. Each circle represents a version of a file. In the center is the original file received from Jeremy. Only one of the surrounding circles is actually correct and can be loaded back into CONTENTdm without throwing up an error. On the left, we have title, creator, and subject-keyword. But you’ll notice that our original file does not have subject-keyword. So our first lesson is that element names cannot change between processing. Up top, we have title, subject, and creator. This is better as we still have the same elements, though the last two are in different order. Our second lesson is that elements cannot change order during processing. On the right, we have title, creator, subject, and keyword. So, no element name changes or order changes, but now we have keyword that we’ve added since we want to take advantage of those see-references. The third lesson is that we cannot introduce new elements when they did not exist previously. The bottom example matches the original exactly. The lesson here is that if we want to make use of new elements, we just have to make sure to add those in prior to exporting the XML file. CONTENTdm doesn’t care about empty elements just as long as they still exist when re-loading the processed files.
Some of the other problems we encountered included bad OCR data. I should point out that this was not an issue with CONTENTdm per se; depending on the OCR program or data you’re trying to OCR, you’ll periodically run into garbage translations regardless. For our purposes, our system would throw up an error during processing when it encountered a stray angle bracket it wasn’t expecting to find. So it would treat them as new elements and that’s when the problems would start.
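The three lessons (same names, same order, no new elements) amount to one check: the sequence of element names must be identical before and after processing. A small Python sketch of that check follows; the `<record>` wrapper and the function names are assumptions for illustration, not the actual CONTENTdm export layout.

```python
import xml.etree.ElementTree as ET


def element_names(record_xml):
    """Return the child element names of one record, in document order."""
    return [child.tag for child in ET.fromstring(record_xml)]


def safe_to_reload(before_xml, after_xml):
    """True only if names and order match exactly; element text may differ."""
    return element_names(before_xml) == element_names(after_xml)


before = "<record><title>T</title><creator>C</creator><subject>Religious ceremonies</subject></record>"
renamed = "<record><title>T</title><creator>C</creator><subject-keyword>Cult</subject-keyword></record>"
reordered = "<record><title>T</title><subject>Rites and ceremonies</subject><creator>C</creator></record>"
added = "<record><title>T</title><creator>C</creator><subject>Rites and ceremonies</subject><keyword>Cult</keyword></record>"
updated = "<record><title>T</title><creator>C</creator><subject>Rites and ceremonies</subject></record>"

for label, candidate in [("renamed", renamed), ("reordered", reordered),
                         ("added", added), ("updated", updated)]:
    print(label, safe_to_reload(before, candidate))
```

Under the same check, an export that already contains an empty `<keyword></keyword>` element would still pass after that element is filled in, which is the workaround described in the last lesson.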
In order to get around this issue, we offset the OCR data between CDATA blocks. This is both good & bad. Good because it ignores all of the data between the CDATA blocks so the system can proceed as expected. Bad because if you have anything between those CDATA blocks that you want processed, that’s not possible. The hope is that your OCR data doesn’t need authority control performed on it at this point, but we admit that this is a workaround for now.
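As a rough illustration of the pre-processing workaround, the Python sketch below wraps the text of a field in a CDATA section so that stray angle brackets no longer break parsing. It is a variation on the slide, which wraps the whole element: here the CDATA goes inside the tags so the element itself stays visible to the parser. The `<record>` wrapper, the sample text, and the name `wrap_in_cdata` are assumptions, and the regular expression would misfire if the OCR garbage happened to contain a literal closing tag for the field.

```python
import re
import xml.etree.ElementTree as ET


def wrap_in_cdata(text, element="transc"):
    """Wrap the content of every <element>...</element> in a CDATA section."""
    name = re.escape(element)
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(name), re.DOTALL)

    def repl(match):
        # A literal "]]>" would end the CDATA early, so split it across sections.
        body = match.group(1).replace("]]>", "]]]]><![CDATA[>")
        return "<{0}><![CDATA[{1}]]></{0}>".format(element, body)

    return pattern.sub(repl, text)


body = "sites of actin-membrane< f'' .< .* V# />/>"
raw = "<record><title>Cells</title><transc>" + body + "</transc></record>"

try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False           # the stray "<" makes the raw file unparseable

fixed = wrap_in_cdata(raw)
recovered = ET.fromstring(fixed).find("transc").text
print(parsed_raw, recovered == body)
```

The trade-off is the one named in the notes: everything inside the CDATA section is treated as plain character data, so nothing in it can be processed as markup.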
At Backstage, we’re continually trying to push the envelope of what to report out to clients. Our philosophy is leaving no stone unturned when it comes to reports. This slide shows a few different reports. We report out unmatched names as well as those where part of the heading actually matched. Headings that were changed completely are reported, along with headings that were split out into multiple sets of headings.
This particular report is our near match report. We utilize a method called the Levenshtein distance algorithm, which compares two strings and determines the confidence value between those two strings. The fewer character edits separating the two strings, the higher the confidence value. The hope with this particular report is to give our clients more confidence in deciding what needs to be corrected in their headings … as well as what can be safely ignored. Based on feedback from other clients, this report is usually very helpful with Name headings in particular.
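For readers unfamiliar with it, Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. The Python sketch below implements the standard dynamic-programming version; the `confidence` formula (one minus distance over the longer length) is an assumption for illustration, since the notes do not say how the confidence value is actually derived.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    previous = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                        # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,             # delete char_a
                current[j - 1] + 1,          # insert char_b
                previous[j - 1] + cost,      # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def confidence(a, b):
    """Hypothetical score in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


pairs = [
    ("Gosiute Indians", "Goshute Indians"),
    ("Bishop, Dale Stapely", "Bishop, Dale Stapley"),
    ("Hansen, Henrie", "Hansen, Henry"),
]
for left, right in pairs:
    print(left, "|", right, "|", levenshtein(left, right), round(confidence(left, right), 3))
```

The pairs come from the Inconsistency slide; note that plain Levenshtein counts a swapped pair of letters (as in "Stapely" and "Stapley") as two edits rather than one.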
Finally, I just wanted to add that the vast majority of our conversation up to this point has concerned itself with CONTENTdm XML files. We are anxious to also explore other non-MARC data formats, such as those listed here.