1. Measuring library catalogs
ELAG 2018
Péter Király
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Card catalog at Gent University Library, photo: Pieter Morlion, 2010 CC-BY 4.0
https://commons.wikimedia.org/wiki/File:Boekentoren_2010PM_1179_21H9015.JPG
2. Part I. Introduction to MARC
❏ MAchine-Readable Cataloging
❏ format and semantic specification
❏ comes from the age of punch cards, hence the information compression
❏ invented in the 1960s
❏ even the lapidary “MARC must die” article* celebrated its 15th anniversary
last year, but MARC is still alive
❏ “There are only two kinds of people who believe themselves able to read a
MARC record without referring to a stack of manuals: a handful of our top
catalogers and those on serious drugs.”
* by Roy Tennant http://lj.libraryjournal.com/2002/10/ljarchives/marc-must-die/
3. an example
LEADER 01136cnm a2200253ui 4500
001 002032820
005 20150224114135.0
008 031117s2003 gw 000 0 ger d
020 $a3805909810
100 1 $avon Staudinger, Julius,$d1836-1902$0(viaf)14846766
245 10$aJ. von Staudingers Kommentar zum ... /$cJ. von Staudinger.
250 $aNeubearb. 2003$bvon Jörn Eckert
260 $aBerlin :$bSellier-de Gruyter,$c2003.
300 $a534 p. ;.
500 $aCiteertitel: BGB.
500 $aBandtitel: Staudinger BGB.
700 1 $aEckert, Jörn
852 4 $xRE$bRE55$cRBIB$jRBIB.BUR 011 DE 021$p000000800147
4. Positional fields - Leader
00928nam a2200265 c 4500
position  00-04 05 06 07 08 09 10 11 12-16 17 18 19 20 21 22 23
value     00928 n  a  m  #  a  2  2  00265 #  c  #  4  5  0  0
(# = blank)
❏ LDR/0-4 Record length: ‘00928’ - a number padded with zeros (max. value: 99999)
❏ LDR/5 Record status: ‘n’ - a dictionary term meaning “new”
❏ LDR/6 Type of record: ‘a’ - a dictionary term meaning “Language material”
❏ LDR/7 Bibliographic level: ‘m’ - a dictionary term meaning “Monograph/Item”
❏ ...
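The positional reading above can be sketched in a few lines of Python. This is only an illustration, not the validator's code: the function name `parse_leader` and the dictionary keys are assumptions, and only the four elements named on the slide are decoded.

```python
def parse_leader(leader: str) -> dict:
    """Split a 24-character MARC21 leader into the elements listed above."""
    if len(leader) != 24:
        raise ValueError(f"leader must be 24 characters, got {len(leader)}")
    return {
        "record_length": int(leader[0:5]),   # LDR/0-4, zero-padded number
        "record_status": leader[5],          # LDR/5, dictionary term
        "type_of_record": leader[6],         # LDR/6, dictionary term
        "bibliographic_level": leader[7],    # LDR/7, dictionary term
    }


parsed = parse_leader("00928nam a2200265 c 4500")
print(parsed)
```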
5. Record type
Type of record (Leader/06) | Bibliographic level (Leader/07) | Record type
a                          | a or c or d or m                | Books
a                          | b or i or s                     | Continuing Resources
t                          | any                             | Books
c or d or i or j           | any                             | Music
e or f                     | any                             | Maps
g or k or o or r           | any                             | Visual Materials
m                          | any                             | Computer Files
p                          | any                             | Mixed Materials
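The table translates directly into a lookup. The sketch below assumes a hypothetical function name, `record_type`, and returns `None` when the Leader values match no row; it is not taken from the tool.

```python
# Types decided by Leader/06 alone, as listed in the table.
BY_TYPE_OF_RECORD = {
    "t": "Books",
    "c": "Music", "d": "Music", "i": "Music", "j": "Music",
    "e": "Maps", "f": "Maps",
    "g": "Visual Materials", "k": "Visual Materials",
    "o": "Visual Materials", "r": "Visual Materials",
    "m": "Computer Files",
    "p": "Mixed Materials",
}


def record_type(type_of_record: str, bibliographic_level: str):
    """Return the record type for Leader/06 and Leader/07, or None if unknown."""
    if type_of_record == "a":
        # For 'a' the bibliographic level decides between two types.
        if bibliographic_level in {"a", "c", "d", "m"}:
            return "Books"
        if bibliographic_level in {"b", "i", "s"}:
            return "Continuing Resources"
        return None
    return BY_TYPE_OF_RECORD.get(type_of_record)


print(record_type("a", "m"))
print(record_type("n", "m"))
```

A combination such as 'n' and 'm' matches no row, so the type-specific part of field 008 cannot be interpreted without some fallback.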
6. Positional fields - 008
‘801003s1958 ja 000 0 jpn ‘
0 1 2 3
012345 6 7890 1234 567 8901 23 4 5 67 8 9 0 1 2 34 567 8 9
‘801003|s|1958| |ja | | |#| |##|0|0|#|0|#|0 |jpn| | ‘
008/00-17: common for all types, part I
008/18-34: type specific part
008/35-39: common for all types, part II
7. Positional fields - 008
‘801003s1958 ja 000 0 jpn ‘
0         1         2         3
0123456789012345678901234567890123456789
aaaaaabccccddddeee                 fffgh  All materials
                  IIIIjkLLLLmnop-qr       Books
                  ij-klmnOOOpq---rs       Continuing Resources
                  iijklmNNNNNNOO-p-       Music
                  IIIIjj-k--lm-n-OO       Maps
                  Iii-j-----kl---mn       Visual Materials
                  ----ij--k-l------       Computer Files
                  -----i-----------       Mixed Materials
lower case = distinct units
upper case = repeatable units
- = undefined position
The type-specific part (positions 18-34) depends on the record type, which is calculated from Leader values.
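A minimal Python sketch of cutting field 008 into its common and type-specific parts follows. The function name `split_008` and the key names are assumptions. The sample value is rebuilt from the slide's example and padded to 40 characters, so its exact spacing is an assumption too.

```python
def split_008(value: str) -> dict:
    """Cut a 40-character 008 value into common parts and the type-specific part."""
    if len(value) != 40:
        raise ValueError(f"008 must be 40 characters, got {len(value)}")
    return {
        "date_entered": value[0:6],           # 008/00-05
        "type_of_date": value[6],             # 008/06
        "date1": value[7:11],                 # 008/07-10
        "date2": value[11:15],                # 008/11-14
        "place_of_publication": value[15:18], # 008/15-17
        "type_specific": value[18:35],        # 008/18-34, meaning depends on record type
        "language": value[35:38],             # 008/35-37
        "modified_record": value[38],         # 008/38
        "cataloging_source": value[39],       # 008/39
    }


# Assumed reconstruction of the slide's example, built piece by piece.
sample = "801003" + "s" + "1958" + "    " + "ja " + " " * 11 + "000 0 " + "jpn" + "  "
fields = split_008(sample)
print(fields["date1"], fields["language"])
```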
8. Datafields
repeatable/non-repeatable
❏ Indicator1, Indicator2: always a 1-character dictionary term
❏ Subfield1, ... , Subfieldn
  ❏ code
  ❏ value
    ❏ free text
    ❏ dictionary term
    ❏ fixed format (e.g. yymmdd)
    ❏ fixed format + dictionary terms (d7i2)
    ❏ fixed positions + dictionary terms
  ❏ repeatable/non-repeatable
9. Versions
❏ Changes of the standard
❏ No versioning
❏ New, deleted and changed elements every year
❏ Localized versions
❏ Introducing new fields
❏ Overwriting existing fields
❏ Mixing localized versions
❏ Records carry no indication of which localization they follow
❏ 50+ localizations (international, national, consortial)
10. Handling versions (020, ISBN)
setSubfieldsWithCardinality(
"a", "International Standard Book Number", "NR",
"c", "Terms of availability", "NR",
"q", "Qualifying information", "R",
...
);
setHistoricalSubfields(
"b", "Binding information (BK, MP, MU) [OBSOLETE]"
);
putVersionSpecificSubfields(MarcVersion.DNB, Arrays.asList(
new SubfieldDefinition("9", "ISBN mit Bindestrichen", "R")
));
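The idea behind the Java definitions above, core subfields plus version-specific additions, can be sketched in Python. All names here (`CORE_SUBFIELDS`, `VERSION_SPECIFIC`, `lookup_subfield`) are hypothetical, and letting the local definition win over the core one is an assumption, not the tool's documented behaviour.

```python
# Subfields of field 020 as given on the slide: code -> (label, cardinality).
CORE_SUBFIELDS = {
    "a": ("International Standard Book Number", "NR"),
    "c": ("Terms of availability", "NR"),
    "q": ("Qualifying information", "R"),
}

# Additions defined only by a given localization.
VERSION_SPECIFIC = {
    "DNB": {"9": ("ISBN mit Bindestrichen", "R")},
}


def lookup_subfield(code: str, marc_version: str = "MARC21"):
    """Return (label, cardinality) for a 020 subfield, or None if undefined."""
    local = VERSION_SPECIFIC.get(marc_version, {})
    if code in local:
        return local[code]  # assumption: the localized definition takes precedence
    return CORE_SUBFIELDS.get(code)


print(lookup_subfield("9"))
print(lookup_subfield("9", "DNB"))
```

With this shape, subfield '9' is undefined under plain MARC21 but accepted when the DNB version is selected.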
11. Addressing elements - MARCspec
XML: XPath﹣W3C standard
JSON: JSONPath﹣by Stefan Gössner (http://goessner.net/articles/JsonPath/)
MARC: MARCspec﹣by Carsten Klee (Zeitschriftendatenbank, Berlin)
❏ 260﹣field
❏ 245^2﹣the second indicator of a field
❏ 700[0]﹣the first instance of a field
❏ 245$c﹣a subfield
❏ 245$b{007/0=a|007/0=t}﹣subfield ‘b’ of field ‘245’, if the character at
position ‘0’ of field 007 equals ‘a’ OR ‘t’.
❏ 020$c{$q=paperback}﹣subfield ‘c’ if subfield ‘q’ equals ‘paperback’.
http://marcspec.github.io/MARCspec/marc-spec.html
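To make the simplest forms above concrete, here is a small Python sketch that splits them with a regular expression. It is not a MARCspec implementation: it handles only a tag with an optional instance index and either an indicator or a subfield, and it rejects conditions in braces. The name `parse_simple_marcspec` is hypothetical.

```python
import re

SIMPLE_SPEC = re.compile(
    r"^(?P<tag>[0-9A-Za-z.]{3})"        # field tag, e.g. 245
    r"(?:\[(?P<index>\d+)\])?"          # optional instance, e.g. [0]
    r"(?:\^(?P<indicator>[12])"         # either an indicator, e.g. ^2
    r"|\$(?P<subfield>[0-9a-z]))?$"     # or a subfield, e.g. $c
)


def parse_simple_marcspec(spec: str) -> dict:
    """Parse the basic forms only; raise ValueError for anything else."""
    match = SIMPLE_SPEC.match(spec)
    if match is None:
        raise ValueError(f"unsupported spec: {spec}")
    return {key: value for key, value in match.groupdict().items() if value is not None}


for example in ["260", "245^2", "700[0]", "245$c"]:
    print(example, parse_simple_marcspec(example))
```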
12. Part II.
record validation
and quality assurance
Boekentoren UGent - de belvedère, photo: Michiel Hendryckx, 2013, CC-BY-SA 3.0
https://commons.wikimedia.org/wiki/File:Boekentoren_ugent_belvedere_675.jpg
13. preparation
cd ~/Prog
mkdir marc
cd marc
wget https://www.loc.gov/cds/downloads/MDSConnect/BooksAll.2014.part01.utf8.gz
gunzip BooksAll.2014.part01.utf8.gz
mv BooksAll.2014.part01.utf8 loc01.mrc
cd ../metadata-qa-marc/target
# optional:
wget https://github.com/pkiraly/metadata-qa-marc/releases/download/v0.2/metadata-qa-marc-0.2-SNAPSHOT-jar-with-dependencies.jar
14. validating individual records
> ./validator ../marc/loc01.mrc
> less validation-report.txt
Error in ' 00000057 ':
082$ind1: obsolete value ' ' (https://www.loc.gov/marc/bibliographic/bd082.html)
Error in ' 00000119 ':
700$ind1: invalid value '2' (https://www.loc.gov/marc/bibliographic/bd700.html)
Error in ' 00000234 ':
082$ind1: obsolete value ' ' (https://www.loc.gov/marc/bibliographic/bd082.html)
Errors in ' 00000294 ':
050$ind2: obsolete value ' ' (https://www.loc.gov/marc/bibliographic/bd050.html)
260$ind1: obsolete value '0' (https://www.loc.gov/marc/bibliographic/bd260.html)
710$ind2: invalid value '0' (https://www.loc.gov/marc/bibliographic/bd710.html)
710$ind2: invalid value '0' (https://www.loc.gov/marc/bibliographic/bd710.html)
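The distinction between "obsolete" and "invalid" in the report can be illustrated with a small Python sketch. The rule table is a hand-copied excerpt for two indicators and should be checked against the Library of Congress pages linked in the report; the names `INDICATOR_RULES` and `check_indicator` are hypothetical, and this is not the validator's code.

```python
# (tag, indicator number) -> currently valid and obsolete values (illustrative excerpt).
INDICATOR_RULES = {
    ("082", 1): {"valid": {"0", "1", "7"}, "obsolete": {" "}},
    ("260", 1): {"valid": {" ", "2", "3"}, "obsolete": {"0", "1"}},
}


def check_indicator(tag: str, number: int, value: str):
    """Return an error message in the report's style, or None if there is nothing to report."""
    rule = INDICATOR_RULES.get((tag, number))
    if rule is None or value in rule["valid"]:
        return None  # no rule in this excerpt, or the value is fine
    kind = "obsolete" if value in rule["obsolete"] else "invalid"
    return f"{tag}$ind{number}: {kind} value '{value}'"


print(check_indicator("082", 1, " "))
print(check_indicator("260", 1, "0"))
```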
15. summary of errors
> ./validator --summary ../marc/loc01.mrc
> less validation-report.txt
006/01-04 (tag006book01): contains invalid code 'n' in ' n '
(https://www.loc.gov/marc/bibliographic/bd006.html) (1 times)
...
020$a: invalid ISBN '052179296X' is not a valid ISBN 10 value'
(https://en.wikipedia.org/wiki/International_Standard_Book_Number) (1 times)
...
045$a: invalid lengh 'v v': length is not 4 char' (https://www.loc.gov/marc/bibliographic/bd045.html)
(1 times)
…
740$ind2: obsolete value '1' (https://www.loc.gov/marc/bibliographic/bd740.html) (22 times)
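The ISBN message above comes from a check-digit test. A minimal Python sketch of the standard ISBN-10 rule follows (weights 10 down to 1, sum divisible by 11, 'X' standing for 10 in the last position); the function name is an assumption and the tool's own check may differ in detail.

```python
def is_valid_isbn10(isbn: str) -> bool:
    """Check the ISBN-10 check digit; hyphens are ignored."""
    chars = isbn.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            digit = 10                      # 'X' is allowed only as the check digit
        elif char in "0123456789":
            digit = int(char)
        else:
            return False
        total += (10 - position) * digit    # weights 10, 9, ..., 1
    return total % 11 == 0


print(is_valid_isbn10("052179296X"))  # the value flagged in the report
print(is_valid_isbn10("3805909810"))  # the ISBN of the example record
```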
16. Specifying the MARC version
> ./validator --marcVersion "GENT" [file]
Currently supported versions:
★ MARC21 -- Library of Congress MARC21
★ DNB -- Deutsche Nationalbibliothek's MARC version
★ OCLC -- OCLCMARC
★ GENT -- Gent University (Belgium)
★ SZTE -- Szegedi Tudományegyetem (Hungary)
★ FENNICA -- the Fennica catalog of the National Library of Finland
17. output format
> ./validator --format "tab-separated" ../marc/loc01.mrc
Options:
★ text
★ tab-separated
★ comma-separated
★ json*
* in progress
18. default record type
SEVERE: Error with record '002066968'. Leader/06 (typeOfRecord): 'n',
Leader/07 (bibliographicLevel): 'm'
> ./validator --defaultRecordType "BOOKS" ../marc/loc01.mrc
★ BOOKS
★ CONTINUING_RESOURCES
★ MUSIC
★ MAPS
★ VISUAL_MATERIALS
★ COMPUTER_FILES
★ MIXED_MATERIALS
22. processing a subset of records
process records 1-1000
> ./validator --limit 1000 ../marc/loc01.mrc
process records 5001-6000
> ./validator --offset 5000 --limit 1000 ../marc/loc01.mrc
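The offset/limit semantics can be sketched with `itertools.islice` in Python. The function name `select_records` is hypothetical and the integer range stands in for a stream of records; the tool's own implementation may differ.

```python
from itertools import islice


def select_records(records, offset=0, limit=None):
    """Skip `offset` records, then yield at most `limit` records."""
    stop = None if limit is None else offset + limit
    return islice(records, offset, stop)


record_numbers = range(1, 10001)  # stand-in for records numbered from 1
subset = list(select_records(record_numbers, offset=5000, limit=1000))
print(subset[0], subset[-1], len(subset))
```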
23. Fix ALEPHSEQ placeholder '^'
ALEPH exports contain '^' characters instead of spaces in control fields (006, 007,
008). This flag replaces them with spaces before validation.
./validator --fixAlephseq [file]
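What the flag does can be sketched as follows in Python. The function name `fix_alephseq` and the sample value are illustrative assumptions; only the three control fields named above are touched.

```python
CONTROL_FIELDS = {"006", "007", "008"}


def fix_alephseq(tag: str, value: str) -> str:
    """Replace the '^' placeholder with a space, but only in control fields."""
    if tag in CONTROL_FIELDS:
        return value.replace("^", " ")
    return value


print(repr(fix_alephseq("008", "801003s1958^^^^ja^")))
print(repr(fix_alephseq("245", "a^b")))
```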
24. viewing/selecting records
Displaying the record with a given ID
> ./formatter --id "002032820" ../marc/loc01.mrc
Displaying the Nth record
> ./formatter --countNr 1 ../marc/loc01.mrc
Displaying records matching a query
> ./formatter --search '260$c=1899.' ../marc/loc01.mrc
Retrieving given elements
> ./formatter --selector '245$c' ../marc/loc01.mrc
25. extract given elements
> ./formatter --selector '245$c' ../marc/loc01.mrc
By S. H. Aurand.
by Charles E. Chadman.
> ./formatter --selector '245$c' --withId ../marc/loc01.mrc
00000002 By S. H. Aurand.
00000004 by Charles E. Chadman.
> ./formatter --selector '245$a;245$c' --withId ../marc/loc01.mrc
00000002 1899. By S. H. Aurand.
00000004 1899. by Charles E. Chadman.
26. calculating Thompson-Traill completeness
Thompson and Traill (2017): Leveraging Python to improve ebook metadata selection, ingest, and management
(Code4Lib Journal 38, http://journal.code4lib.org/articles/12828)
27. calculating Thompson-Traill completeness
./tt-completeness ../marc/loc01.mrc
options: limit, offset, fileName, nolog
> less tt-completeness.csv
id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total
"010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4
"01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5
"010004483",0,0,1,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,5
"010018883",0,0,0,0,1,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,6
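In the four sample rows the `total` column equals the sum of the component scores. The Python sketch below re-checks that on the sample output; it only reads the CSV shown on the slide and does not reproduce the scoring rules themselves. The function name `check_totals` is hypothetical.

```python
import csv
import io

REPORT = '''id,ISBN,Authors,Alternative Titles,Edition,Contributors,Series,TOC,Date 008,Date 26X,LC/NLM,LoC,Mesh,Fast,GND,Other,Online,Language of Resource,Country of Publication,noLanguageOrEnglish,RDA,total
"010002197",0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,4
"01000288X",0,0,1,0,0,1,0,1,2,0,0,0,0,0,0,0,0,0,0,0,5
"010004483",0,0,1,0,0,0,0,1,2,0,0,0,0,0,0,0,1,0,0,0,5
"010018883",0,0,0,0,1,0,0,1,2,0,0,0,0,0,0,0,1,1,0,0,6
'''


def check_totals(report_text: str) -> dict:
    """Map each record id to (reported total, sum of the component columns)."""
    results = {}
    for row in csv.DictReader(io.StringIO(report_text)):
        record_id = row.pop("id")
        total = int(row.pop("total"))
        component_sum = sum(int(value) for value in row.values())
        results[record_id] = (total, component_sum)
    return results


results = check_totals(REPORT)
for record_id, (total, component_sum) in results.items():
    print(record_id, total, component_sum)
```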
28. K-means clustering
Spark (Scala)
❏ increasing the number of clusters decreases the distance from the centroids
❏ after a point this gain is not so big (“elbow effect”) -- in theory
Chart annotations:
❏ a big number of low quality records
❏ small clusters with ‘in between’ quality records
❏ the acceptable average
❏ clusters with good quality records
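The elbow effect can be shown on a toy example. The sketch below is plain Python, not the Spark (Scala) job mentioned on the slide: it runs a simple one-dimensional Lloyd's algorithm with a deterministic start, and the scores are made-up numbers chosen to form three obvious groups.

```python
def kmeans_1d(values, k, iterations=20):
    """Simple k-means on numbers; returns (centroids, within-cluster sum of squares)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids over the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for value in values:
            nearest = min(range(k), key=lambda i: (value - centroids[i]) ** 2)
            clusters[nearest].append(value)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    cost = sum(min((value - c) ** 2 for c in centroids) for value in values)
    return centroids, cost


# Hypothetical completeness scores, not taken from any catalog.
scores = [4, 5, 5, 6, 12, 13, 13, 14, 20, 21, 22, 22]
costs = [kmeans_1d(scores, k)[1] for k in range(1, 5)]
for k, cost in enumerate(costs, start=1):
    print(k, round(cost, 2))
```

On this toy data the cost drops sharply up to three clusters and barely changes with a fourth, which is the "elbow"; real catalog data rarely gives such a clean picture.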
30. Indexing with Solr
"marc-tags" format
"100a_ss": "Jung-Baek, Myong Ja",
"100ind1_ss": "Surname",
"245c_ss": "Vorgelegt von Myong Ja Jung-Baek."
"human-readable" format
"MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja",
"MainPersonalName_type_ss": "Surname",
"Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
"mixed" format
"100a_MainPersonalName_personalName_ss": "Jung-Baek, Myong Ja",
"100ind1_MainPersonalName_type_ss": "Surname",
"245c_Title_responsibilityStatement_ss": "Vorgelegt von Myong Ja Jung-Baek."
How to name the fields?
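The three naming schemes differ only in which parts are joined. A minimal Python sketch, with a hypothetical function name and parameter names, and with the `_ss` suffix copied from the examples:

```python
def solr_field_name(tag, code, field_label, subfield_label, style="mixed", suffix="_ss"):
    """Build a Solr field name in one of the three styles shown on the slide."""
    marc_part = f"{tag}{code}"                      # e.g. 100a or 100ind1
    human_part = f"{field_label}_{subfield_label}"  # e.g. MainPersonalName_personalName
    if style == "marc-tags":
        name = marc_part
    elif style == "human-readable":
        name = human_part
    elif style == "mixed":
        name = f"{marc_part}_{human_part}"
    else:
        raise ValueError(f"unknown style: {style}")
    return name + suffix


for style in ("marc-tags", "human-readable", "mixed"):
    print(solr_field_name("100", "a", "MainPersonalName", "personalName", style))
```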
33. Finding problems with facets
Vandenhoeck und Ruprecht
Vandenhoeck & Ruprecht
Vandenhoeck u. Ruprecht
Vandenhoeck
Vandenhoek & Ruprecht
Vandenhoek und Ruprecht
Bandenhoed und Ruprecht
Vandenhoeck et Ruprecht
Vandenhoeck & Reprecht
Vandenhoed und Ruprecht
V&R unipress
V&R Unipress
V & R Unipress
V & R unipress
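Some of the variants in the facet list differ only in case, spacing or the connecting word. A Python sketch of a normalization key that groups those follows; the function name `facet_key` and the connector list are assumptions, and spelling errors such as "Vandenhoek" are deliberately left untouched, since they would need fuzzy matching.

```python
import re
from collections import defaultdict

# Connecting words seen in the facet values, all mapped to '&'.
CONNECTOR = re.compile(r"\s*(?:&|\bund\b|\bu\.|\bet\b)\s*")


def facet_key(name: str) -> str:
    """Lower-case the name and unify the connector and the spacing around it."""
    return CONNECTOR.sub("&", name.strip().lower())


FACET_VALUES = [
    "Vandenhoeck und Ruprecht", "Vandenhoeck & Ruprecht", "Vandenhoeck u. Ruprecht",
    "Vandenhoeck", "Vandenhoek & Ruprecht", "Vandenhoek und Ruprecht",
    "Bandenhoed und Ruprecht", "Vandenhoeck et Ruprecht", "Vandenhoeck & Reprecht",
    "Vandenhoed und Ruprecht", "V&R unipress", "V&R Unipress",
    "V & R Unipress", "V & R unipress",
]

groups = defaultdict(list)
for value in FACET_VALUES:
    groups[facet_key(value)].append(value)

for key, members in groups.items():
    print(key, len(members))
```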
39. available catalogs to measure
❏ Library of Congress
❏ Harvard University Library
❏ Columbia University Library
❏ Deutsche Nationalbibliothek
❏ Universiteitsbibliotheek Gent
❏ Bibliotheksservice-Zentrum Baden-Württemberg
❏ Bibliotheksverbund Bayern
❏ University of Michigan Library
❏ Toronto Public Library
❏ Leibniz-Informationszentrum Technik und Naturwiss. Universitätsbibliothek (TIB)
❏ Répertoire International des Sources Musicales
❏ ETH-Bibliothek (Swiss Federal Institute of Technology in Zurich)
❏ British Library
❏ Talis
https://github.com/pkiraly/metadata-qa-marc#datasources
40. Authority entries
Responsibility statement:
Herr Seele (tekeningen); Toon Coussement (foto's); Peter Claes, Kris Coremans
en Hera Van Sande, vakgroep architectuur en stedenbouw Universiteit Gent
(vormgeving).
Authority entries:
❏ Herr Seele
❏ Coussement, Toon
❏ Claes, Peter
❏ Van Sande, Hera
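A comparison like the one above can be sketched in Python: turn each authority entry into display order and see which names from the responsibility statement have no match. The name list is extracted from the statement by hand, and the function name `display_form` is hypothetical; real matching would need to be far more tolerant.

```python
def display_form(authority_name: str) -> str:
    """Turn 'Surname, Forename' into 'Forename Surname'; leave other forms as they are."""
    if "," in authority_name:
        surname, forename = [part.strip() for part in authority_name.split(",", 1)]
        return f"{forename} {surname}"
    return authority_name.strip()


# Names read off the responsibility statement by hand.
statement_names = [
    "Herr Seele", "Toon Coussement", "Peter Claes", "Kris Coremans", "Hera Van Sande",
]
authority_entries = ["Herr Seele", "Coussement, Toon", "Claes, Peter", "Van Sande, Hera"]

covered = {display_form(entry) for entry in authority_entries}
missing = [name for name in statement_names if name not in covered]
print(missing)
```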
41. everything else
… at least regarding this project
https://github.com/pkiraly/metadata-qa-marc
https://twitter.com/kiru
peter.kiraly@gwdg.de