This document discusses best practices and lessons learned for data wrangling projects. It emphasizes starting projects by defining goals and intended outcomes. Common challenges discussed include inconsistent data encoding, missing fields, and unexpected data issues. The document provides tips for investigating data exports, mapping fields, and using tools like MarcEdit to clean data when problems arise. The overall message is that real-world library data is messy and may vary across systems.
2. You do what?
Liaising between tech services, project team,
and vendors on data manipulation and display
Skills:
− MARC and ILS data migration/manipulation
− Nitty Gritty details – hows and whys
− Knowledge sharing between partners
− Investigations and Implementations
− Project management
− Meeting management
6. Data driven? Start at the end!
What do you really want to know?
Do you have the data to answer that?
What are you going to do with the data?
What is interesting vs. what is actionable
Test out your theories!!
8. Data driven? Start at the end!
Comparisons across institutions – match points
Started with an OCLC reclamation project
            Records Sent   Returned Unresolved   Updated OCLC #
Ursus          2,100,299                13,232          171,474
Colby            474,438                   373           26,334
Bowdoin          624,164                                 37,848
Bates            656,926                                 25,101
TOTALS         3,855,827                13,605          260,757
9. Start at the end...if you're ordering out
Think about what you want to get back, make
sure it goes out.
HOW will you deal with returned data?
Can all the partners do the same things in terms
of processing?
10. Lists, lists, lists!
What will you in/exclude if you are extracting:
types: gov docs, serials, media, e-resources
locations: ref, off-site, reserve, special collections
status: billed, missing, suppressed, withdrawn (!)
use: circ, internal use, reserves
What constitutes a circulating copy?
How are the above encoded?
Can you get what you want?
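The include/exclude lists above can be written down as explicit sets before any extract is run. The following is a minimal sketch only: the code values and the `type`, `location`, and `status` keys are hypothetical placeholders, not the encoding of any particular ILS.

```python
# Hypothetical exclusion lists; the codes are illustrative, not from a real ILS.
EXCLUDED_TYPES = {"govdoc", "serial", "media", "eresource"}
EXCLUDED_LOCATIONS = {"ref", "offsite", "reserve", "special"}
EXCLUDED_STATUSES = {"billed", "missing", "suppressed", "withdrawn"}


def include_item(item):
    """Return True if the item passes every exclusion list."""
    return (
        item.get("type") not in EXCLUDED_TYPES
        and item.get("location") not in EXCLUDED_LOCATIONS
        and item.get("status") not in EXCLUDED_STATUSES
    )


def extract(items):
    """Split items into (kept, dropped) so the exclusions can be reviewed."""
    kept, dropped = [], []
    for item in items:
        (kept if include_item(item) else dropped).append(item)
    return kept, dropped


sample = [
    {"barcode": "A1", "type": "book", "location": "stacks", "status": "available"},
    {"barcode": "A2", "type": "book", "location": "stacks", "status": "withdrawn"},
    {"barcode": "A3", "type": "serial", "location": "stacks", "status": "available"},
]
kept, dropped = extract(sample)
print([i["barcode"] for i in kept])     # ['A1']
print([i["barcode"] for i in dropped])  # ['A2', 'A3']
```

Keeping the dropped items, rather than discarding them, makes it easier to check with partners that each list excluded what everyone expected.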
11. Circ Data
How long has it been retained?
Any tech processing that included circing?
Has it ever been cleared?
(… and what does it really tell you ...)
12. Know your vendor / programmer
What exactly is going to happen to the data,
and what will be in(ex)cluded?
Leader bib level m , s
Gov Doc? (008 / 28) ?
Printed material? Media?
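To make the two checks on this slide concrete, here is a small sketch that reads the bibliographic level from Leader position 07 and the code at 008 position 28. It assumes the leader and 008 are already available as plain strings; the function names are hypothetical. Note that 008/28 carries the government publication code for books and serials, but the same position has a different meaning for some other material types (music, for example), so the result should be read alongside the record type.

```python
def bib_level(leader):
    """Return Leader/07 (bibliographic level), e.g. 'm' monograph, 's' serial."""
    return leader[7] if len(leader) > 7 else None


def gov_pub_code(field_008):
    """Return the 008/28 code, or None when it is blank, '|', or missing.

    Assumption: a blank or fill character means "not coded as a gov doc".
    """
    if len(field_008) <= 28:
        return None
    code = field_008[28]
    return None if code in (" ", "|") else code


monograph_leader = "00000nam a2200000 a 4500"
serial_leader = "00000nas a2200000 a 4500"
federal_008 = " " * 28 + "f" + " " * 11   # 40 characters, 'f' at position 28
plain_008 = " " * 40

print(bib_level(monograph_leader))  # m
print(bib_level(serial_leader))     # s
print(gov_pub_code(federal_008))    # f
print(gov_pub_code(plain_008))      # None
```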
14. Can you get it out?
Export Tables
What exactly is exported
What do they do with weird data? (b b, b 930)
Do they add any data? v.v.29, OCLC prefix
Formats of dates
17. … a few of the ugly things we saw...
Multiple fields used for internal use (INTL
USE, COPY USE, and IUSE3)
Records with multiple 001s
Records with multiple barcodes, duplicate
barcodes, bound with items
Barcodes in 949 not 'b'
Records with no 260
3 0000003 ocm3 3_
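Several of the oddities listed on this slide can be caught with a quick audit pass before data is sent out. The sketch below assumes a deliberately simplified, hypothetical record shape (a dict with an `id`, a list of `(tag, value)` fields, and a list of barcodes); real exports will need their own parsing first.

```python
from collections import Counter


def audit(records):
    """Return a list of (record id, problem) pairs for common oddities."""
    problems = []
    # Count each barcode across the whole file to find duplicates.
    barcode_counts = Counter(b for r in records for b in r["barcodes"])
    for r in records:
        tags = [tag for tag, _ in r["fields"]]
        n001 = tags.count("001")
        if n001 == 0:
            problems.append((r["id"], "no 001"))
        elif n001 > 1:
            problems.append((r["id"], "multiple 001s"))
        if "260" not in tags:
            problems.append((r["id"], "no 260"))
        if len(r["barcodes"]) > 1:
            problems.append((r["id"], "multiple barcodes"))
        for b in r["barcodes"]:
            if barcode_counts[b] > 1:
                problems.append((r["id"], "duplicate barcode " + b))
    return problems


sample = [
    {"id": "r1", "fields": [("001", "ocm1"), ("260", "Portland")], "barcodes": ["B1"]},
    {"id": "r2", "fields": [("001", "ocm2"), ("001", "ocm3")], "barcodes": ["B1", "B2"]},
]
for rec_id, problem in audit(sample):
    print(rec_id, problem)
```

A duplicate barcode is not always an error (bound-with items are one legitimate cause), so the output is a review list rather than a list of records to fix.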
18. Your data through different lenses
Points of departure:
-Merged 001s
-FRBR
-Volume vs Title counts
-Unique vs Holdings counts
-Date of data used
-Definition of public domain
20. One more reason to thank Terry Reese
-- Return the 001 (control number) of every record with a 945 $f value greater than 0
SELECT T0xx.field_data
FROM T0xx, T9xx
WHERE T9xx.field = '945'
AND T9xx.subfield = 'f'
AND T9xx.field_data > 0
AND T0xx.cid = T9xx.cid
AND T0xx.field = '001'
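To try the idea behind this query without a full export, it can be run against a tiny in-memory SQLite database. This is a sketch under assumptions: the table and column names come from the query on the slide, but the table definitions and sample rows are invented for illustration, and the query is a variant that uses an explicit JOIN and casts the text value to an integer before comparing it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only the names used by the slide's query are known.
conn.executescript("""
    CREATE TABLE T0xx (cid INTEGER, field TEXT, subfield TEXT, field_data TEXT);
    CREATE TABLE T9xx (cid INTEGER, field TEXT, subfield TEXT, field_data TEXT);
""")
conn.executemany("INSERT INTO T0xx VALUES (?, ?, ?, ?)", [
    (1, "001", "", "ocm00000001"),
    (2, "001", "", "ocm00000002"),
    (3, "001", "", "ocm00000003"),
])
conn.executemany("INSERT INTO T9xx VALUES (?, ?, ?, ?)", [
    (1, "945", "f", "12"),  # matches: 945 $f greater than 0
    (2, "945", "f", "0"),   # excluded: value is 0
    (3, "945", "g", "7"),   # excluded: wrong subfield
])

QUERY = """
    SELECT T0xx.field_data
    FROM T0xx
    JOIN T9xx ON T0xx.cid = T9xx.cid
    WHERE T0xx.field = '001'
      AND T9xx.field = '945'
      AND T9xx.subfield = 'f'
      AND CAST(T9xx.field_data AS INTEGER) > 0
"""
control_numbers = [row[0] for row in conn.execute(QUERY)]
print(control_numbers)  # ['ocm00000001']
```

The explicit cast is a cautious choice: field data exported from MARC is text, and comparing text to a number behaves differently from one database engine to another.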
21. Data Wrangling: MSCS Side
Closing Haiku:
Data is messy
While it can be normalized
Nothing is perfect
Editor's notes
Easy to say “we want detailed subject analysis and title lists” but if you don't have the staff time to review, does this really matter? Try to have a clear picture BEFORE starting the project. (Data can go stale … interest vs actionable data)
Can you get what you want in a way that is meaningful to the vendor / programmer?
Do you have enough for it to have value? (Some question whether it has value at all.) Did it get checked out to processing? Another example is getting lists of barcodes into a review file: we ran into this where odd internal use data was in different fields. Do you really want to rely on it? That 1980s WordPerfect manual vs. Portuguese poetry.
You've decided what you want and you've pulled all your data … and? Do you know how it's going to be processed?
Variations in cataloging practices over time and space. Lots of oddities: no 260, no 001, multiple 001s …
Internal Use Circ in different field – different catalog
Sent data to three different places (again, document what went where!)
Data is messy / Nothing is ever perfect / Please do not despair