A North Carolina Connecting to Collections (C2C) workshop co-taught by Audra Eagle Yun (WFU), Nicholas Graham (UNC), and Lisa Gregory (State Archives of NC). This workshop took place on June 13, 2011 in Wilson, NC.
1. Preparing for a Digitization Project Wilson County Public Library June 13, 2011 Nicholas Graham Lisa Gregory Audra Eagle Yun
2. Agenda
Welcome and Introductions
About Connecting to Collections
10:00 – 10:45 Planning for a Digital Project
  Selecting and Evaluating Materials
  Copyright
10:45 – 11:15 Digitization
  Equipment and Expertise
  Standards and Guidelines
11:15 – 12:00 Description
  Evaluating Metadata Needs
  Metadata Standards and Controlled Vocabularies
  Creating a Data Dictionary
12:00 – 1:00 Lunch
1:00 – 1:30 Digital Publishing
  Free and Cheap Options
  Open Source and Homegrown Options
  CONTENTdm
3. Agenda, continued
1:30 – 2:00 Digital Preservation
  Long-term Care for your Digital Files
2:00 – 2:30 North Carolina Digital Heritage Center
  Services Offered by the NC Digital Heritage Center
  How to Develop a Project with the Digital Heritage Center
2:30 – 3:00 Questions and Discussion
29. Metadata Standards: Examples
Name | Focus | Description
DDI | Archiving and Social Science | Data Documentation Initiative is an international effort to establish a standard for technical documentation describing social science data. A membership-based Alliance is developing the DDI specification, which is written in XML.
EAD | Archives | Encoded Archival Description - a standard for encoding archival finding aids using XML in archival and manuscript repositories.
CDWA | Arts and Museums | Categories for the Description of Works of Art is a conceptual framework for describing and accessing information about works of art, architecture, and other material culture.
VRA Core | Arts & Museums | Visual Resources Association – the standard provides a categorical organization for the description of works of visual culture as well as the images that document them.
Darwin Core | Biology | Darwin Core is a metadata specification for information about the geographic occurrence of species and the existence of specimens in collections.
TEI | Humanities, social sciences & linguistics | Text Encoding Initiative - a standard for the representation of texts in digital form, chiefly in the humanities, social sciences and linguistics.
NISO MIX | Images | Z39.87 Data dictionary - technical metadata for digital still images (MIX) - NISO Metadata for Images in XML is an XML schema for a set of technical data elements required to manage digital image collections.
MARC | Librarianship | MARC - MAchine Readable Cataloging - standards for the representation and communication of bibliographic and related information in machine-readable form.
METS | Librarianship | Metadata Encoding and Transmission Standard - an XML schema for encoding descriptive, administrative, and structural metadata regarding objects within a digital library.
MODS | Librarianship | Metadata Object Description Schema - a schema for a bibliographic element set that may be used for a variety of purposes, and particularly for library applications.
XOBIS | Librarianship | XML Organic Bibliographic Information Schema - an XML schema for modeling MARC data.
MPEG-7 | Multimedia | MPEG-7 is an ISO/IEC standard developed by the Moving Picture Experts Group; it specifies a set of descriptors for various types of multimedia information.
Dublin Core | Networked resources | Dublin Core - interoperable online metadata standard focused on networked resources.
34. Thesauri: Examples
Thesaurus of Geographic Names (TGN)
  Constantinople (USE İstanbul)
  İstanbul (USE FOR Constantinople)
Performing arts
  BT: arts (broad discipline)
  NT: dance
Art & Architecture Thesaurus (AAT)
39. Assigning Metadata
“New York & bridges from Brooklyn”
c. 1913
gelatin silver photographic print
9.5x34”
http://www.flickr.com/photos/library_of_congress/4484597234/
How do we describe the format?
54. Example Metadata Record
Title: Jim Graham with bull, Meadows Domino 66
Creator: Fern, Douglas M.
Date: 1969
Subject: Fairs; Livestock shows; Politicians; Automobiles; Hay; Graham, James A., 1921-;
Description: Jim Graham, Superintendent of Beef Stock at the North Carolina State Fair stands with the 16 month-old Meadows Domino 66. The bull was sold by J. Horton Doughton of Doughton Meadows Farm, Laurel Springs, N.C., to Mr. and Mrs. A.W. Fanjoy of Joy Acres Farm, Statesville, N.C., for $10,000. It won every class ever shown in except one and received second place there, and was the Grand Champion at the N.C. State Fair. At the time, it was the highest priced bull ever sold in N.C. Weight: 1450 lbs.
Time Period [Coverage]: 20th century
Location [Coverage]: Raleigh, N.C. (Wake County)
Format: image/jpeg; 721 KB
Type: Image
Rights: This image may be under copyright. Please contact INSTITUTION NAME for permission to reproduce.
Identifier: agcoll_17.11.189.jpg
Capture Date [Date]: 2011-06-05
Capture Tools: Epson Expression 10000XL;
Metadata Creator: Gregory, Lisa
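A record like this is often exchanged as XML. The following is a minimal sketch, assuming simple (unqualified) Dublin Core and using only a subset of the fields above. The `record` wrapper element and the function names are made up for illustration and are not part of the workshop materials; local labels such as "Time Period" and "Location" are both mapped to `coverage`, as the bracketed names on the slide suggest.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 simple Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# A subset of the example record, as (Dublin Core element, value) pairs.
# Repeatable elements such as subject and coverage simply appear more than once.
record = [
    ("title", "Jim Graham with bull, Meadows Domino 66"),
    ("creator", "Fern, Douglas M."),
    ("date", "1969"),
    ("subject", "Fairs"),
    ("subject", "Livestock shows"),
    ("coverage", "20th century"),
    ("coverage", "Raleigh, N.C. (Wake County)"),
    ("format", "image/jpeg"),
    ("type", "Image"),
    ("identifier", "agcoll_17.11.189.jpg"),
]


def build_record(fields):
    """Build an XML element tree with one dc:* child per (name, value) pair."""
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


def to_xml(fields):
    """Serialize the record to an XML string."""
    return ET.tostring(build_record(fields), encoding="unicode")


if __name__ == "__main__":
    print(to_xml(record))
```

Keeping the fields as a list of pairs rather than a dictionary is a deliberate choice here: it lets an element repeat, which the two Coverage lines in the record require.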
As curators of collections, we know the importance of description. The same thing applies online, perhaps even more so, given the dominance of Google and increased real or perceived search savvy.
Coverage = geographic, temporal
Organized – A-Z Index + retrieve – faster and more relevant Preferred Limited scope Specific domain
Two thesauri examples that show some of these principles in action.
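The relationships in those two examples (USE / USE FOR, and BT / NT) can be sketched as simple lookups. This is only an illustration built from the terms on the slide, assuming "dance" has "performing arts" as its broader term; the dictionary and function names are made up, and real thesauri such as TGN and AAT are far richer than this.

```python
# Entry terms point to their preferred term (USE).
# The reverse direction (USE FOR) is derived from the same table.
USE = {"Constantinople": "İstanbul"}

# Each term points to its broader term (BT).
# Narrower terms (NT) are derived by inverting these links.
BROADER = {
    "performing arts": "arts (broad discipline)",
    "dance": "performing arts",
}


def preferred(term):
    """Return the preferred term, or the term itself if it is already preferred."""
    return USE.get(term, term)


def use_for(term):
    """Return the entry terms that redirect to this preferred term."""
    return sorted(entry for entry, pref in USE.items() if pref == term)


def narrower(term):
    """Return the terms whose broader term is this one."""
    return sorted(t for t, broad in BROADER.items() if broad == term)


if __name__ == "__main__":
    print(preferred("Constantinople"))
    print(use_for("İstanbul"))
    print(BROADER["performing arts"], narrower("performing arts"))
```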
LOC partnered with Flickr in 2008 with the idea of a “Commons,” putting up over 3100 photos and encouraging users to tag them. The idea – they’ve got sooooo much stuff, and people want access to it online, yet doing full description by trained professionals is the big bottleneck in the system. So why not leverage the crowd? Here’s what they got.
They’ve been continuing to add a lot of content to Flickr, and here’s an example. I’ve pulled a picture, and on the left is the LOC metadata. On the right are the user-designated tags. Highlighted are the ones that overlap – you can see there aren’t many. Call attention to: “deltacounty” (one word), “needle in a haystack” (idioms), “tyre” (alternate spellings)
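One way to see why so few tags overlap is to try matching them by machine. The sketch below is an assumption-laden illustration, not how LOC or Flickr compares tags: it lowercases terms, strips spaces and punctuation, and applies a tiny hand-made spelling table. The curated terms in the usage example are invented for the demonstration; only the three user tags come from the notes above.

```python
import re

# Assumed, hand-made table of alternate spellings (illustrative only).
VARIANTS = {"tyre": "tire"}


def normalize(tag):
    """Lowercase a term, drop spaces and punctuation, and apply known spelling variants."""
    key = re.sub(r"[^a-z0-9]", "", tag.lower())
    return VARIANTS.get(key, key)


def overlap(curated_terms, user_tags):
    """Return the user tags that match a curated term after normalization."""
    curated_keys = {normalize(term) for term in curated_terms}
    return sorted(tag for tag in user_tags if normalize(tag) in curated_keys)


if __name__ == "__main__":
    curated = ["Delta County", "tire"]  # hypothetical curated terms
    tags = ["deltacounty", "tyre", "needle in a haystack"]
    print(overlap(curated, tags))
```

Run-together words and spelling variants can be reconciled this way, but an idiom like “needle in a haystack” has no curated counterpart to match, which is the kind of gap a controlled vocabulary is meant to close.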
Pretend we’re with a public library, which probably has to consider a pretty broad audience. Plus, this is material that might be used by kids, but isn’t specifically for kids. But we don’t want anything with acronyms – want terms that can be used by laypeople. Let’s go to a controlled vocabulary source, the Getty Research Institute’s Art & Architecture Thesaurus. Popular for describing visual materials. First, search using what we already know. Gelatin silver photographic print.
That’s the first term. What else?
It’s a wide photo – panorama?
City – starting to get into subject, but I’ll try it anyway
Cityscapes? Yes or no – judgment call. The other thing I wanted to point out is that it’s often helpful to browse your CV, especially if you’re not familiar with the topic. If you click on the little triangle of boxes in the AAT, you can do that.
So once I click on Visual works “Guide Term,” you can browse through the list. I see “photographs” and “photographs by form: color” which brings me to another applicable format term I could use – black-and-white photographs. (Guide term – a term used to collocate like concepts, but shouldn’t be applied within a CV)
So here’s what we’ve come up with.
Digitization provides greater access to materials, which may lead to the decision to preserve those files
HOWEVER
Digitization creates new digital objects that themselves require preservation
Digitization creates metadata that requires preservation
The original painting (in this case the Mona Lisa)
The digitized image
The metadata
The reality is that, at this point, we are much more adept at preserving analog objects like paintings and paper. That painting is over 500 years old. We can only dream that our digital files will last that long.
Again, lots of federal funding was directed towards the project. Do you have that kind of support? I know that we don’t. And, we’re lucky enough to have multiple staff...
Think about it up front.
Concerns regarding file formats include what media they’re saved on, and what software was used to create them.
5 ¼” floppies – drives aren’t available
WordStar – software isn’t compatible with current operating systems
When we no longer have the software to read them, we need to move them to an alternate format. This could result in data loss or changes in presentation, or it may simply be impossible.
Proprietary file formats also pose a challenge. To handle files over time, we need to be able to read and possibly manipulate them to make sure they remain readable. If the company that originally created the file format goes out of business without documenting it, the files can be incredibly difficult to read. Open formats are preferable, because their specifications are publicly available.
So these are best practices for file formats: We all know that the State has made Microsoft products the standard, and in fact they’re the standard in general. However their file formats are proprietary, which can cause a preservation challenge. We can’t tell people not to use the tools they’re provided, but we can ask them…. Keep the original too (or better yet, send it to us)
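Before asking anyone to change how they save files, it helps to know what is actually on the drive. The sketch below counts files by extension and flags the ones on a watch list. The watch list is an assumption made for illustration (a few older Microsoft Office extensions), not an official list from the State or from this workshop, and the function names are made up.

```python
from collections import Counter
from pathlib import Path

# Assumed, illustrative watch list of proprietary formats; adjust to local policy.
PROPRIETARY = {".doc", ".xls", ".ppt"}


def inventory(root):
    """Count the files under root by lowercase extension."""
    return Counter(
        path.suffix.lower() for path in Path(root).rglob("*") if path.is_file()
    )


def flag_proprietary(counts):
    """Return only the extensions that appear on the watch list."""
    return {ext: n for ext, n in counts.items() if ext in PROPRIETARY}


if __name__ == "__main__":
    counts = inventory(".")  # scan the current directory as an example
    for ext, n in sorted(flag_proprietary(counts).items()):
        print(f"{ext}: {n} file(s) to review")
```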
Something near and dear to your hearts. The next issue we’d like to talk about is context, because files without context are adrift… Impress upon people the importance of keeping information about their files with their files. Metadata items – this may seem burdensome to people. Again, reinforce giving them to us. If we don’t know about it, it limits its usefulness. File names – intelligent identifiers are best demonstrated by an example…
Some dirty laundry – I discovered this folder out on the K drive earlier this week. This is work that someone did – and took a lot of care in doing – that we currently can’t use. We may be able to find the original object, but we won’t have any info on how the items were created or whether or not they’ve been manipulated. It’ll take more time to piece it together than it probably took to originally digitize.
.txt file in same directory or database that refers to file location SORTING!
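Both points can be sketched in a few lines: a .txt file written next to each image so the context travels with the file, and zero-padded sequence numbers so file names sort in the intended order. The naming pattern, the field names, and the function names here are hypothetical, chosen only for illustration; the capture details reuse the values from the example metadata record.

```python
from pathlib import Path


def padded_name(prefix, sequence, ext, width=4):
    """Build a file name with a zero-padded sequence number so names sort correctly."""
    return f"{prefix}_{sequence:0{width}d}{ext}"


def write_sidecar(image_path, info):
    """Write a plain-text file beside the image, one 'Field: value' line per entry."""
    sidecar = Path(image_path).with_suffix(".txt")
    lines = [f"{field}: {value}" for field, value in info.items()]
    sidecar.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return sidecar


if __name__ == "__main__":
    # Without padding, "scan_10.jpg" sorts before "scan_2.jpg".
    print(sorted(["scan_10.jpg", "scan_2.jpg", "scan_1.jpg"]))
    print(sorted(padded_name("scan", n, ".jpg") for n in (10, 2, 1)))

    sidecar = write_sidecar(
        padded_name("scan", 1, ".jpg"),
        {
            "Capture Date": "2011-06-05",
            "Capture Tools": "Epson Expression 10000XL",
            "Metadata Creator": "Gregory, Lisa",
        },
    )
    print(sidecar.read_text(encoding="utf-8"))
```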
special constraints Keeping access to a minimum to avoid accidental loss Something people always mention – especially pertinent b/c of older buildings a lot of agencies reside in. Much more prevalent is staff turnover – people don’t consider how to handle files in period of transition, often until employee is gone.