Overview of the role of metadata in resource description, resource discovery and website faceting. The presentation discusses metadata consistency, granularity and types (descriptive, administrative and structural), with emphasis on technical and preservation metadata. It introduces the Dublin Core element set as well as other popular metadata schemas and their applications. It also outlines the benefits of metadata reuse and the significant role of the metadata application profile in structuring, normalizing and disambiguating metadata and making it consistent and interoperable. Additionally, it points out the significance of controlled vocabularies and their role in disambiguating words, controlling synonyms and keeping collections consistent. Finally, it introduces types of controlled vocabularies and their applications, followed by examples of inconsistency and redundancy that arise when applying metadata in a large-scale digitization approach.
4. 3 | 8
Types
● Descriptive
● Administrative
○ Technical
○ Preservation
○ Rights
● Structural
Helpful resources
Caplan, Priscilla. "Understanding PREMIS." Washington DC, USA: Library of Congress, 2009.
Miller, Steven J. Metadata for digital collections: a how-to-do-it manual. New York, NY: Neal-Schuman Publishers, 2011.
Metadata
5. 4 | 8
Reusing metadata
● Best practice for large-scale digitization
● Saves time and resources
● Consistency among systems
6. 5 | 8
Metadata application profile
Overview
● Outlines metadata elements
● Sets encoding scheme
● Defines element occurrence
● Defines element obligation
● Collection specific vs. global
Benefits
● Promotes consistency
● Enforces the same description rules for all digital objects
● Provides indexing guidelines
● Provides examples
● Supports interoperability
7. 6 | 8
Dublin Core
The Simple DC element set
1. Title
2. Creator
3. Subject
4. Description
5. Publisher
6. Contributor
7. Date
8. Type
9. Format
10. Identifier
11. Source
12. Language
13. Relation
14. Coverage
15. Rights
http://www.dublincore.org/specifications/dublin-core/dces/
8. 7 | 8
Controlled vocabularies
Benefits
● Improve resource discovery
● Allow grouping of similar objects
● Support search functionality
● Support browse functionality
● Support faceting
● Promote consistency
Role
● Disambiguate concepts
● Provide synonym control
9. 8 | 8
Types
● Lists
● Synonym rings
● Authority files
● Thesauri
● Subject heading lists
Issues
● Inconsistency
● Redundancy
Examples
● AAT: Art and Architecture Thesaurus
● FAST: Faceted Application of Subject Terminology
● TGM: Thesaurus for Graphic Materials
● LCSH: Library of Congress Subject Headings
● MeSH: Medical Subject Headings
● TGN: Getty Thesaurus of Geographic Names
● ULAN: Union List of Artist Names
Editor's notes
[MARINA]
Overview
So what is metadata? We hear this term in the context of digital collections. But what does it really mean?
Metadata is the description we give to any object in the repository. Metadata for digital objects is like MARC records for the books in the library catalog. It describes the resource and supports easy resource discovery when users search by subject or any other searchable field (for example personal or corporate names).
Moreover, metadata enables a very important website feature known as faceting. Facets are the filtering options presented as a side menu on the website. Faceting improves search precision because users interact directly with the metadata to refine their search results.
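To make faceting concrete, here is a minimal Python sketch. It is not how any particular repository platform implements facets; the records, field names and helper functions (`facet_counts`, `refine`) are hypothetical and only illustrate the idea of counting field values and filtering on one of them.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative only.
records = [
    {"title": "Neon sign at night", "type": "photograph",
     "subject": ["Neon signs", "Hotels"]},
    {"title": "Letter to the board", "type": "manuscript",
     "subject": ["Railroads"]},
    {"title": "Hotel entrance", "type": "photograph",
     "subject": ["Hotels"]},
]


def _as_list(value):
    """Treat a single string the same way as a one-item list."""
    return [value] if isinstance(value, str) else list(value)


def facet_counts(records, field):
    """Count how many records carry each value of a metadata field."""
    counts = Counter()
    for record in records:
        # A set keeps a record from being counted twice for a repeated value.
        counts.update(set(_as_list(record.get(field, []))))
    return counts


def refine(records, field, value):
    """Keep only the records whose field contains the chosen facet value."""
    return [r for r in records if value in _as_list(r.get(field, []))]


# The side menu would show these counts; clicking a value calls refine().
photographs = refine(records, "type", "photograph")
subject_menu = facet_counts(photographs, "subject")
```

The counts are only as good as the values behind them: if one record says "Hotels" and another says "Hotel", they land in two separate facet entries, which is why consistent metadata matters for this feature.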
Finally, we should briefly mention consistency. Metadata can be very powerful in promoting consistency and supporting interoperability across collections and systems. We achieve consistency by having a robust application profile, normalizing certain fields and using controlled vocabularies to describe the resources. We’ll get into the details of controlled vocabularies in a moment.
[MARINA]
Granularity
Before beginning any metadata work, we need to consider the level of resource description in the collection; this is a very important decision point. Metadata can be very basic and concise, or it can be as granular as we want. The questions are what resources we have (staff, legacy metadata), what the timeline is, what type of project it is and what format the materials are in. For example, a large-scale manuscript project such as the Union Pacific or the Entertainment collection may be a good candidate for the large-scale approach, while a photographic digital exhibit with high research value may benefit from more granularity and an item-level approach.
The material format also affects granularity. For example, newspapers need very basic metadata (usually publication related: date, volume, issue number, title), whereas graphic materials usually call for richer descriptions: subjects, names of people, corporations, places, time period. Manuscripts fall in the middle: they get richer metadata than newspapers but less descriptive metadata than photographs. OCR’ed text of manuscripts adds value to the metadata and enhances keyword searching within the documents.
[MARINA]
Metadata comes in several flavors (see slide)
We already briefly mentioned that descriptive metadata provides the intellectual access to the content of a digital collection. It’s a set of elements used to describe, catalog or index digital resources. A good resource to learn more about metadata in general and how to do it the right way is a practical manual by Steven Miller. It will give you a comprehensive overview and many practical examples, and will help you get started.
Administrative metadata is a set of elements used to administer and manage digital objects and collections. This includes the name of the institution creating the digital objects, date digitized, digitization equipment, master file name, etc. As the slide shows, administrative metadata has three subtypes: technical, preservation and rights.
Technical and preservation metadata collect the information needed for the long-term preservation of the digital object and its migration to other digital formats as software and hardware change over time. The PREMIS working group defines preservation metadata as “the information a repository uses to support the digital preservation process”. After completing every digitization initiative, we should think strategically about organizing and preserving the digital master files for the long term. Digital preservation is important because digital files are fragile (they can get damaged!), technology dependent, and quickly become obsolete as technologies change (floppy disks!). Digital content requires active management to ensure ongoing accessibility. To learn more about preservation metadata and standards, you can refer to the PREMIS Data Dictionary for Preservation Metadata.
Our best practice is to capture important technical metadata during the digitization process. Our metadata profile has several technical metadata fields that record the digitization date of the object, its format, type, size and conversion specifications (for internal use only): the camera information and digitization specifications such as ppi and bit depth of the file. Currently, we are working on implementing some mandatory PREMIS fields for preservation purposes.
Rights metadata deals with ownership, copyright, and restrictions on use and reproduction.
Structural metadata is used to internally structure complex multi-page digital objects. It labels each image and relates the images as parts of the same digital object, allowing users to navigate through them. We refer to the complex digital object as the parent and to its individual images as children.
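The parent and children relationship can be sketched as a small data structure. This is an illustration only, not the data model of any repository system; the class names `Page` and `CompoundObject`, their fields and the file names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """A child: one image belonging to a multi-page object."""
    label: str      # what the user sees, e.g. "Page 1"
    filename: str   # hypothetical image file name
    order: int      # 1-based position within the parent


@dataclass
class CompoundObject:
    """A parent: the multi-page object that groups its children."""
    title: str
    pages: list = field(default_factory=list)

    def add_page(self, label, filename):
        # The position is assigned in the order the pages are added.
        self.pages.append(Page(label, filename, order=len(self.pages) + 1))

    def next_page(self, order):
        """Return the page after the given 1-based position, or None at the end."""
        return self.pages[order] if 0 <= order < len(self.pages) else None


letter = CompoundObject("Letter to the board")
letter.add_page("Page 1", "letter_001.tif")
letter.add_page("Page 2", "letter_002.tif")
```

The labels and the order are the structural metadata here: they carry no description of the content, only how the pieces fit together so a viewer can page through them.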
[MARINA]
Reusing metadata
Occasionally, we need to create metadata from scratch for unprocessed collections. Usually, though, the best practice is to reuse metadata if it already exists in a finding aid. This is the preferred large-scale approach, as it saves time and resources. So for large-scale projects we reuse metadata from technical services and normalize it to conform to the metadata standards.
The large-scale digitization process works at the folder level and mirrors the structure of the archival collection, and so does the reused metadata: the complex digital objects get parent-level metadata only, drawn from a prioritized list of terms. Sometimes, when the goal is more granularity and the project calls for richer metadata (such as digitization of photographs), we reuse metadata and build upon it by adding more subject terms and conducting more research on name authorities and relationships among people and institutions at the item level.
The graphic above demonstrates how reusing metadata works: in a nutshell, it’s just reorganizing and normalizing it. The bottom graphic shows how reused metadata can have added value: more subject terms and more fields (genre and location).
[MARINA]
So how do we make metadata consistent across all digital objects and collections? We use Metadata application profiles!
Overview
The Metadata application profile is documentation designed prior to beginning digitization. It structures, normalizes, defines and disambiguates the rules for metadata creation and promotes high quality metadata - consistent, useful and interoperable.
Application profiles outline the metadata element set and provide a set of rules for each field: whether it calls for controlled terms and, if so, which encoding scheme applies; whether the field is repeatable; and whether it is mandatory, recommended or optional. Metadata profiles can be global, governing all digital collections in an institution, or collection-specific. Sometimes certain collections need a more specialized set of elements based on their format or content.
In addition to the set of rules for each field and the encoding scheme, metadata profiles may also offer instructions for metadata creators on how to apply these rules, along with examples. Sometimes institutions have two separate documents: the metadata profile with the element set, controlled vocabularies and rules, and a second document that provides indexing guidelines for metadata creators with specific instructions and examples. Yet another example is a project-specific cheat sheet that conforms to the profile but focuses on certain fields, offering more detail and clarification (Gayle will explain).
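The per-field rules described above (obligation, occurrence, encoding scheme) can be written down in machine-readable form. The following Python sketch assumes a tiny three-element profile; the `ElementRule` class, the `PROFILE` list, the `validate` function and the vocabulary terms are hypothetical and are not taken from any actual institutional profile.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ElementRule:
    name: str
    obligation: str                          # "mandatory", "recommended" or "optional"
    repeatable: bool                         # element occurrence
    vocabulary: Optional[frozenset] = None   # allowed terms if the field is controlled


# Hypothetical profile with illustrative rules.
PROFILE = [
    ElementRule("title", "mandatory", repeatable=False),
    ElementRule("subject", "recommended", repeatable=True,
                vocabulary=frozenset({"Hotels", "Neon signs", "Railroads"})),
    ElementRule("date", "mandatory", repeatable=False),
]


def validate(record, profile=PROFILE):
    """Return a list of problems; an empty list means the record conforms.

    Each record maps an element name to a list of values.
    """
    problems = []
    for rule in profile:
        values = record.get(rule.name, [])
        if rule.obligation == "mandatory" and not values:
            problems.append(f"{rule.name}: mandatory element is missing")
        if not rule.repeatable and len(values) > 1:
            problems.append(f"{rule.name}: element is not repeatable")
        if rule.vocabulary is not None:
            for value in values:
                if value not in rule.vocabulary:
                    problems.append(f"{rule.name}: '{value}' is not a controlled term")
    return problems


good = {"title": ["Hotel entrance"], "date": ["1955"], "subject": ["Hotels"]}
bad = {"title": ["A", "B"], "subject": ["Casinos"]}
```

Calling `validate(good)` returns an empty list, while `validate(bad)` reports a repeated title, an uncontrolled subject term and a missing date. A global profile and a collection-specific profile would simply be two different rule lists passed to the same function.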
Benefits
A metadata application profile is important documentation that brings consistency by enforcing the same description rules in all collections.
Metadata application profiles support interoperability and are built on the metadata schema adopted in the institution. Interoperability is the ability to exchange metadata among different systems without special processing or loss of meaning. The metadata schema must suit the content of the collections and fit the existing infrastructure: you can’t adopt MODS if your system supports DC only.
Institutions can also build a custom metadata schema with a local element set, but this practice is not recommended because it does not support interoperability. If local elements are not mapped to a standard schema such as Dublin Core, they will not be displayed in the aggregated repository. If they are mapped from very specific local fields, the field label in the aggregated environment will be less explicit and the data may be unclear to users. For example, the local field “Neon sign type” can be mapped to the Dublin Core subject field, but the label in the aggregated repository will be ‘subject’, and the values will not be as explicit, especially if they come from a local controlled vocabulary. Or, as you see in the example on the slide, three different local fields map to DC spatial coverage, and the values in these fields may be unclear.
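The mapping problem can be seen in a few lines of Python. This is a sketch under assumptions: only “Neon sign type” comes from the example above, while the other local field names, the sample values, the `CROSSWALK` table and the `to_dublin_core` function are hypothetical.

```python
# Hypothetical crosswalk from local field names to simple Dublin Core elements.
CROSSWALK = {
    "Neon sign type": "subject",
    "Photographer": "creator",
    "City": "coverage",
    "Neighborhood": "coverage",
    # "Internal notes" is deliberately left unmapped.
}


def to_dublin_core(local_record, crosswalk=CROSSWALK):
    """Map a local record to DC elements and report the fields that were dropped."""
    dc_record = {}
    dropped = []
    for local_field, value in local_record.items():
        dc_element = crosswalk.get(local_field)
        if dc_element is None:
            # Unmapped local fields never reach the aggregated repository.
            dropped.append(local_field)
            continue
        # Several local fields may collapse into one repeatable DC element.
        dc_record.setdefault(dc_element, []).append(value)
    return dc_record, dropped


local = {
    "Neon sign type": "Roof sign",
    "City": "Example City",
    "Neighborhood": "Downtown",
    "Internal notes": "Rescan needed",
}
dc, dropped = to_dublin_core(local)
```

After the mapping, the sign type is just one more ‘subject’ value and the two place fields are indistinguishable ‘coverage’ values, so the meaning carried by the local labels is lost, and the unmapped field disappears entirely.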
[MARINA]
Dublin Core
I mentioned Dublin Core as one of the metadata standards. It is a small element set used to describe digital resources. Simple DC has 15 elements. Each of them is optional and repeatable, and there is no prescribed order for presenting or using them. Institutions typically take some or all of the elements and sometimes combine them with local descriptive elements and technical metadata fields. This is how the metadata application profile is created. In an aggregated repository where multiple institutions share their data, all significant fields have to be mapped to DC so the values can be harvested and displayed.
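As an illustration of “optional and repeatable”, the sketch below builds a simple DC record as XML with Python’s standard library. The namespace URI is the standard one for the Dublin Core element set; the `record` wrapper element, the `build_record` function and the sample values are assumptions made for this example, since real systems wrap DC in their own container formats.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# The 15 elements of the simple Dublin Core element set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def build_record(values):
    """Build an XML record from a dict of element name -> list of values.

    Any element may be absent (optional) or carry several values (repeatable).
    """
    root = ET.Element("record")  # hypothetical wrapper element
    for name, items in values.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"{name!r} is not a simple Dublin Core element")
        for item in items:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = item
    return root


record = build_record({
    "title": ["Hotel entrance"],
    "subject": ["Hotels", "Neon signs"],   # repeated element
})
xml_text = ET.tostring(record, encoding="unicode")
```

Local fields such as a sign type are rejected here on purpose: before a record can be expressed this way, each local field has to be mapped to one of the 15 elements.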
[MARINA]
Controlled vocabularies help users retrieve sets of meaningfully related resources rather than single random resources on a topic. Controlled vocabularies allow grouping of digital objects that share significant similar characteristics. They are fundamental for the search and browse functionality. These functions work well only if metadata creators enter consistent, standardized values in the metadata fields.
Role
One of the main goals of controlled vocabularies is to disambiguate the meaning of a word, since some words refer to different concepts, places or people. For example, ‘bank’ can be a financial institution or a container; ‘mercury’ can be a metal or a planet. Controlled vocabularies help people distinguish between the different meanings.
The other important role is synonym control. Many natural-language words refer to the same thing or concept. Controlled vocabularies identify the preferred term to use in metadata and explicitly link it to its synonyms, which serve as cross-references.
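Synonym control boils down to a lookup from any variant to the one preferred term. Here is a minimal Python sketch; the mini-vocabulary, the `LOOKUP` table and the `preferred_term` function are hypothetical and are not drawn from any published vocabulary.

```python
# Hypothetical mini-vocabulary: preferred term -> variant (non-preferred) terms.
VOCABULARY = {
    "Automobiles": ["Cars", "Motor cars"],
    "Motion pictures": ["Movies", "Films"],
}

# Build one lookup that accepts preferred or variant terms, ignoring case.
LOOKUP = {}
for preferred, variants in VOCABULARY.items():
    LOOKUP[preferred.lower()] = preferred
    for variant in variants:
        LOOKUP[variant.lower()] = preferred


def preferred_term(term):
    """Return the preferred term, or None if the term is not in the vocabulary."""
    return LOOKUP.get(term.strip().lower())
```

With this in place, `preferred_term("cars")` and `preferred_term("Motor cars")` both lead to "Automobiles", so records described by different people still group together, while an unknown term returns `None` and can be flagged for review.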
[MARINA]
Types
Controlled vocabularies come in several types, but here we’ll focus on the types most commonly used in digital collections - thesauri, authority files and subject heading lists. For the sake of consistency and interoperability the use of established controlled vocabularies is recommended, although in some cases the content can be very specific and require a local vocabulary. Here we’ve listed some commonly used vocabularies. Some of them are linked data ready and provide persistent identifiers for each term.
Issues
Some of the issues that occur with controlled vocabularies are inconsistency and redundancy.
Inconsistency arises during metadata creation: the people who describe the objects (especially graphic materials) may be subjective when selecting terms. Even when the terms come from a controlled authority file, this subjectivity leads to inconsistencies.
Redundancy also arises during metadata creation: metadata creators select from a list of controlled terms and sometimes choose several synonymous terms to describe the same object, which is unnecessary and may lead to retrieval problems. For example, the TGM vocabulary has multiple terms for photos: “photographs”, “photographic prints”, “color photographs”, “Black & white photographs”. Best practice is to decide which term is best for a specific resource type and stick to it rather than using multiple synonymous terms.