Last year Declan Fleming presented ALL TEH METADATAS and reviewed our UC San Diego Library Digital Asset Management system and RDF data model. You may be shocked to hear that all that metadata wasn't quite enough to handle increasingly complex digital library and research data in an elegant way. Our ad-hoc, 8-year-old data model has also been extended in inconsistent ways, and our librarians and developers have not always been perfectly in sync in understanding how the data model has evolved over time.
In this presentation we'll review our process of locking a team of librarians and developers in a room to figure out a new data model, from domain definition through building and testing an OWL ontology. We'll also cover the challenges we ran into, including the review of existing controlled vocabularies and ontologies, or the lack thereof, and the decisions made to cover the gaps. Finally, we'll discuss how we engaged the digital library community for feedback and what we have to do next. We all know that Things Fall Apart; this is our attempt at Doing Better This Time.
Code4Lib 2013 - All THE Metadatas Re-Revisited
1. ALL TEH METADATAS
Re-revisited
2013 code{4}lib Meeting
February 13, 2013
Esmé Cowles
Matthew Critchlow
Bradley Westbrook
2. Overview
• Needs assessment and proposed solution
• Data modeling
• Tool implementation
3. Overview
• Needs assessment and proposed solution (Brad Westbrook)
• Data modeling
• Tool implementation
8. Need Four: Align more strongly with DL community
• Make sure UCSD RDF is public facing
– Use public vocabularies
– Make UCSD vocabularies public
• Develop technology stack
– Utilize contributions from non-UCSD sources
– Contribute to non-UCSD endeavors
10. Project Overview
Research Data Curation Pilot Deadline: June, 2013
Timeline: July 16, 2012 – Oct 29, 2012
Deliverables
• Abstract Data Model
• OWL/RDF Ontology
• Data Model Extension Guidelines
Team
Metadata Analysts: Arwen Hutt, Bradley Westbrook
IT: Esmé Cowles, Matt Critchlow, Longshou Situ
11. User Stories
As an administrative unit manager, I want to indicate any external versions or descriptions of an object that may be of probable importance to a user
As a user, I want to know what collection(s) an object belongs to
As a DAMS manager, I want to know what administrative unit an object belongs to
12. Abstract Model – High Level
[Diagram: the core classes Collection, Object, Component, and File, plus Unit and Related Resource. Collections contain Objects, Objects contain Components and Files, and Components contain Files and nested Components.]
13. Abstract Model
[DAMS Data Model entity-relationship diagram, 2013-02-07, distinguishing class linking from class inheritance. Core classes: Collection (with DAMS Collection, Provenance Collection, and Provenance Collection Part), Object, Component, and File, plus the administrative Unit. Descriptive classes: Title, Subject, Name, Relationship (Name/Role), Date, Note, Language, Cartographics, Related Resource, Source Capture, Vocabulary, and Vocabulary Entry. Event and rights classes: Event, Copyright, License, Statute, Other Rights, Rights Action, Restriction, and Permission.]
14. Data Dictionary
Title (title 1-m)
Copyright (copyright 1)
Language (language 1-m)
Administrative Unit (unit 1)
Relationship (relationship 0-m)
20. DAMS Repository
• New version of our lightweight repository
– Metadata in triplestore
– Files on disk or cloud storage
• Explicit structural metadata
• Native REST API
• Fedora REST API (partial)
21. DAMS Manager
• Separate Java webapp
• Ingest, batch operations
• Uses DAMS Repository REST API
• Functionality moved into the repository
– Characterization (JHove)
– Fixity checking
– Derivatives (ImageMagick)
22. DAMS Public Access System
• Old frontend is unsustainable
• New frontend in Hydra
– Backed by DAMS Repo, not Fedora
• Hydra platform and community
23. Timeline
• Started 2 months ago
• Code sprint in January with cbeer and jcoyne
• March: Beta release with research data
• Spring: Migrating existing content
• Summer: Production release
24. One More Thing
• We’ve talked about DAMS for years...
• Now we have code to share
http://github.com/ucsdlib/
@escowles @mattcritchlow
bdwestbrook@ucsd.edu
Editor's notes
Hi. I am Brad Westbrook. This is Matt Critchlow, and this is Esme Cowles. We would like to thank you all for selecting us to present this morning. We are going to tell you part of a continuing story. Declan Fleming told an earlier part of this story last year, and that part concerned how UCSD had implemented RDF for its Digital Asset Management System beginning in 2004 and how it hopes to move its DAMS into the linked data environment. In our part of the story, which is still more toward the beginning than the end, we are going to describe some of the barriers we found standing between the UCSD DAMS and the linked data terminus envisioned in Declan’s previous talk and what measures we are taking to get past those barriers.
I am going to briefly describe four key areas that we decided needed to be addressed to improve the functionality and sustainability of the UCSD DAMS.
The most basic need we have is to improve the consistency of the metadata in the DAMS. [ANIMATION ONE] We acquire highly variable content files and metadata. Metadata may come in the form of comma-separated values, Excel spreadsheets, MODS exports from the Archivists' Toolkit, and MARCXML generated from our ILS. Some of the metadata we acquire was created according to established content and format standards; other metadata has no standard behind it. Some metadata is very thorough, including titles, statements of responsibility, dates of creation or issuance, notes, and even subject and name headings. Other metadata can be very scanty and has to be supplemented as best as possible during the ingest process. [ANIMATION TWO] We normalize content files to supported file formats through the file ingest process, and we normalize acquired descriptive data through what we call the assembly plan process. As Declan described in his presentation last year, UCSD's assembly plan process is a specification for stamping out object records for a class of objects, the class usually being defined by provenance. The assembly plan is engineered to ensure that all objects in the DAMS have the metadata elements and formats necessary to support rudimentary object interoperability within the DAMS and, moreover, that a class of objects is described similarly. The assembly plan as it is handed to ETL staff is expressed in XML and references the MODS, PREMIS, MIX, and METS schemas. ETL staff, of which we have four, then use XSLT to transform the XML into RDF. [ANIMATION THREE] Since the start of our work in 2005, the transformation from XML to RDF has been highly artisanal. Each ETL staff member has created her or his own stylesheet for producing the transformations. Unfortunately, during this time there was no explicit data model to help control the stylesheets for uniformity. Consequently, the uniformity among objects established in the assembly plan process is erased in the RDF transformations.
These differences impact object interoperability, UI display, and, of course, user experience.
The second need we identified was to discover a way to maintain the syntax of hierarchical name and subject headings inherited from MARC records. MARC records are the source for approximately 70% of the DAMS object records. As this illustration indicates, the syntax of subject headings from a MARC record is corrupted in the RDF transformations. In this particular case, only one subject heading of the eight in the source MARC record is correctly transformed to RDF. All the rest are scrambled, with the primary term being demoted one or more levels in the hierarchy. One subject heading, in fact, acquires information not present in the source record. Needless to say, this problem also impacts searching and interpretation of the digital assets in the DAMS. It also reflects poorly on the Library’s metadata staff.
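The fix for this corruption comes down to preserving subdivision order when a heading is transformed. As a minimal Python sketch (the heading below is hypothetical, not taken from an actual UCSD record), splitting a MARC-style heading on its "--" subdivision delimiters while keeping source order looks like:

```python
def parse_marc_subject(heading):
    """Split a MARC-style subject heading on its '--' subdivision
    delimiters, preserving source order so the primary term stays first."""
    return [part.strip() for part in heading.split("--") if part.strip()]

# Hypothetical heading, not from the UCSD DAMS:
parts = parse_marc_subject("Jazz -- California -- San Diego -- History")
# parts[0] is the primary term; subdivisions follow in their original order
```

Any transformation that reorders this list, or promotes a subdivision above `parts[0]`, produces exactly the scrambling described above.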
A third area we found needing improvement is our presentation of the many complex objects in the DAMS. A complex object is a multi-part object with components and metadata at the component level; the components may be few or many. The complex object record includes two structure maps. One is a physical map that simply represents the sequence of the files comprising the object. The other is a logical map that correlates components with any descriptive or rights metadata they might have. The current DAMS digital object viewer provides only the physical file map. In this example it is a scrollable list of 24 components. The audio file reflected in the viewing pane is the file for the first of the 24 components, but there is no way within the digital object viewer for a user to know that the audio file contains a discussion by three musicians about another musician's technique.
But that metadata is in the object record, as this XML representation of the second component indicates. The descriptive metadata includes a title and names of presenters, as well as indication of what exact day of the multi-day conference the presentation occurred. This is all important information for navigating the 24 parts of the object. Because of this limitation, patrons and reference librarians have found the digital object viewer very difficult, in fact nearly impossible, to use for navigating multi-part objects. They can move through the nodes but they have no idea what a node contains until they activate the file. To help patrons understand and navigate complex objects, our public service librarians have often had to resort to using the Archivists’ Toolkit, which is only deployed at UCSD as a staff side authoring tool for complex digital object metadata. This extra step is very inconvenient to patron and staff and leads to considerable frustration.
The fourth area of need is to align our DAMS more strongly with the digital library community. We want to do that on a micro level by making sure our RDF is outward facing, so we can utilize public vocabularies and also make our own vocabularies available for use by others. And we want to do it on a macro level by making sure we are developing a technology stack that can be shared with others and that others can contribute to, should they like. Our past methods have been tightly insular and, were they continued, would force us to make every vocabulary and technical advance by ourselves. We see strengthening our alignment with the community as the most important pathway toward a cost-effective, generally purposed, and sustainable digital asset management service for UC San Diego and for others. These needs, especially the desire for stronger community alignment, triggered the data modeling and implementation work over the last 10 months that Matt and Esme will now describe. [ADVANCE TO NEXT SLIDE]
In addition to the needs Brad just described, we found ourselves facing a very real hard deadline for our Research Data Curation Pilot Program. This deadline is what ultimately crystallized the need to prioritize this project. Our Digital Library steering committee gave us a timeline of 3 months and asked us to provide: an abstract data model, an OWL/RDF ontology, and documentation for extending and modifying the data model over time. The team was a balance of metadata analysts and IT development staff, assembled to address the previous gaps in understanding and communication that Brad described, so that nothing would be lost in translation in this project or going forward. That cross section of RCI and the DLP was critical: the scope of our DAMS has grown, and it is not only internal library collections anymore.
So, with the goal of representing the needs of both the Library and the research community on campus, our team decided to generate a user stories document for the data model project. Focusing here on the objects, you can see various roles called out, like administrative unit manager, DAMS manager, and end user. After an initial pass at these user stories, we moved on to creating the requested deliverables themselves.
The first phase of the project involved the creation of an abstract data model as our domain definition. We began with a high-level entity relationship diagram, calling out the core base classes (Collection, Object, Component, and File) and the various relationships between them. If we had stopped here, we probably would have finished in a few weeks, but it would have also been a pretty useless data model.
The next step was to create a fleshed-out abstract model. The diagram has the same core classes as the previous slide, but we have added modeling for rights, events, and descriptive metadata, along with their relationships and inheritance structure. Our current data model, as Declan talked about last year, is very flexible in the definition, re-use, and modification of predicates for our RDF triples, which gives us the ability to make global changes with relative ease; this capability will be carried forward into the new system. But the RDF triples are almost all stored inline with each object, and that's something we wanted to change. Take names, for example, particularly the Name/Role relationship. The main UCSD Library building, shown on our first slide, is called the Geisel Library. It's named after Theodore Geisel, whom we all know and love as Dr. Seuss. As a result of his generous donations, we have a number of Dr. Seuss collections in our DAMS. Many of these objects each carry a set of RDF triples with the name Theodore Geisel and the role Creator. We wanted to move toward a linked data approach where the Geisel name could stand on its own as an object with an ARK, our permanent UID, that can also include external authority references like LOC subject headings and the NAF. That name object could then be referenced by collections, objects, and components as needed. So we created our abstract model with this goal in mind. The next step was mapping this ERD into a data dictionary.
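The shift from inline triples to a shared, ARK-identified name record can be sketched with triples as plain tuples. Everything here is illustrative: the ARKs, the predicate names, and the authority URI are placeholders, not real UCSD identifiers.

```python
# Triples as (subject, predicate, object) tuples. All identifiers below
# are hypothetical placeholders, not actual UCSD ARKs or LOC URIs.
GEISEL = "ark:/99999/geisel-name"  # hypothetical ARK for the shared name record

name_record = [
    (GEISEL, "rdf:type", "mads:PersonalName"),
    (GEISEL, "mads:authoritativeLabel", "Geisel, Theodor Seuss"),
    (GEISEL, "mads:hasExactExternalAuthority", "loc-naf-uri-placeholder"),
]

# Two different objects reference the one name record instead of each
# repeating inline name/role triples:
seuss_went_to_war = [("ark:/99999/obj1", "dams:relationship", GEISEL)]
advertising_artwork = [("ark:/99999/obj2", "dams:relationship", GEISEL)]
```

A global change to the Geisel record (say, adding an authority link) now happens in one place instead of on every object that mentions him.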
The data dictionary is represented as an enormous table of class hierarchy, properties, controlled vocabulary references/lists, constraints, and notes. Just a couple of comments about it. First, while our objects can certainly be very richly described, as you've seen on previous slides, we wanted to allow for a wide range of description. So in our new data model, an object only needs to have a title, a copyright statement, a language, and an association with an administrative unit. Note that an object doesn't have to have a file. This was a relatively recent requirement that surfaced in our research data projects, and one we wanted to call out and support; traditionally, we had always required an object to have a file as part of its definition. Second, this document was the transition point between the abstract modeling and creating an ontology, so this is where we spent a lot of our time in the project. It was at this point that we began looking closely at standards we were familiar with, such as MODS, PREMIS, MIX, and Dublin Core, and then began looking for ontologies built on these standards that we could use for our data model. Continuing with our relationship example, we defined that an object can have 0 to many relationships as Name/Role pairs. The questions we had to start asking ourselves were: How do we model the names, roles, and relationships using an ontology we know of? Or are there other ontologies we should be considering instead? Or is there nothing that directly covers a particular need, so that we have to, at least temporarily, model it ourselves? These were the kinds of questions we needed answers to as we started creating our ontology in Protégé.
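That minimal requirement (title, copyright, language, administrative unit, but no file) can be sketched as a validation check. The field names here are illustrative shorthand, not the data dictionary's actual property names.

```python
# Per the data dictionary, an object minimally needs a title, a copyright
# statement, a language, and an administrative unit -- but NOT a file.
# (Field names are illustrative, not the actual property names.)
REQUIRED = {"title", "copyright", "language", "unit"}

def missing_fields(record):
    """Return which of the minimally required fields a record lacks."""
    return sorted(REQUIRED - record.keys())

# A file-less research data object still passes the minimal check:
dataset = {"title": "Survey results", "copyright": "UC Regents",
           "language": "eng", "unit": "Research Data Curation"}
```

Note there is no `"file"` key in `REQUIRED`, which is exactly the relaxation the research data projects needed.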
With names, we were fortunate to find our answer in the MADS ontology. MADS stands for Metadata Authority Description Schema, and it is available on the Library of Congress website. It has a specification for names, as well as subjects, that aligned really well with our needs. So we were able to define in our ontology that a relationship consists of one MADS name type (Conference, Corporate, Family, or Personal Name) and at least one role. Our roles controlled vocabulary is the MARC relator codes, also available on the LOC website. With rights metadata we were less fortunate. We found the PREMIS draft ontology, read through it, and pulled it into Protégé to review it, but it is still in draft form, and we need a working solution now, not just because of our deadline, but also because our rights metadata drives our access model in the DAMS. So we found ourselves striking a balance between leveraging a community standard ontology like MADS and creating a PREMIS-like rights metadata implementation locally. To be clear, our intention is that when the PREMIS ontology is in a production state, we will adopt it in place of our local implementation. These are just two examples, but they're very indicative of how the ontology creation process went for us. I want to close, before I hand things off to Esme, with a little more Dr. Seuss.
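The constraint described above (one MADS name type paired with at least one MARC relator role) can be sketched in Python. The relator codes shown are a small subset of the LOC relator list, and the returned dict shape is illustrative, not the actual ontology serialization.

```python
# The four MADS name types the model allows, and a few MARC relator
# codes (a small subset of the LOC list, for illustration only).
MADS_NAME_TYPES = {"ConferenceName", "CorporateName", "FamilyName", "PersonalName"}
RELATOR_CODES = {"cre": "Creator", "ctb": "Contributor", "pht": "Photographer"}

def make_relationship(name_type, name_label, role_codes):
    """A relationship pairs exactly one MADS name with at least one role."""
    if name_type not in MADS_NAME_TYPES:
        raise ValueError(f"not a MADS name type: {name_type}")
    if not role_codes or any(c not in RELATOR_CODES for c in role_codes):
        raise ValueError("a relationship needs at least one known relator code")
    return {"name": {"type": name_type, "label": name_label},
            "roles": [RELATOR_CODES[c] for c in role_codes]}
```

A relationship with no roles, or with a name type outside the MADS four, is rejected, mirroring the 1-name / 1-to-many-roles cardinality in the ontology.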
Here are two images created by Dr. Seuss, from two different collections in our DAMS. Left: the Dr. Seuss Went to War Collection. Right: the Dr. Seuss Advertising Artwork Collection. Each has metadata properties that make it distinct, but they also reference entities that can be shared. Going back to the goal of a linked data solution, we can now define the following in the new data model.
In this diagram there are properties specific to each object: a title on the left, a date on the right. There are also shared entities that can be referenced by any object in the DAMS. Each object has a mads:Topic with an LOC subject heading, and each shares the same relationship: the MADS Personal Name Dr. Seuss with the role Creator. So in the span of a few months, we ended up with a new data model that we're pretty proud of. Esme is going to catch you up on what we've been working on since November.
Implementation phase includes people from our digital library program, research data curation program, metadata analysts, and developers.
Our data model got a little more complex, so we knew we’d need to make some pretty substantial changes.
This is a new version of our repository, built to handle the new data model. It keeps the same architecture: triples and files, with cloud storage. The big change is consolidating our servlets into a coherent REST API, a partial clone of the Fedora REST API.
There are some changes to our administrative app (ingest, batch operations). The big change is that it now uses the repository REST API; it used to talk to files and triples directly. Functionality has moved into the repository, triggered by REST API calls: file characterization with JHove, fixity checking, and derivative generation.
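Of those three, fixity checking is the easiest to illustrate: recompute a stored file's checksum and compare it to the digest recorded at ingest. This is a minimal Python sketch of the idea, not the DAMS implementation.

```python
import hashlib

def fixity_check(path, expected_digest, algorithm="sha256"):
    """Recompute a file's checksum in chunks and compare it to the
    digest recorded at ingest; True means the file is unchanged."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest() == expected_digest
```

Reading in 8 KB chunks keeps memory flat even for the large audio and image masters a DAMS typically stores.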
The biggest changes are to the frontend. Our current frontend is a heavy, client-side JavaScript application. We have to maintain everything ourselves, and all data is retrieved using AJAX, so search engines don't index our content; it is generally unsustainable. We chose Hydra because of the platform and the community. We are new to Rails and really liking it, and the community is really vibrant.
We've gone from nothing to having a rough but working app in about 2 months. A code sprint with Chris Beer and Justin Coyne really helped get us within striking distance of a working system. A beta release, limited to research data, is due in two weeks. We are on track for migrating existing content this spring and a production release this summer.
We've been coming to code4lib since the beginning and talking about our DAMS. Now I'm very happy we have actual code to share.