1. From Box to Hydra via Archivematica
Turning proof of concept into reality
2. Background
• University of Hull and University of York working on a Research Data Spring
project
• Filling the digital preservation gap, 2015-16
• https://www.york.ac.uk/borthwick/projects/archivematica/
• Dual use cases for the University of Hull
• Digital preservation of archival materials
• Management and preservation of research data
3. Systems background
• Box
• Institutional subscription from 2015
• Supported and managed personal cloud storage service
• Archivematica
• No experience prior to the project, but had watched its development over a period
of years
• Particularly liked the combination of microservices that can be used flexibly
according to use case
4. Repository
• Hydra digital repository – http://hydra.hull.ac.uk
• Implemented 2012 based on previous Fedora repository
• Designed to hold any structured digital collection (within reason!) to meet
University’s needs
• NB ** Hydra is now Samvera **
• Community is refreshing and re-launching for the next decade
• Watch this space – http://samvera.org
• New website and logo coming shortly
5. Questions
• How can we enable a preservation workflow with the systems environment
available to us?
• How can we facilitate pathways to preserving archival materials and
research data alongside each other?
• What is required to bring these different components together to best
effect?
6. Workflow overview
• Ingest to the system, either direct or via ingest folder (Box)
• Archivematica captures content and processes it through microservices
• Archivematica outputs AIP for storage and DIP for repository
• DIP processor unpacks DIPs and creates repository objects
• Repository manages access to objects
7. Project focus
• User assembles files and simple descriptive file(s) in Box
folder. Shares the folder with Archivematica
• System checks folder contents and if OK creates a bag
(BagIt standard) for each object which is passed to
Archivematica
• Archivematica processes the bag to create an AIP which
goes to a preservation store…
• …and also a DIP which is passed to the DIP processor
• DIP processor creates Hydra objects from the DIP
contents and injects them into the repository QA
queue…
• …matched to the AIP by UUID
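The DIP-to-AIP matching above relies on the package UUID; Archivematica conventionally embeds the package UUID at the end of AIP/DIP names. A minimal sketch of that matching step (names and helpers are illustrative, not the hullsync implementation):

```python
import re

# Archivematica package names typically embed the package UUID as a
# suffix, e.g. "mydeposit-550e8400-e29b-41d4-a716-446655440000".
UUID_RE = re.compile(
    r"([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})$",
    re.IGNORECASE,
)

def package_uuid(package_name):
    """Extract the trailing UUID from an AIP or DIP name, if present."""
    match = UUID_RE.search(package_name)
    return match.group(1) if match else None

def match_dip_to_aip(dip_name, aip_names):
    """Find the AIP whose UUID matches the DIP's, or None."""
    uuid = package_uuid(dip_name)
    if uuid is None:
        return None
    return next((a for a in aip_names if package_uuid(a) == uuid), None)
```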
8. Joining up the dots
• The joins between the three components were:
• A ‘Box-watcher’ – users share their data with a nominated Box user account for
the Archivematica system. This account watches for shares, automatically
creates a BagIt bag of the files found, and transfers it to Archivematica for
processing
• A ‘DIP processor’ – this takes the BagIt DIP from Archivematica, unpacks it
and uses the information within to create repository objects
• These tools were wrapped into a single gem, hullsync
• https://github.com/uohull/hullsync
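hullsync itself is a Ruby gem; as a language-neutral illustration, here is a stdlib-Python sketch of the minimal BagIt layout the Box-watcher hands to Archivematica (a `bagit.txt` declaration, a `data/` payload directory, and a SHA-256 manifest). Function names are illustrative, not the gem's API.

```python
import hashlib
import shutil
from pathlib import Path

def make_bag(src_dir, bag_dir):
    """Create a minimal BagIt bag from src_dir at bag_dir.

    Sketch of the structure passed to Archivematica; the real
    implementation lives in the hullsync gem.
    """
    src, bag = Path(src_dir), Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True)

    # Bag declaration required by the BagIt specification
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
    )

    # Copy payload files and record SHA-256 checksums in the manifest
    lines = []
    for f in sorted(src.rglob("*")):
        if f.is_file():
            rel = f.relative_to(src)
            dest = data / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, dest)
            digest = hashlib.sha256(dest.read_bytes()).hexdigest()
            lines.append(f"{digest}  data/{rel.as_posix()}")
    (bag / "manifest-sha256.txt").write_text("\n".join(lines) + "\n")
```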
9. Deposit options
• Depositors have several options:
• A folder containing multiple data files and one descriptive file → a single
AIP and a single repository object with (optionally) one or more surrogate
files for download (so can be a “metadata-only” record)
• A folder containing multiple files and a CSV file (one row per file) →
multiple AIPs with multiple repository objects, each with (optionally) a
surrogate for download
• A folder containing the top-level folder of a structure → a zipped structure
in a single AIP and a single repository object (optionally) containing the
zipped file for download
10. In detail – option 1
• A folder containing multiple data files and one descriptive file → a single
AIP and a single repository object with (optionally) one or more surrogate
files for download (so can be a “metadata-only” record)
• Data files are associated with a .txt descriptive file providing associated metadata
• Descriptive file can be used to determine access permissions and content model
• Descriptive metadata can be provided using Dublin Core
• Can also submit README.txt for information to inform repository staff on
appropriate actions
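The exact layout of the descriptive .txt file is not specified on the slide; assuming one "field: value" pair per line with Dublin Core-style field names (and repeatable fields), a parsing sketch might look like:

```python
def parse_descriptive_txt(text):
    """Parse a simple "field: value" descriptive file into metadata.

    Assumes one "dc.field: value" pair per line, with repeated fields
    allowed (e.g. multiple dc.creator lines). Hypothetical format,
    for illustration only.
    """
    metadata = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank and malformed lines
        field, value = line.split(":", 1)
        metadata.setdefault(field.strip(), []).append(value.strip())
    return metadata
```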
11. In detail – option 2
• A folder containing multiple files and a CSV file (one row per file) →
multiple AIPs with multiple repository objects, each with (optionally) a
surrogate for download
• Use a .csv file instead of a .txt file for the descriptive information
• Use column headings to cover the same fields as in option 1
• Can associate the same or different metadata with each object
• Can create simple or compound objects
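A one-row-per-file CSV naturally maps each row to one repository object. The column names below ("filename" plus Dublin Core-style metadata columns) are assumptions for illustration, not Hull's actual schema:

```python
import csv
import io

def rows_to_objects(csv_text):
    """Read a one-row-per-file descriptive CSV into per-object records.

    Each row becomes one object: the (assumed) "filename" column names
    the data file, and the remaining columns carry its metadata.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    objects = []
    for row in reader:
        filename = row.pop("filename")
        objects.append({"filename": filename, "metadata": row})
    return objects
```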
12. In detail – option 3
• A folder containing the top-level folder of a structure → a zipped structure
in a single AIP and a single repository object (optionally) containing the
zipped file for download
• Aim is to allow the submission of a folder or nested folders of data, replicating how
the files are organised
• Files are unpacked by Archivematica, analysed, and then re-zipped up for submission
to the repository
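The re-zip step, preserving the folder's internal layout, can be sketched with the standard library (the real pipeline has Archivematica unpack and analyse the files first; function and path names here are illustrative):

```python
import shutil

def zip_structure(folder, out_stem):
    """Zip a (possibly nested) folder, preserving its internal layout.

    shutil.make_archive archives the contents of root_dir and returns
    the path of the created .zip file.
    """
    return shutil.make_archive(out_stem, "zip", root_dir=folder)
```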
13. Lessons learned
• Error handling needs attention when turning the proof of concept into
production
• But testing highlighted many of the errors that would need handling
• A key element when joining systems together
• Normalisation of filetypes requires additional consideration
• E.g., how to deal with TIFF files converted to JPG
• The zipping and unzipping workflows require further attention to ensure
success for this option
14. Next steps
• Take learning and tools from the Research Data Spring project and use these
as the basis for development of services
• Two use cases
• Research data storage and management service development
• City of Culture digital archive
• Understanding Archivematica pipelines and options better – Perpetua test!
• Focus on improving proof-of-concept and developing additional
functionality
16. Research data storage and management
• Joint Library and ICTD project to discover and understand research data
storage and management needs amongst academic staff
• Open workshops
• Data interviews
• Capture and processing of research data a part of local provision alongside
advice and guidance on options outside the institution
17. City of Culture digital archive
• Hull2017 – City of Culture
• Events throughout the year
• Four data elements
• Business archive
• Creative archive
• Participatory archive
• Research and evaluation archive
• Applying the same technology environment to manage ingest and delivery
18. Key issues going forward
• What are the differences in pipeline processing in Archivematica between
research data and archival materials?
• Dealing with unusual file formats – a key learning point from the RDS
project
• Scaling up to meet heavier data demands
• Being realistic about what we can’t use this environment for, where
alternative approaches are needed, e.g., Big Data
19. To conclude
• Combining components has its issues, but it has been better to exploit
systems that do certain parts of the workflow well and turn them into more
than the sum of their parts
• Data is not simple
• We need flexibility in how we look to manage it
• We need engagement with researchers to understand it
• Turning an idea into production needs careful planning
• Scope for community exchange or training on how to do this?
20. Thank you
c.awre@hull.ac.uk
(And many thanks to the University of York and my colleagues Richard Green and
Simon Wilson, plus Cottage Labs LLC for their work on this)