From Box to Hydra via Archivematica
Turning proof of concept into reality
Background
• University of Hull and University of York working on a Research Data Spring
project
• Filling the digital preservation gap, 2015-16
• https://www.york.ac.uk/borthwick/projects/archivematica/
• Dual use cases for the University of Hull
• Digital preservation of archival materials
• Management and preservation of research data
Systems background
• Box
• Institutional subscription from 2015
• Supported and managed personal cloud storage service
• Archivematica
• No experience prior to the project, but had watched its development over a period
of years
• Particularly liked the combination of microservices that can be used flexibly
according to use case
Repository
• Hydra digital repository – http://hydra.hull.ac.uk
• Implemented 2012 based on previous Fedora repository
• Designed to hold any structured digital collection (within reason!) to meet
University’s needs
• NB ** Hydra is now Samvera **
• Community is refreshing and re-launching for the next decade
• Watch this space – http://samvera.org
• New website and logo coming shortly
Questions
• How can we enable a preservation workflow with the systems environment
available to us?
• How can we facilitate pathways to preserving archival materials and
research data alongside each other?
• What is required to bring these different components together to best
effect?
Ingest to the system, either direct or via ingest folder (Box)
Archivematica captures content and processes it through microservices
Archivematica outputs AIP for storage and DIP for repository
DIP processor unpacks DIPs and creates repository objects
Repository manages access to objects
Project focus
• User assembles files and simple descriptive file(s) in Box
folder. Shares the folder with Archivematica
• System checks folder contents and if OK creates a bag
(BagIt standard) for each object which is passed to
Archivematica
• Archivematica processes the bag to create an AIP which
goes to a preservation store…
• …and also a DIP which is passed to the DIP processor
• DIP processor creates Hydra objects from the DIP
contents and injects them into the repository QA
queue…
• …matched to the AIP by UUID
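The bagging step above can be sketched as follows. This is a minimal illustration of the BagIt layout (a `data/` payload directory, a `bagit.txt` declaration and a checksum manifest), not the project's own code: the function name `make_bag`, the choice of SHA-256 and the BagIt version string are assumptions made for the example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy source_dir into bag_dir/data and write the basic BagIt tag files.

    Hypothetical helper for illustration; bag_dir must not already contain
    a data/ directory.
    """
    data_dir = bag_dir / "data"
    shutil.copytree(source_dir, data_dir)  # also creates bag_dir

    # Bag declaration required by the BagIt specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <relative path>" line per file.
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-sha256.txt").write_text("".join(lines), encoding="utf-8")
    return bag_dir


if __name__ == "__main__":
    work = Path(tempfile.mkdtemp())
    deposit = work / "deposit"
    deposit.mkdir()
    (deposit / "readings.csv").write_text("t,value\n0,1.5\n", encoding="utf-8")
    bag = make_bag(deposit, work / "bag")
    print((bag / "manifest-sha256.txt").read_text(encoding="utf-8"))
```

A production service would more likely use an established BagIt library and validate the bag before transfer; the sketch only shows what "creates a bag" amounts to on disk.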
Joining up the dots
• The joins between the three components were:
• A ‘Box-watcher’ – users share their data with a nominated Box user account for the
Archivematica system. This account watches for folders shared with it, automatically
creates a BagIt bag of the files found, and transfers this to Archivematica for processing
• A ‘DIP processor’ – this takes the bagged DIP from Archivematica, breaks it open and
uses the information within it to create repository objects
• These tools were wrapped into a single gem, hullsync
• https://github.com/uohull/hullsync
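Matching repository objects back to their AIP by UUID might look like the sketch below. It assumes that each package name contains the package UUID (for example `<name>-<uuid>`); that naming pattern, and the function names `package_uuid` and `match_dips_to_aips`, are assumptions for illustration and are not taken from hullsync.

```python
import re
import uuid
from typing import Dict, Iterable, Optional

# Canonical 8-4-4-4-12 hexadecimal UUID form.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def package_uuid(name: str) -> Optional[uuid.UUID]:
    """Return the first UUID found in a package name, or None if there is none."""
    match = UUID_PATTERN.search(name)
    return uuid.UUID(match.group(0)) if match else None


def match_dips_to_aips(
    dip_names: Iterable[str], aip_names: Iterable[str]
) -> Dict[str, Optional[str]]:
    """Map each DIP name to the AIP name sharing its UUID (None if unmatched)."""
    aips = {package_uuid(name): name for name in aip_names}
    aips.pop(None, None)  # ignore AIP names that carry no UUID
    return {name: aips.get(package_uuid(name)) for name in dip_names}
```

Returning `None` for an unmatched DIP, rather than raising, leaves the decision about orphaned packages to the caller, which fits a QA queue where staff review each object.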
Deposit options
• Depositors have several options:
• A folder containing multiple data files and one descriptive file → a single AIP and a single repository
object with (optionally) one or more surrogate files for download (so can be a “metadata-only”
record)
• A folder containing multiple files and a csv file (one row per file) → multiple AIPs with multiple
repository objects, each with (optionally) a surrogate for download
• A folder containing the top-level folder of a structure → a zipped structure in a single AIP and a single
repository object (optionally) containing the zipped file for download
In detail – option 1
• A folder containing multiple data files and one descriptive file → a single
AIP and a single repository object with (optionally) one or more surrogate
files for download (so can be a “metadata-only” record)
• Data files are associated with a .txt descriptive file providing associated metadata
• Descriptive file can be used to determine access permissions and content model
• Descriptive metadata can be provided using Dublin Core
• Can also submit README.txt for information to inform repository staff on
appropriate actions
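The slides do not show the layout of the descriptive .txt file, so the following is only a sketch under an assumed format: one `field: value` pair per line, with repeated fields allowed (as Dublin Core permits). The field names and the function `parse_description` are hypothetical.

```python
from typing import Dict, List


def parse_description(text: str) -> Dict[str, List[str]]:
    """Parse 'field: value' lines into a dict of lists (fields may repeat)."""
    fields: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first colon only, so values such as URLs stay intact.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'field: value', got {line!r}")
        fields.setdefault(key.strip().lower(), []).append(value.strip())
    return fields
```

Under this assumed format, a file containing `title: Survey data` and two `creator:` lines would yield one title and a two-item creator list, and an extra line such as `access: open` could drive the access-permission decision mentioned above.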
In detail – option 2
• A folder containing multiple files and a csv file (one row per file) → multiple
AIPs with multiple repository objects, each with (optionally) a surrogate for
download
• Use a .csv file instead of a .txt file for the descriptive information
• Use column headings to cover the same fields as in option 1
• Can associate the same or different metadata with each object
• Can create simple or compound objects
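Reading the one-row-per-file CSV could be sketched as below. The column headings used here (`filename`, `title`, `creator`) are assumptions for the example; the slides only say that the headings cover the same fields as option 1.

```python
import csv
import io
from typing import Dict, List


def rows_to_objects(csv_text: str) -> List[Dict[str, str]]:
    """Turn each CSV row into one metadata record, keyed by column heading."""
    reader = csv.DictReader(io.StringIO(csv_text))
    objects = []
    # Data rows start on line 2 because line 1 holds the headings.
    for line_number, row in enumerate(reader, start=2):
        if not (row.get("filename") or "").strip():
            raise ValueError(f"line {line_number}: missing filename")
        # Short rows give None values and surplus cells land under a None key;
        # normalise the former to "" and drop the latter.
        objects.append(
            {k: (v or "").strip() for k, v in row.items() if k is not None}
        )
    return objects
```

Each returned record would then describe one repository object, so rows can carry the same or different metadata as the deposit requires.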
In detail – option 3
• A folder containing the top-level folder of a structure → a zipped structure
in a single AIP and a single repository object (optionally) containing the
zipped file for download
• Aim is to allow the submission of a folder or nested folders of data, replicating how
the files are organised
• Files are unpacked by Archivematica, analysed, and then re-zipped up for submission
to the repository
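Re-zipping a folder so that the archive replicates how the files are organised can be sketched with the standard library. The function name `zip_structure` is hypothetical, and this is not how Archivematica itself packages content; it only illustrates preserving relative paths under the top-level folder.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_structure(top_dir: Path, zip_path: Path) -> Path:
    """Zip top_dir so archive paths start at the top-level folder name.

    zip_path should sit outside top_dir. Empty directories are not stored.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(top_dir.rglob("*")):
            if path.is_file():
                # Keep the top-level folder itself as the first path component.
                arcname = path.relative_to(top_dir.parent).as_posix()
                archive.write(path, arcname)
    return zip_path


if __name__ == "__main__":
    work = Path(tempfile.mkdtemp())
    (work / "project" / "raw").mkdir(parents=True)
    (work / "project" / "raw" / "run1.dat").write_text("1,2,3", encoding="utf-8")
    out = zip_structure(work / "project", work / "project.zip")
    print(zipfile.ZipFile(out).namelist())
```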
Lessons learned
• Error handling needs attention when turning the p-o-c into production
• But the testing highlighted a lot of the errors that would need handling
• A key element when joining systems together
• Normalisation of filetypes requires additional consideration
• E.g., how to deal with TIFF files converted to JPG
• The zipping and unzipping workflows require further attention to ensure
success for this option
Next steps
• Take learning and tools from the Research Data Spring project and use these
as the basis for development of services
• Two use cases
• Research data storage and management service development
• City of Culture digital archive
• Understanding Archivematica pipelines and options better – Perpetua test!
• Focus on improving proof-of-concept and developing additional
functionality
Research data storage and management
• Joint Library and ICTD project to discover and understand research data
storage and management needs amongst academic staff
• Open workshops
• Data interviews
• Capture and processing of research data as part of local provision, alongside
advice and guidance on options outside the institution
City of Culture digital archive
• Hull2017 – City of Culture
• Events throughout the year
• Four data elements
• Business archive
• Creative archive
• Participatory archive
• Research and evaluation archive
• Applying the same technology environment to manage ingest and delivery
Key issues going forward
• What are the differences in pipeline processing in Archivematica between
research data and archival materials?
• Dealing with unusual file formats – a key learning point from the RDS
project
• Scaling up to meet heavier data demands
• Being realistic about what we can’t use this environment for, and where we need
alternative approaches, e.g., Big Data
To conclude
• Combining components has its issues, but it has been better to exploit
systems that do certain parts of the workflow well and turn them into more
than the sum of their parts
• Data is not simple
• We need flexibility in how we look to manage it
• We need engagement with researchers to understand it
• Turning an idea into production needs careful planning
• Scope for community exchange or training on how to do this?
Thank you
c.awre@hull.ac.uk
(And many thanks to the University of York and my colleagues Richard Green and
Simon Wilson, plus Cottage Labs LLC for their work on this)

Contenu connexe

Tendances

Grant Funding Programme
Grant Funding ProgrammeGrant Funding Programme
Grant Funding ProgrammeJisc RDM
 
Engaging researchers in RDM & Open Data at Edinburgh University
Engaging researchers in RDM & Open Data at Edinburgh UniversityEngaging researchers in RDM & Open Data at Edinburgh University
Engaging researchers in RDM & Open Data at Edinburgh UniversityRobin Rice
 
UK Research Data Discovery Service metadata schema
UK Research Data Discovery Service metadata schemaUK Research Data Discovery Service metadata schema
UK Research Data Discovery Service metadata schemaJisc RDM
 
Scottish Digital Library Consortium Meeting: Edinburgh DataShare
Scottish Digital Library Consortium Meeting: Edinburgh DataShareScottish Digital Library Consortium Meeting: Edinburgh DataShare
Scottish Digital Library Consortium Meeting: Edinburgh DataShareRobin Rice
 
Business cases and costs RDN
Business cases and costs RDNBusiness cases and costs RDN
Business cases and costs RDNJisc RDM
 
Lightning Talk - Angela Dappart
Lightning Talk - Angela DappartLightning Talk - Angela Dappart
Lightning Talk - Angela DappartJisc RDM
 
Data sharing in the Netherlands
Data sharing in the NetherlandsData sharing in the Netherlands
Data sharing in the NetherlandsJisc RDM
 
PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2
PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2 PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2
PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2 EDINA, University of Edinburgh
 
Jisc Research Data Management Shared Service Workshop: An institutional persp...
Jisc Research Data Management Shared Service Workshop: An institutional persp...Jisc Research Data Management Shared Service Workshop: An institutional persp...
Jisc Research Data Management Shared Service Workshop: An institutional persp...Jisc RDM
 
Going for GOLD - Adventures in Open Linked Geospatial Metadata
Going for GOLD - Adventures in Open Linked Geospatial MetadataGoing for GOLD - Adventures in Open Linked Geospatial Metadata
Going for GOLD - Adventures in Open Linked Geospatial MetadataEDINA, University of Edinburgh
 
Jisc research data shared service overview IDCC 2016
Jisc research data shared service overview IDCC 2016Jisc research data shared service overview IDCC 2016
Jisc research data shared service overview IDCC 2016Jisc RDM
 
Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011
Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011
Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011EDINA, University of Edinburgh
 
National data services lightening talk at the RDA
National data services lightening talk at the RDANational data services lightening talk at the RDA
National data services lightening talk at the RDAJisc RDM
 
RDM shared services at IDCC
RDM shared services at IDCCRDM shared services at IDCC
RDM shared services at IDCCJisc RDM
 
Research at risk: developing a shared research data management service for UK...
Research at risk: developing a shared research data management service for UK...Research at risk: developing a shared research data management service for UK...
Research at risk: developing a shared research data management service for UK...Jisc RDM
 
Research Data Services @ Edinburgh: MANTRA & Edinburgh DataShare
Research Data Services @ Edinburgh: MANTRA & Edinburgh DataShareResearch Data Services @ Edinburgh: MANTRA & Edinburgh DataShare
Research Data Services @ Edinburgh: MANTRA & Edinburgh DataShareHistoric Environment Scotland
 
COBWEB technology platform and future development needs
COBWEB technology platform and future development needsCOBWEB technology platform and future development needs
COBWEB technology platform and future development needsEDINA, University of Edinburgh
 

Tendances (20)

Grant Funding Programme
Grant Funding ProgrammeGrant Funding Programme
Grant Funding Programme
 
SMRUDAS
SMRUDAS SMRUDAS
SMRUDAS
 
Engaging researchers in RDM & Open Data at Edinburgh University
Engaging researchers in RDM & Open Data at Edinburgh UniversityEngaging researchers in RDM & Open Data at Edinburgh University
Engaging researchers in RDM & Open Data at Edinburgh University
 
UK Research Data Discovery Service metadata schema
UK Research Data Discovery Service metadata schemaUK Research Data Discovery Service metadata schema
UK Research Data Discovery Service metadata schema
 
Scottish Digital Library Consortium Meeting: Edinburgh DataShare
Scottish Digital Library Consortium Meeting: Edinburgh DataShareScottish Digital Library Consortium Meeting: Edinburgh DataShare
Scottish Digital Library Consortium Meeting: Edinburgh DataShare
 
Business cases and costs RDN
Business cases and costs RDNBusiness cases and costs RDN
Business cases and costs RDN
 
Lightning Talk - Angela Dappart
Lightning Talk - Angela DappartLightning Talk - Angela Dappart
Lightning Talk - Angela Dappart
 
Data sharing in the Netherlands
Data sharing in the NetherlandsData sharing in the Netherlands
Data sharing in the Netherlands
 
Who is doing what, and how do we know? [PEPRS]
Who is doing what, and how do we know? [PEPRS]Who is doing what, and how do we know? [PEPRS]
Who is doing what, and how do we know? [PEPRS]
 
PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2
PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2 PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2
PECAN Phase 2: Pilot for Ensuring Continuity of Access via Nesli2
 
Jisc Research Data Management Shared Service Workshop: An institutional persp...
Jisc Research Data Management Shared Service Workshop: An institutional persp...Jisc Research Data Management Shared Service Workshop: An institutional persp...
Jisc Research Data Management Shared Service Workshop: An institutional persp...
 
Going for GOLD - Adventures in Open Linked Geospatial Metadata
Going for GOLD - Adventures in Open Linked Geospatial MetadataGoing for GOLD - Adventures in Open Linked Geospatial Metadata
Going for GOLD - Adventures in Open Linked Geospatial Metadata
 
Jisc research data shared service overview IDCC 2016
Jisc research data shared service overview IDCC 2016Jisc research data shared service overview IDCC 2016
Jisc research data shared service overview IDCC 2016
 
Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011
Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011
Using OpenURL Activity Data for Activity Data Programme Meeting 05 July 2011
 
National data services lightening talk at the RDA
National data services lightening talk at the RDANational data services lightening talk at the RDA
National data services lightening talk at the RDA
 
RDM shared services at IDCC
RDM shared services at IDCCRDM shared services at IDCC
RDM shared services at IDCC
 
Research at risk: developing a shared research data management service for UK...
Research at risk: developing a shared research data management service for UK...Research at risk: developing a shared research data management service for UK...
Research at risk: developing a shared research data management service for UK...
 
Research Data Services @ Edinburgh: MANTRA & Edinburgh DataShare
Research Data Services @ Edinburgh: MANTRA & Edinburgh DataShareResearch Data Services @ Edinburgh: MANTRA & Edinburgh DataShare
Research Data Services @ Edinburgh: MANTRA & Edinburgh DataShare
 
RDA UK
RDA UKRDA UK
RDA UK
 
COBWEB technology platform and future development needs
COBWEB technology platform and future development needsCOBWEB technology platform and future development needs
COBWEB technology platform and future development needs
 

Similaire à From Box to Hydra via Archivematica

Using Archivemedia to preserve research data
Using Archivemedia to preserve research dataUsing Archivemedia to preserve research data
Using Archivemedia to preserve research dataARDC
 
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...datascienceiqss
 
Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...Jenny Mitcham
 
"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with Archivematica"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with ArchivematicaJenny Mitcham
 
A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...Jenny Mitcham
 
SWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebSWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebPascal-Nicolas Becker
 
Data Storage
Data StorageData Storage
Data StorageMoghees1
 
Steven McEachern - ADA, DDI (metadata standard) and the Data Lifecycle
Steven McEachern - ADA, DDI (metadata standard) and the Data LifecycleSteven McEachern - ADA, DDI (metadata standard) and the Data Lifecycle
Steven McEachern - ADA, DDI (metadata standard) and the Data LifecycleSteve Androulakis
 
ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017
ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017
ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017ARDC
 
MetadataTheory: Learning Repositories Technologies (9th of 10)
MetadataTheory: Learning Repositories Technologies (9th of 10)MetadataTheory: Learning Repositories Technologies (9th of 10)
MetadataTheory: Learning Repositories Technologies (9th of 10)Nikos Palavitsinis, PhD
 
A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...Jenny Mitcham
 
A collaborative approach to filling the digital preservation gap for RDM
A collaborative approach to filling the digital preservation gap for RDMA collaborative approach to filling the digital preservation gap for RDM
A collaborative approach to filling the digital preservation gap for RDMnortherncollaboration
 
“Filling the digital preservation gap” an update from the Jisc Research Data ...
“Filling the digital preservation gap”an update from the Jisc Research Data ...“Filling the digital preservation gap”an update from the Jisc Research Data ...
“Filling the digital preservation gap” an update from the Jisc Research Data ...Jenny Mitcham
 
Montemayor_AIMS_Inventory_Presentation_revised
Montemayor_AIMS_Inventory_Presentation_revisedMontemayor_AIMS_Inventory_Presentation_revised
Montemayor_AIMS_Inventory_Presentation_revisedGabe Montemayor
 
OpenStack Swift In the Enterprise
OpenStack Swift In the EnterpriseOpenStack Swift In the Enterprise
OpenStack Swift In the EnterpriseHostway|HOSTING
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Roxanne Missingham
 
Shallcross code4lib-midwest 20150724
Shallcross code4lib-midwest 20150724Shallcross code4lib-midwest 20150724
Shallcross code4lib-midwest 20150724mikeum
 

Similaire à From Box to Hydra via Archivematica (20)

Using Archivemedia to preserve research data
Using Archivemedia to preserve research dataUsing Archivemedia to preserve research data
Using Archivemedia to preserve research data
 
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
 
Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...Project update: A collaborative approach to "filling the digital preservation...
Project update: A collaborative approach to "filling the digital preservation...
 
"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with Archivematica"Filling the Digital Preservation Gap" with Archivematica
"Filling the Digital Preservation Gap" with Archivematica
 
Internet content as research data
Internet content as research dataInternet content as research data
Internet content as research data
 
Presentation 16 may keynote karin bredenberg
Presentation 16 may keynote karin bredenbergPresentation 16 may keynote karin bredenberg
Presentation 16 may keynote karin bredenberg
 
A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...
 
SWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebSWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic Web
 
Data Storage
Data StorageData Storage
Data Storage
 
Steven McEachern - ADA, DDI (metadata standard) and the Data Lifecycle
Steven McEachern - ADA, DDI (metadata standard) and the Data LifecycleSteven McEachern - ADA, DDI (metadata standard) and the Data Lifecycle
Steven McEachern - ADA, DDI (metadata standard) and the Data Lifecycle
 
ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017
ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017
ADA, DDI and the data lifecycle - Steve McEachern - 7 April 2017
 
MetadataTheory: Learning Repositories Technologies (9th of 10)
MetadataTheory: Learning Repositories Technologies (9th of 10)MetadataTheory: Learning Repositories Technologies (9th of 10)
MetadataTheory: Learning Repositories Technologies (9th of 10)
 
A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...A collaborative approach to "filling the digital preservation gap" for Resear...
A collaborative approach to "filling the digital preservation gap" for Resear...
 
A collaborative approach to filling the digital preservation gap for RDM
A collaborative approach to filling the digital preservation gap for RDMA collaborative approach to filling the digital preservation gap for RDM
A collaborative approach to filling the digital preservation gap for RDM
 
“Filling the digital preservation gap” an update from the Jisc Research Data ...
“Filling the digital preservation gap”an update from the Jisc Research Data ...“Filling the digital preservation gap”an update from the Jisc Research Data ...
“Filling the digital preservation gap” an update from the Jisc Research Data ...
 
Montemayor_AIMS_Inventory_Presentation_revised
Montemayor_AIMS_Inventory_Presentation_revisedMontemayor_AIMS_Inventory_Presentation_revised
Montemayor_AIMS_Inventory_Presentation_revised
 
OpenStack Swift In the Enterprise
OpenStack Swift In the EnterpriseOpenStack Swift In the Enterprise
OpenStack Swift In the Enterprise
 
BatIg
BatIgBatIg
BatIg
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
 
Shallcross code4lib-midwest 20150724
Shallcross code4lib-midwest 20150724Shallcross code4lib-midwest 20150724
Shallcross code4lib-midwest 20150724
 

Plus de Jisc RDM

2019-06_Eunis_Burland
2019-06_Eunis_Burland2019-06_Eunis_Burland
2019-06_Eunis_BurlandJisc RDM
 
Jisc Research Data Shared Service Open Repositories 2018 Paper
Jisc Research Data Shared Service Open Repositories 2018 PaperJisc Research Data Shared Service Open Repositories 2018 Paper
Jisc Research Data Shared Service Open Repositories 2018 PaperJisc RDM
 
Jisc Research Data Shared Service Open Repositories 2018 24x7
Jisc Research Data Shared Service Open Repositories 2018 24x7Jisc Research Data Shared Service Open Repositories 2018 24x7
Jisc Research Data Shared Service Open Repositories 2018 24x7Jisc RDM
 
Jisc Research Data Shared Service - a Samvera case study
Jisc Research Data Shared Service - a Samvera case studyJisc Research Data Shared Service - a Samvera case study
Jisc Research Data Shared Service - a Samvera case studyJisc RDM
 
Building a national Data Repository Data Modelling
Building a national Data Repository Data ModellingBuilding a national Data Repository Data Modelling
Building a national Data Repository Data ModellingJisc RDM
 
Building a national Data Repository System Integration Architecture Overview
Building a national Data Repository System Integration Architecture OverviewBuilding a national Data Repository System Integration Architecture Overview
Building a national Data Repository System Integration Architecture OverviewJisc RDM
 
Building a National Data Service Open Repositories 2018
Building a National Data Service Open Repositories 2018Building a National Data Service Open Repositories 2018
Building a National Data Service Open Repositories 2018Jisc RDM
 
Research Data Toolkit
Research Data ToolkitResearch Data Toolkit
Research Data ToolkitJisc RDM
 
Pre jisc datachampday_260318
Pre jisc datachampday_260318Pre jisc datachampday_260318
Pre jisc datachampday_260318Jisc RDM
 
Stories from the Field: Data are Messy and that's (kind of) ok
Stories from the Field: Data are Messy and that's (kind of) okStories from the Field: Data are Messy and that's (kind of) ok
Stories from the Field: Data are Messy and that's (kind of) okJisc RDM
 
Fair data - dinkum research - by Andy Turner
Fair data -  dinkum research - by Andy TurnerFair data -  dinkum research - by Andy Turner
Fair data - dinkum research - by Andy TurnerJisc RDM
 
2018 03 codata - making the case
2018 03 codata - making the case2018 03 codata - making the case
2018 03 codata - making the caseJisc RDM
 
Research Data Shared Service update at DPC
Research Data Shared Service update at DPCResearch Data Shared Service update at DPC
Research Data Shared Service update at DPCJisc RDM
 
Research Data Shared Service Webinar #1
Research Data Shared Service Webinar #1Research Data Shared Service Webinar #1
Research Data Shared Service Webinar #1Jisc RDM
 
Managing data behind creative masterpieces -RCM
Managing data behind creative masterpieces -RCMManaging data behind creative masterpieces -RCM
Managing data behind creative masterpieces -RCMJisc RDM
 
Managing data behind creative masterpieces
Managing data behind creative masterpiecesManaging data behind creative masterpieces
Managing data behind creative masterpiecesJisc RDM
 
Lightning Talks - Intro
Lightning Talks - IntroLightning Talks - Intro
Lightning Talks - IntroJisc RDM
 
Lightning Talk - Andrew MacLellan
Lightning Talk - Andrew MacLellanLightning Talk - Andrew MacLellan
Lightning Talk - Andrew MacLellanJisc RDM
 
Lightning Talk - Nick Sheppard
Lightning Talk - Nick SheppardLightning Talk - Nick Sheppard
Lightning Talk - Nick SheppardJisc RDM
 
Lightning talk - Adam Harwood
Lightning talk - Adam HarwoodLightning talk - Adam Harwood
Lightning talk - Adam HarwoodJisc RDM
 

Plus de Jisc RDM (20)

2019-06_Eunis_Burland
2019-06_Eunis_Burland2019-06_Eunis_Burland
2019-06_Eunis_Burland
 
Jisc Research Data Shared Service Open Repositories 2018 Paper
Jisc Research Data Shared Service Open Repositories 2018 PaperJisc Research Data Shared Service Open Repositories 2018 Paper
Jisc Research Data Shared Service Open Repositories 2018 Paper
 
Jisc Research Data Shared Service Open Repositories 2018 24x7
Jisc Research Data Shared Service Open Repositories 2018 24x7Jisc Research Data Shared Service Open Repositories 2018 24x7
Jisc Research Data Shared Service Open Repositories 2018 24x7
 
Jisc Research Data Shared Service - a Samvera case study
Jisc Research Data Shared Service - a Samvera case studyJisc Research Data Shared Service - a Samvera case study
Jisc Research Data Shared Service - a Samvera case study
 
Building a national Data Repository Data Modelling
Building a national Data Repository Data ModellingBuilding a national Data Repository Data Modelling
Building a national Data Repository Data Modelling
 
Building a national Data Repository System Integration Architecture Overview
Building a national Data Repository System Integration Architecture OverviewBuilding a national Data Repository System Integration Architecture Overview
Building a national Data Repository System Integration Architecture Overview
 
Building a National Data Service Open Repositories 2018
Building a National Data Service Open Repositories 2018Building a National Data Service Open Repositories 2018
Building a National Data Service Open Repositories 2018
 
Research Data Toolkit
Research Data ToolkitResearch Data Toolkit
Research Data Toolkit
 
Pre jisc datachampday_260318
Pre jisc datachampday_260318Pre jisc datachampday_260318
Pre jisc datachampday_260318
 
Stories from the Field: Data are Messy and that's (kind of) ok
Stories from the Field: Data are Messy and that's (kind of) okStories from the Field: Data are Messy and that's (kind of) ok
Stories from the Field: Data are Messy and that's (kind of) ok
 
Fair data - dinkum research - by Andy Turner
Fair data -  dinkum research - by Andy TurnerFair data -  dinkum research - by Andy Turner
Fair data - dinkum research - by Andy Turner
 
2018 03 codata - making the case
2018 03 codata - making the case2018 03 codata - making the case
2018 03 codata - making the case
 
Research Data Shared Service update at DPC
Research Data Shared Service update at DPCResearch Data Shared Service update at DPC
Research Data Shared Service update at DPC
 
Research Data Shared Service Webinar #1
Research Data Shared Service Webinar #1Research Data Shared Service Webinar #1
Research Data Shared Service Webinar #1
 
Managing data behind creative masterpieces -RCM
Managing data behind creative masterpieces -RCMManaging data behind creative masterpieces -RCM
Managing data behind creative masterpieces -RCM
 
Managing data behind creative masterpieces
Managing data behind creative masterpiecesManaging data behind creative masterpieces
Managing data behind creative masterpieces
 
Lightning Talks - Intro
Lightning Talks - IntroLightning Talks - Intro
Lightning Talks - Intro
 
Lightning Talk - Andrew MacLellan
Lightning Talk - Andrew MacLellanLightning Talk - Andrew MacLellan
Lightning Talk - Andrew MacLellan
 
Lightning Talk - Nick Sheppard
Lightning Talk - Nick SheppardLightning Talk - Nick Sheppard
Lightning Talk - Nick Sheppard
 
Lightning talk - Adam Harwood
Lightning talk - Adam HarwoodLightning talk - Adam Harwood
Lightning talk - Adam Harwood
 

From Box to Hydra via Archivematica

  • 1. From Box to Hydra via Archivematica Turning proof of concept into reality
  • 2. Background • University of Hull and University of York working on a Research Data Spring project • Filling the digital preservation gap, 2015-16 • https://www.york.ac.uk/borthwick/projects/archivematica/ • Dual use cases for the University of Hull • Digital preservation of archival materials • Management and preservation of research data
  • 3. Systems background • Box • Institutional subscription from 2015 • Supported and managed personal cloud storage service • Archivematica • No experience prior to the project, but had watched its development over a period of years • Particularly liked the combination of microservices that can be used flexibly according to use case
  • 4. Repository • Hydra digital repository – http://hydra.hull.ac.uk • Implemented 2012 based on previous Fedora repository • Designed to hold any structured digital collection (within reason!) to meet University’s needs • NB ** Hydra is now Samvera ** • Community is refreshing and re-launching for the next decade • Watch this space – http://samvera.org • New website and logo coming shortly
  • 5. Questions • How can we enable a preservation workflow with the systems environment available to us? • How can we facilitate pathways to preserving archival materials and research data alongside each other? • What is required to bring these different components together to best effect?
  • 6. Ingest to the system, either direct or via ingest folder (Box) Archivematica captures content and processes it through microservices Archivematica outputs AIP for storage and DIP for repository DIP processor unpacks DIPs and creates repository objects Repository manages access to objects
  • 7. Project focus • User assembles files and simple descriptive file(s) in Box folder. Shares the folder with Archivematica • System checks folder contents and if OK creates a bag (BagIt standard) for each object which is passed to Archivematica • Archivematica processes the bag to create an AIP which goes to a preservation store… • …and also a DIP which is passed to the DIP processor • DIP processor creates Hydra objects from the DIP contents and injects them into the repository QA queue… • …matched to the AIP by UUID
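The final step on this slide, matching repository objects to their AIP by UUID, can be sketched as follows. This is an illustration only, not the project's code: it assumes that each package name carries its UUID somewhere in the name, and the function names `package_uuid` and `match_dips_to_aips` are hypothetical.

```python
import re
from typing import Dict, Iterable, Optional

# Canonical 8-4-4-4-12 hexadecimal UUID pattern.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def package_uuid(name: str) -> Optional[str]:
    """Return the first UUID found in a package name (lower-cased), or None."""
    match = UUID_RE.search(name)
    return match.group(0).lower() if match else None


def match_dips_to_aips(
    dip_names: Iterable[str], aip_names: Iterable[str]
) -> Dict[str, Optional[str]]:
    """Map each DIP name to the AIP name sharing its UUID (None if no match)."""
    aips_by_uuid = {}
    for aip in aip_names:
        uuid = package_uuid(aip)
        if uuid is not None:
            aips_by_uuid[uuid] = aip
    # A DIP without a UUID looks up None, which is never a key, so it maps to None.
    return {dip: aips_by_uuid.get(package_uuid(dip)) for dip in dip_names}
```

Keying the lookup on the UUID rather than on the package name means a renamed or re-packaged AIP would still be found, provided the UUID is preserved.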
  • 8. Joining up the dots • The joins between the three components were: • A ‘Box-watcher’ – users share their data with a nominated Box user account for the Archivematica system. This account watches for folders shared with it, automatically creates a BagIt bag of the files found and transfers it to Archivematica for processing • A ‘DIP processor’ – this takes the BagIt DIP from Archivematica, breaks it open and uses the information within to create repository objects • These tools were wrapped into a single gem, hullsync • https://github.com/uohull/hullsync
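To make the bag-creation step concrete, here is a minimal sketch of packaging a shared folder as a BagIt bag using only the Python standard library. It is not how hullsync (a Ruby gem) does it; the function name `make_bag` and the choice of SHA-256 are assumptions made for illustration. The layout follows the BagIt specification: a `bagit.txt` declaration, a `data/` payload directory and a payload manifest of checksums.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy the files in source_dir into a minimal BagIt bag at bag_dir."""
    data_dir = bag_dir / "data"
    # copytree creates bag_dir and data/ and fails if data/ already exists.
    shutil.copytree(source_dir, data_dir)

    # Bag declaration required by the BagIt specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path relative to bag>" line per file.
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )
    return bag_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp) / "shared"
        shared.mkdir()
        (shared / "readings.csv").write_text("t,value\n0,1.5\n", encoding="utf-8")
        bag = make_bag(shared, Path(tmp) / "bag")
        print((bag / "manifest-sha256.txt").read_text(encoding="utf-8"))
```

The manifest is what lets the receiving system verify that nothing changed in transit, which is the reason for bagging the files before transfer rather than copying them directly.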
  • 9. Deposit options • Depositors have several options: • A folder containing multiple data files and one descriptive file → a single AIP and a single repository object with (optionally) one or more surrogate files for download (so can be a “metadata-only” record) • A folder containing multiple files and a CSV file (one row per file) → multiple AIPs with multiple repository objects, each with (optionally) a surrogate for download • A folder containing the top-level folder of a structure → a zipped structure in a single AIP and a single repository object (optionally) containing the zipped file for download
  • 10. In detail – option 1 • A folder containing multiple data files and one descriptive file → a single AIP and a single repository object with (optionally) one or more surrogate files for download (so can be a “metadata-only” record) • Data files are associated with a .txt descriptive file providing associated metadata • Descriptive file can be used to determine access permissions and content model • Descriptive metadata can be provided using Dublin Core • A README.txt can also be submitted to inform repository staff of appropriate actions
  • 11. In detail – option 2 • A folder containing multiple files and a CSV file (one row per file) → multiple AIPs with multiple repository objects, each with (optionally) a surrogate for download • Use a .csv file instead of a .txt file for the descriptive information • Use column headings to cover the same fields as in option 1 • Can associate the same or different metadata with each object • Can create simple or compound objects
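The one-row-per-file idea in option 2 can be sketched as follows. The slides do not give the actual column headings, so `filename`, `title` and `creator` below are hypothetical, as is the function name `rows_to_objects`; the sketch only shows how each CSV row could become one object description with its own metadata.

```python
import csv
import io
from typing import Dict, List


def rows_to_objects(csv_text: str, filename_column: str = "filename") -> List[Dict]:
    """Turn each CSV row into one object description: a file plus its metadata."""
    reader = csv.DictReader(io.StringIO(csv_text))
    objects = []
    for row in reader:
        filename = row.pop(filename_column).strip()
        # Keep only columns that actually hold a value for this row.
        metadata = {
            key: value.strip()
            for key, value in row.items()
            if isinstance(value, str) and value.strip()
        }
        objects.append({"file": filename, "metadata": metadata})
    return objects


if __name__ == "__main__":
    sample = (
        "filename,title,creator\n"
        "survey.csv,Survey results,A. Researcher\n"
        "notes.txt,Field notes,\n"
    )
    for obj in rows_to_objects(sample):
        print(obj)
```

Because every row is handled independently, the same or different metadata can be attached to each object simply by what the depositor types into that row.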
  • 12. In detail – option 3 • A folder containing the top-level folder of a structure → a zipped structure in a single AIP and a single repository object (optionally) containing the zipped file for download • Aim is to allow the submission of a folder or nested folders of data, replicating how the files are organised • Files are unpacked by Archivematica, analysed, and then re-zipped for submission to the repository
  • 13. Lessons learned • Error handling needs attention when turning the p-o-c into production • But the testing highlighted a lot of the errors that would need handling • A key element when joining systems together • Normalisation of filetypes requires additional consideration • E.g., how to deal with TIFF files converted to JPG • The zipping and unzipping workflows require further attention to ensure success for this option
  • 14. Next steps • Take learning and tools from the Research Data Spring project and use these as the basis for development of services • Two use cases • Research data storage and management service development • City of Culture digital archive • Understanding Archivematica pipelines and options better – Perpetua test! • Focus on improving proof-of-concept and developing additional functionality
  • 15.
  • 16. Research data storage and management • Joint Library and ICTD project to discover and understand research data storage and management needs amongst academic staff • Open workshops • Data interviews • Capture and processing of research data as part of local provision, alongside advice and guidance on options outside the institution
  • 17. City of Culture digital archive • Hull2017 – City of Culture • Events throughout the year • Four data elements • Business archive • Creative archive • Participatory archive • Research and evaluation archive • Applying the same technology environment to manage ingest and delivery
  • 18. Key issues going forward • What are the differences in pipeline processing in Archivematica between research data and archival materials? • Dealing with unusual file formats – a key learning point from the RDS project • Scaling up to meet heavier data demands • Being realistic about what we can’t use this environment for and need alternative approaches, e.g., Big Data
  • 19. To conclude • Combining components has its issues, but it has been better to exploit systems that do certain parts of the workflow well and turn them into more than the sum of their parts • Data is not simple • We need flexibility in how we look to manage it • We need engagement with researchers to understand it • Turning an idea into production needs careful planning • Scope for community exchange or training on how to do this?
  • 20. Thank you c.awre@hull.ac.uk (And many thanks to the University of York and my colleagues Richard Green and Simon Wilson, plus Cottage Labs LLC for their work on this)