Stuart Kenny; Kathryn Cassidy - Experience with Ingestion of Large Collections at DRI

Presentation given by Stuart Kenny and Kathryn Cassidy, Software Engineers with the Digital Repository of Ireland, at Open Repositories 2016 in Dublin.


  1. Experience with Ingestion of Large Collections
     Stuart Kenny, Research IT, Trinity College Dublin
  2. My what a big collection you have!
     Stuart Kenny, Research IT, Trinity College Dublin
     The Fairy Tales of Charles Perrault. Illustrated by Harry Clarke. Intro. Thomas Bodkin. London: George G. Harrap, [1922]. Internet Archive version of a copy in the New York Public Library. Web. 25 December 2012.
  3. About DRI (https://repository.dri.ie/)
     • DRI is an interactive trusted digital repository for contemporary and historical, social and cultural data held by Irish institutions
     • RIA (lead), NUIM, TCD, DIT, NUIG, NCAD
     • Partners: academic, cultural, social, government
  4. Outline
     • What’s our problem?
     • Example collections
     • Ingest solutions
     • Current ingest process
     • Possible future process
  5. Ingesting Objects
     • Ingest form
       o Suitable for single objects/small collections
       o Flat hierarchies
       o Simple metadata standards
     • Multiple standards
       o e.g., MARC, EAD
       o XML upload
     • How to handle complex standards, many objects?
  6. Example Collection: Clarke Stained Glass
     • MODS metadata
     • 10,025 objects
     • 42 sub-collections
     • 20,047 files, 2.82 TB
     • Problems:
       o Large number of objects
       o Data transfer
  7. Example Collection: TCD Children’s Books
     • MARC metadata
     • 207,889 objects
     • 16 sub-collections
     • Problems:
       o Large number of objects
       o Very slow to ingest
       o Timeouts and errors
  8. Example Collection: Kilkenny Design Workshop
     • EAD metadata
     • 2,040 objects
     • 2,734 series/files
     • 2,231 files, 1.2 GB
     • Problems:
       o Very complex metadata standard
       o Hierarchical structure
  9. EAD, and why I don’t quite hate it as much as I did...
     • Single XML file upload
     • Structure encoded in metadata
     • URLs to files
     • But
       o One-shot ingest
       o How to edit/update?
       o Slow to ingest
       o Requires a lot of resources
  10. Sufia Batch Upload
      • Add multiple files
      • New work for each
      • Metadata for each work
      • How to handle multiple standards?
      • Different metadata for each work?
  11. Avalon Batch Ingest
      • Ingest package
        o Manifest file
        o Plus content files
      • Manifest file is spreadsheet
        o Metadata for items
        o Names of content files
      • Ingest package uploaded to Avalon DropBox
  12. Approach up to now
      • Command line client
        o Enter text commands at ‘command prompt’
      • Written in Ruby
      • Run locally by user
      • Metadata and asset files arranged in fixed directory structure
      • Client iterates over the directory, creating each object as a single ingest
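
The directory-driven approach on this slide could be sketched roughly as follows. This is only an illustration in Python (the actual client was written in Ruby, and its code is not shown in the slides). The layout it assumes, a `metadata/` folder of XML records with same-named asset files under `data/`, is hypothetical, as is the `ingest_object` callback that stands in for one repository ingest.

```python
from pathlib import Path
from typing import Callable, Iterator, List, Tuple


def find_objects(collection_dir: Path) -> Iterator[Tuple[Path, List[Path]]]:
    """Yield (metadata_file, asset_files) pairs from a fixed layout.

    Assumed (hypothetical) layout:
        collection_dir/metadata/<name>.xml
        collection_dir/data/<name>.*      (zero or more assets per object)
    """
    data_dir = collection_dir / "data"
    for metadata_file in sorted((collection_dir / "metadata").glob("*.xml")):
        assets: List[Path] = []
        if data_dir.is_dir():
            assets = sorted(data_dir.glob(metadata_file.stem + ".*"))
        yield metadata_file, assets


def ingest_collection(
    collection_dir: Path,
    ingest_object: Callable[[Path, List[Path]], None],
) -> List[Tuple[Path, Exception]]:
    """Ingest every object as its own request and collect the failures."""
    failures: List[Tuple[Path, Exception]] = []
    for metadata_file, assets in find_objects(collection_dir):
        try:
            ingest_object(metadata_file, assets)
        except Exception as error:  # keep going so one bad object does not stop the run
            failures.append((metadata_file, error))
    return failures
```

Because each object is a separate ingest, a run over a large collection makes one call per object, and any failure has to be tracked and retried individually.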
  13. Problems
      • Lack of user familiarity with command line
      • Multiple platform support
        o i.e., Windows
      • Difficulty of installing
      • Multiple single ingests
        o Slow
        o Error prone
      • Required lots of user support
      • In the end, most ingests were performed by the dev team
  14. Current Attempt
      • Web-based UI
      • Borrow heavily from Avalon approach
      • Upload metadata XML plus assets to online storage
      • Add manifest spreadsheet
        o Each row contains path to metadata
        o Paths to zero or more asset files
        o Paths relative to online storage directory
      • Backend processes manifest and ingests as background task
      • UI updates status
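
As a rough illustration of the manifest described on this slide, the sketch below reads a spreadsheet exported as CSV. The concrete format is an assumption, not DRI's: a header row, the metadata path in the first column, and any further non-empty cells treated as asset paths. The names `ManifestRow` and `read_manifest` are hypothetical.

```python
import csv
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import List, TextIO


@dataclass
class ManifestRow:
    metadata: PurePosixPath
    assets: List[PurePosixPath] = field(default_factory=list)


def _relative(cell: str) -> PurePosixPath:
    """Accept only paths that stay inside the online storage directory."""
    path = PurePosixPath(cell.strip())
    if path.is_absolute() or ".." in path.parts:
        raise ValueError(f"path must be relative to the storage directory: {cell!r}")
    return path


def read_manifest(handle: TextIO) -> List[ManifestRow]:
    """Parse the manifest: one metadata path plus zero or more asset paths per row."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row (assumed to be present)
    rows: List[ManifestRow] = []
    for row_number, cells in enumerate(reader, start=2):
        if not any(cell.strip() for cell in cells):
            continue  # ignore blank rows
        if not cells[0].strip():
            raise ValueError(f"row {row_number}: missing metadata path")
        assets = [_relative(cell) for cell in cells[1:] if cell.strip()]
        rows.append(ManifestRow(_relative(cells[0]), assets))
    return rows
```

Checking paths up front means a malformed manifest is rejected before any background work starts, rather than failing partway through an ingest.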
  15. Current Attempt (diagram)
      • Components: UI, Online Storage, Repository
      • Steps: Select manifest, Retrieve remote files, Ingest, Update status
  16. • Hydra BrowseEverything
        o Gem to access cloud storage
        o DropBox, Google Drive…
      • User uploads files
      • In UI selects collection and manifest to ingest
      • Everything handled server side in background
      • Can view status in UI
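
The idea of ingesting in the background while the UI shows status can be sketched as below. This is not DRI's implementation: the `IngestJob` class, the `Status` values, and the `ingest_item` callback are all hypothetical, and a plain thread is used only to keep the example self-contained (a real deployment would more likely hand the work to a persistent job queue).

```python
import threading
from enum import Enum
from typing import Callable, Dict, Iterable, Optional


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


class IngestJob:
    """Run ingests in a background thread and expose per-item status."""

    def __init__(self, items: Iterable[str], ingest_item: Callable[[str], None]) -> None:
        self._items = list(items)
        self._ingest_item = ingest_item
        self._lock = threading.Lock()
        self._status: Dict[str, Status] = {item: Status.PENDING for item in self._items}
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def wait(self, timeout: Optional[float] = None) -> None:
        self._thread.join(timeout)

    def status(self) -> Dict[str, Status]:
        """Return a snapshot that a UI could poll and display."""
        with self._lock:
            return dict(self._status)

    def _set(self, item: str, status: Status) -> None:
        with self._lock:
            self._status[item] = status

    def _run(self) -> None:
        for item in self._items:
            self._set(item, Status.RUNNING)
            try:
                self._ingest_item(item)
            except Exception:
                self._set(item, Status.FAILED)
            else:
                self._set(item, Status.COMPLETED)
```

A caller would create the job with one identifier per manifest row, call `start()`, and poll `status()` to refresh the display.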
  17. Outstanding Issues
      • Online storage
        o Dropbox-type storage has size limits
      • Creating a spreadsheet is less easy than creating a directory structure
      • Possible solutions
        o Provide online storage
        o Has to be per user
        o Generate required manifest from uploaded directory structure
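
The last possible solution, generating the manifest from an uploaded directory structure, might look something like the sketch below. The pairing rule is an assumption made for illustration: every `.xml` file is treated as a metadata record, and files in the same folder with the same base name are treated as its assets. The function name `write_manifest` and the header labels are hypothetical.

```python
import csv
from pathlib import Path
from typing import TextIO


def write_manifest(upload_dir: Path, out: TextIO) -> int:
    """Write one manifest row per metadata XML file; return the number of rows.

    Assumed (hypothetical) convention: assets sit next to their metadata file
    and share its base name, e.g. sub/a.xml with sub/a.tif.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["metadata", "assets"])
    count = 0
    for metadata_file in sorted(upload_dir.rglob("*.xml")):
        assets = sorted(
            path
            for path in metadata_file.parent.glob(metadata_file.stem + ".*")
            if path.is_file() and path != metadata_file
        )
        # Paths are written relative to the upload directory, as the manifest expects.
        row = [path.relative_to(upload_dir).as_posix() for path in [metadata_file, *assets]]
        writer.writerow(row)
        count += 1
    return count
```

With a step like this, users could keep arranging files in folders, as they did for the command line client, without writing the spreadsheet by hand.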
