Luis Faria, lfaria@keep.pt
KEEP SOLUTIONS, www.keep-solutions.com
Alan Akbik, Barbara Sierman, Marcel Ras, Miguel Ferreira, José Carlos Ramalho
iPRES 2013
Lisbon, September 2, 2013
Automatic Preservation Watch
Using Information Extraction on the Web
Why do we need monitoring?
Factors that affect a repository, as risks or opportunities:
• Format obsolescence
• Emerging technology
• Consumer trends
• Producer trends
• New standards
• Organisation mission
• Organisation policies
• Bit rot
• Resource capability
• System availability
• Security breach
• Economic limitations
• Social and political factors
Survey on: Risk Assessment
[Pie chart. Categories: "Yes, but manual and ad hoc"; "None". Shares shown: 60% and 40%.]
Scout: a preservation watch system
Monitors aspects of the world to detect preservation risks and opportunities
This work was partially supported by the SCAPE Project.
The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
Information Sources
• Format registries & software catalogues
• Digital repositories & web archives
• Organizational objectives
• Experiments
• Simulation
• Human knowledge
Currently supported information sources
• PRONOM
• Repository content and events
• Web archive content
• Web archive renderability experiments
• SCAPE Policy model
Define triggers
• Notify me when there are tools that can render the format X.
• Simple query with templates
Receive notifications
• Email
• HTTP Push API
“There are tools that can render format X.”
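The trigger-and-notify flow above can be sketched roughly as follows. This is an illustration only and not Scout's actual API: the `Tool` record, the `Trigger` class, the `tools_rendering` query template and the in-memory knowledge base are all hypothetical names, and a plain list stands in for the email or HTTP push channel.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Tool:
    """Hypothetical knowledge-base record: a tool and the formats it renders."""
    name: str
    renders: frozenset


@dataclass
class Trigger:
    """Hypothetical trigger: a question, a condition, and a notification channel."""
    question: str
    condition: Callable[[Iterable[Tool]], List[Tool]]
    notify: Callable[[str], None]

    def evaluate(self, knowledge_base: Iterable[Tool]) -> bool:
        """Run the condition and send one notification if anything matches."""
        matches = self.condition(knowledge_base)
        if matches:
            names = ", ".join(tool.name for tool in matches)
            self.notify(f"There are tools that can render format X: {names}")
        return bool(matches)


def tools_rendering(fmt: str) -> Callable[[Iterable[Tool]], List[Tool]]:
    """Query template: fill in the format to get a concrete condition."""
    return lambda kb: [tool for tool in kb if fmt in tool.renders]


# Stand-in for the email or HTTP push channel: collect messages in a list.
outbox: List[str] = []
trigger = Trigger(
    question="Notify me when there are tools that can render the format X.",
    condition=tools_rendering("X"),
    notify=outbox.append,
)

# First evaluation: no tool renders X yet, so nothing is sent.
fired_before = trigger.evaluate([Tool("viewer-a", frozenset({"Y"}))])

# Second evaluation: a new tool appears in the knowledge base.
fired_after = trigger.evaluate([
    Tool("viewer-a", frozenset({"Y"})),
    Tool("viewer-b", frozenset({"X", "Y"})),
])
```

Separating the condition from the notification channel means the same trigger could, in principle, be delivered by email or pushed over HTTP without changing the query.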
Automatic Watch Limitations
Machine readable data
• Explicitly and formally specified information
• Controlled vocabulary
• Ontology
• All instances use same structure and set of values
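As a small illustration of why a shared structure and a controlled vocabulary matter for automation, the sketch below checks records against both. The field names and vocabulary values are invented for the example and are not taken from any real registry.

```python
# Hypothetical record structure and controlled vocabulary (assumptions).
REQUIRED_FIELDS = {"format", "tool", "capability"}
CAPABILITIES = {"render", "migrate", "identify"}


def validation_errors(record: dict) -> list:
    """Return the reasons a record cannot be processed automatically."""
    errors = []
    # Every instance must use the same structure.
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        errors.append("missing fields: " + ", ".join(sorted(missing)))
    # Values must come from the controlled vocabulary, not free text.
    capability = record.get("capability")
    if capability is not None and capability not in CAPABILITIES:
        errors.append(f"capability not in vocabulary: {capability!r}")
    return errors


good = {"format": "X", "tool": "viewer-b", "capability": "render"}
free_text = {"format": "X", "tool": "viewer-b", "capability": "can open it"}
incomplete = {"format": "X"}
```

A record such as `free_text` is perfectly readable for a person, but a watch system cannot match it against a trigger without further interpretation.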
Case study: e-Depot coverage
[Chart: publishers and titles per publisher (scale 0 to 600) against % of journal titles (40% to 100%). Callout: 97% of publishers, 1-10 titles.]
e-journal coverage questions
• Which publisher provides which journal titles
• Publisher changes:
  • Ceases to provide journal
  • Transfers journal to other publisher(s)
  • Publishers merge
• Journal changes:
  • Name changes
  • ISSN changes
  • Ceased to exist
Where is this information?
“In 1991, two years before the merger with Reed, Elsevier acquired Pergamon Press in the UK.”
“The Asia-Europe Foundation (ASEF) sold the Asia Europe Journal and transferred the copyright to its long-time partner Springer.”
“Acta Chirurgica Iugoslavica is available free of charge as an Open Access journal on the Internet.”
On the publisher's website! Not machine readable!
Information Extraction
• Extract structured information from unstructured data
• Pattern-based information extraction, e.g. “[X] acquired [Y]”
• Some training and supervision may be needed
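The “[X] acquired [Y]” pattern can be sketched as a regular expression applied to one of the publisher sentences quoted earlier. This is a minimal illustration, not the system used in the experiment: the `ENTITY` heuristic (a run of capitalised words) is an assumption standing in for proper NLP pre-processing, and `compile_pattern` and `extract` are hypothetical helper names.

```python
import re

# Naive stand-in for entity detection: a run of capitalised words.
# This heuristic is an assumption made for the example.
ENTITY = r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*"


def compile_pattern(template: str) -> re.Pattern:
    """Turn a template such as '[X] acquired [Y]' into a regular expression."""
    # Splitting on the placeholders keeps their names at the odd indexes.
    parts = re.split(r"\[(X|Y)\]", template)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 1:   # placeholder name, "X" or "Y"
            pieces.append(f"(?P<{part}>{ENTITY})")
        else:                # literal text between placeholders
            pieces.append(re.escape(part))
    return re.compile(r"\b" + "".join(pieces))


def extract(template: str, sentence: str):
    """Return the (X, Y) pair for the first match, or None if there is none."""
    match = compile_pattern(template).search(sentence)
    return (match.group("X"), match.group("Y")) if match else None


sentence = (
    "In 1991, two years before the merger with Reed, "
    "Elsevier acquired Pergamon Press in the UK."
)
pair = extract("[X] acquired [Y]", sentence)  # ("Elsevier", "Pergamon Press")
```

The capitalised-word heuristic is fragile: it depends on punctuation and casing to find where a name starts and ends, which is one reason some training and supervision may be needed.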
Experiment
1. Data acquisition and pre-processing
2. Relation discovery
3. Information extraction
4. Validation of results
1. Data acquisition and pre-processing
• Focused crawler with seed words (12,000 entries)
  • Publisher names
  • Journal titles
  ➡ 500,000 Web pages
• Pre-process with NLP tools
  ➡ 18 million sentences
  ➡ 8 GB
2. Relation discovery
Prominent pattern                      Rank
[X] journal of [Y]                     1
[X] published by [Y]                   2
[X] journal on [Y]                     3
[X] journal published by [Y]           4
[X] available as [Y] journal           5
PubMed [X] [Y]                         9
[X] science proceedings of [Y]         25
[X] subscription available to [Y]      30
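One simple way to arrive at a ranked list like the one above is to count the text found between known entity pairs and sort by frequency. The sketch below shows that idea only; it is not necessarily the method used in the experiment. The journal names, publisher names and sentences are invented placeholders, and `candidate_patterns` and `rank_patterns` are hypothetical helpers.

```python
from collections import Counter

# Invented seed entities and toy corpus (placeholders, not real data).
JOURNALS = ["Journal Alpha", "Journal Beta"]
PUBLISHERS = ["Publisher One", "Publisher Two"]
SENTENCES = [
    "Journal Alpha published by Publisher One.",
    "Journal Beta published by Publisher Two.",
    "Journal Beta is a journal of Publisher One.",
]


def candidate_patterns(sentence, journals, publishers):
    """Yield a '[X] ... [Y]' template for each journal followed by a publisher."""
    for journal in journals:
        start = sentence.find(journal)
        if start < 0:
            continue
        after_journal = start + len(journal)
        for publisher in publishers:
            end = sentence.find(publisher, after_journal)
            if end >= 0:
                between = sentence[after_journal:end].strip()
                yield f"[X] {between} [Y]"


def rank_patterns(sentences, journals, publishers):
    """Rank candidate patterns by how often they occur in the corpus."""
    counts = Counter(
        pattern
        for sentence in sentences
        for pattern in candidate_patterns(sentence, journals, publishers)
    )
    return [pattern for pattern, _ in counts.most_common()]


ranking = rank_patterns(SENTENCES, JOURNALS, PUBLISHERS)
```

On a real crawl the frequent patterns would also include noise, so the ranked list still needs a person to pick out the patterns that express the relation of interest.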
3. Information extraction
• 2,000 journal titles
• 500 journal-publisher attributions
4. Validation of results
[Pie chart: Journal titles in eDepot. Shares shown: 4%, 10%, 86%.]
[Pie chart: Title-publisher in the Keepers registry. Shares shown: 15%, 50%, 35%.]
Legend (both charts): Should add, Existing, False-positives
False-positives
• Detecting boundaries of titles and publisher names, e.g. “European Journal of Nuclear Medicine and Molecular Imaging”
• Use of abbreviations in titles and publisher names, e.g. IAAE - “International Association of Agricultural Economists”
• Technical problems like encoding, e.g. “├ó╦å┼buda University”
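The encoding problem can be reproduced by decoding UTF-8 bytes with the wrong code page. The sketch below is an illustration under assumptions: `cp437` is chosen only as an example of a wrong codec, and the actual chain of mis-decodings behind the garbled name above is not known.

```python
def misdecode(text: str, wrong_codec: str = "cp437") -> str:
    """Encode as UTF-8, then decode with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


def repair(garbled: str, wrong_codec: str = "cp437") -> str:
    """Undo one round of mis-decoding when the wrong codec is known."""
    return garbled.encode(wrong_codec).decode("utf-8")


original = "Óbuda University"
garbled = misdecode(original)   # the accented first letter becomes two symbols
restored = repair(garbled)
```

Only the non-ASCII characters are damaged, so a garbled name still looks similar enough to be extracted, yet it no longer matches the registry entry and is counted as a false positive.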
Conclusions
• We need data to support digital preservation
• Explicitly and formally specified, for automation
• Registries tend to be incomplete and outdated
• Information Extraction technologies can help
• Still, some supervision may be needed
Send us your use cases!
Alan Akbik, alan.akbik@tu-berlin.de
Luis Faria, lfaria@keep.pt
Preservation Watch: What risks to monitor?
Information Extraction: What to extract from the web?
Thank you, questions?
• Scout - a preservation watch system
• Site: http://openplanets.github.io/scout/
• Demo: http://scout.scape.keep.pt
• SCAPE Planning and Watch suite iPRES poster
• http://bit.ly/scape-pw
• SCAPE
• http://www.scape-project.eu
Automatic Preservation Watch

  • 1. Luis  Faria  lfaria@keep.pt KEEP  SOLUTIONS  www.keep-­‐solu:ons.com Alan  Akbik,  Barbara  Sierman,  Marcel  Ras,  Miguel  Ferreira,  José  Carlos  Ramalho iPRES  2013 Lisbon,  September  2,  2013 Automa0c  Preserva0on  Watch Using  Informa-on  Extrac-on  on  the  Web
  • 2. Repository Format obsolescence Emerging technology Consumer trends New standards Organisation mission Bit rot Resource capability System availability Security breach Economical limitations Social and political factors Producer trends Organisation policies 2 Why do we need monitoring?
  • 3. Repository Format obsolescence Emerging technology Consumer trends New standards Organisation mission Bit rot Resource capability System availability Security breach Economical limitations Social and political factors Producer trends Organisation policies 3 Why do we need monitoring? Risks Opportunities
  • 4. 60% 40% Yes but manual and adhoc None Risk Assessment Survey on: 4
  • 5. Scout:  a  preserva-on  watch  system This  work  was  par,ally  supported  by  the  SCAPE  Project. The  SCAPE  project  is  co-­‐funded  by  the  European  Union  under  FP7  ICT-­‐2009.4.1  (Grant  Agreement  number  270137). Monitors  aspects  of  the  world  to  detect  preserva:on  risks  and  opportuni:es 5
  • 6. Information Sources: format registries & software catalogues; digital repositories & web archives; organizational objectives; experiments; simulation; human knowledge.
  • 7. Currently supported information sources: PRONOM; repository content and events; web archive content; web archive renderability experiments; SCAPE Policy model.
  • 8. Define triggers: Notify me when there are tools that can render the format X.
  • 10. Receive notifications (Email, HTTP Push API): There are tools that can render format X.
  • 11. Automatic Watch Limitations. Machine readable data: explicitly and formally specified information; controlled vocabulary; ontology; all instances use the same structure and set of values.
  • 12. Case study: e-Depot coverage. (Chart) Titles per publisher; axis labels: Publishers, % of journal titles. Annotation: 97% publishers 1-10 titles.
  • 13. e-journal coverage questions: Which publisher provides which journal titles? Publisher changes: ceases to provide journal; transfers journal to other publisher(s); publishers merge. Journal changes: name changes; ISSN changes; ceased to exist.
  • 14. Where is this information? “In 1991, two years before the merger with Reed, Elsevier acquired Pergamon Press in the UK.” “The Asia-Europe Foundation (ASEF) sold the Asia Europe Journal and transferred the copyright to its long-time partner Springer.” “Acta Chirurgica Iugoslavica is available free of charge as an Open Access journal on the Internet.” In the publisher website!
  • 15. Where is this information? “In 1991, two years before the merger with Reed, Elsevier acquired Pergamon Press in the UK.” “The Asia-Europe Foundation (ASEF) sold the Asia Europe Journal and transferred the copyright to its long-time partner Springer.” “Acta Chirurgica Iugoslavica is available free of charge as an Open Access journal on the Internet.” In the publisher website! Not machine readable!
  • 16. Information Extraction: Extract structured information from unstructured data. Pattern-based information extraction, e.g. “[X] acquired [Y]”. Some training and supervision may be needed.
  • 17. Experiment: 1. Data acquisition and pre-processing; 2. Relation discovery; 3. Information extraction; 4. Validation of results.
  • 18. 1. Data acquisition and pre-processing: Focused crawler with seed words (12,000 entries: publisher names, journal titles) ➡ 500,000 Web pages. Pre-process with NLP tools ➡ 18 million sentences ➡ 8 GB.
  • 19. 2. Relation discovery. Prominent patterns (rank): [X] journal of [Y] (1); [X] published by [Y] (2); [X] journal on [Y] (3); [X] journal published by [Y] (4); [X] available as [Y] journal (5); PubMed [X] [Y] (9); [X] science proceedings of [Y] (25); [X] subscription available to [Y] (30).
  • 21. 3. Information extraction: 2,000 journal titles; 500 journal-publisher attributions.
  • 22. 4. Validation of results. (Charts) Journal titles in eDepot: 4%, 10%, 86%. Title-publisher in the Keepers registry: 15%, 50%, 35%. Legend: Should add, Existing, False-positives.
  • 23. False-positives: detecting boundaries of titles and publisher names; using abbreviations on titles and publisher names; technical problems like encoding. Examples: “European Journal of Nuclear Medicine and Molecular Imaging”; IAAE - “International Association of Agricultural Economists”; “├ó╦å┼buda University”.
  • 24. Conclusions: We need data to support digital preservation, explicitly and formally specified for automation. Registries tend to be incomplete and outdated. Information Extraction technologies can help. Still, some supervision may be needed.
  • 25. Send us your use cases! Preservation Watch: What risks to monitor? Information Extraction: What to extract from the web? Contacts: Alan Akbik (alan.akbik@tu-berlin.de), Luis Faria (lfaria@keep.pt).
  • 26. Thank you, questions? Scout - a preservation watch system. Site: http://openplanets.github.io/scout/ Demo: http://scout.scape.keep.pt SCAPE Planning and Watch suite iPRES poster: http://bit.ly/scape-pw SCAPE: http://www.scape-project.eu
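
The pattern-based extraction idea from slide 16 ("[X] acquired [Y]") can be illustrated with a minimal Python sketch. This is not the system used in the experiment: the helper names `compile_pattern` and `extract` are hypothetical, and entities are crudely approximated as runs of capitalized words, an assumption made only for this illustration. The input sentences are the three quotes shown on slide 14.

```python
import re

# Assumption for this sketch only: an entity is a run of capitalized words.
# A real extractor would use NLP pre-processing instead of this heuristic.
ENTITY = r"\b[A-Z][\w-]*(?: [A-Z][\w-]*)*"


def compile_pattern(pattern):
    """Turn a template such as "[X] acquired [Y]" into a compiled regex."""
    escaped = re.escape(pattern)
    escaped = escaped.replace(re.escape("[X]"), f"(?P<x>{ENTITY})")
    escaped = escaped.replace(re.escape("[Y]"), f"(?P<y>{ENTITY})")
    return re.compile(escaped)


def extract(sentences, patterns):
    """Return (X, pattern, Y) triples for every pattern match in the sentences."""
    compiled = [(p, compile_pattern(p)) for p in patterns]
    triples = []
    for sentence in sentences:
        for pattern, regex in compiled:
            for match in regex.finditer(sentence):
                triples.append((match.group("x"), pattern, match.group("y")))
    return triples


sentences = [
    "In 1991, two years before the merger with Reed, Elsevier acquired "
    "Pergamon Press in the UK.",
    "The Asia-Europe Foundation (ASEF) sold the Asia Europe Journal and "
    "transferred the copyright to its long-time partner Springer.",
    "Acta Chirurgica Iugoslavica is available free of charge as an Open "
    "Access journal on the Internet.",
]

triples = extract(sentences, ["[X] acquired [Y]"])
for x, pattern, y in triples:
    print(f"{x} | {pattern} | {y}")
```

Only the first sentence matches, because a single pattern covers a single way of phrasing a relation. The other two sentences would need further patterns, which is in line with the slides' point that some training and supervision may be needed.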