At the iPRES 2013 conference in Lisbon, Portugal, in September 2013, Luís Faria of KEEP SOLUTIONS LDA presented SCAPE work on monitoring digital repositories and Scout, the tool developed for that purpose. Scout is a web-based service that helps content holders monitor their digital repositories and provides an ontological knowledge base for compiling the information needed to detect preservation risks and opportunities.
Automatic Preservation Watch
1. Automatic Preservation Watch Using Information Extraction on the Web
Luis Faria, lfaria@keep.pt
KEEP SOLUTIONS, www.keep-solutions.com
Alan Akbik, Barbara Sierman, Marcel Ras, Miguel Ferreira, José Carlos Ramalho
iPRES 2013, Lisbon, September 2, 2013
2. Why do we need monitoring?
Repository
Format obsolescence
Emerging technology
Consumer trends
New standards
Organisation mission
Bit rot
Resource capability
System availability
Security breach
Economic limitations
Social and political factors
Producer trends
Organisation policies
3. Why do we need monitoring?
Risks
Opportunities
5. Scout: a preservation watch system
Monitors aspects of the world to detect preservation risks and opportunities
This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
6. Information Sources
• Format registries & software catalogues
• Digital repositories & web archives
• Organizational objectives
• Experiments
• Simulation
• Human knowledge
7. Currently supported information sources
• PRONOM
• Repository content and events
• Web archive content
• Web archive renderability experiments
• SCAPE Policy model
11. Automatic Watch Limitations
Machine-readable data
• Explicitly and formally specified information
• Controlled vocabulary
• Ontology
• All instances use same structure and set of values
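As a minimal sketch of what "machine readable" means here, the Python below models a record whose fields are fixed and whose values come from a controlled vocabulary, so every instance uses the same structure and set of values. The names `RiskLevel`, `FormatRecord`, and `parse_record`, and the vocabulary values themselves, are hypothetical illustrations and not part of Scout or any registry.

```python
from dataclasses import dataclass
from enum import Enum


class RiskLevel(Enum):
    """Hypothetical controlled vocabulary: only these values are allowed."""
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


@dataclass(frozen=True)
class FormatRecord:
    """Every record has the same fields, so software can compare records."""
    format_name: str
    risk: RiskLevel


def parse_record(raw: dict) -> FormatRecord:
    """Build a record, rejecting values outside the controlled vocabulary."""
    # RiskLevel(...) raises ValueError for any value not in the vocabulary.
    return FormatRecord(format_name=raw["format_name"], risk=RiskLevel(raw["risk"]))
```

A free-text value such as "probably fine" is rejected with a `ValueError`, which is the property that makes automatic comparison across instances possible.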
12. Case study: e-Depot coverage
[Chart: publishers and titles per publisher (0 to 600) against % of journal titles (40% to 100%)]
97% of publishers: 1-10 titles
13. e-journal coverage questions
• Which publisher provides which journal titles
• Publisher changes:
  • Ceases to provide journal
  • Transfers journal to other publisher(s)
  • Publishers merge
• Journal changes:
  • Name changes
  • ISSN changes
  • Ceased to exist
14. Where is this information?
“In 1991, two years before the merger with Reed, Elsevier
acquired Pergamon Press in the UK.”
“The Asia-Europe Foundation (ASEF) sold the Asia Europe
Journal and transferred the copyright to its long-time partner
Springer.”
“Acta Chirurgica Iugoslavica is available free of charge as an
Open Access journal on the Internet.”
On the publisher's website!
15. Where is this information?
On the publisher's website!
Not machine readable!
16. Information Extraction
• Extract structured information from unstructured data
• Pattern-based information extraction
• Some training and supervision may be needed
“[X] acquired [Y]”
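To illustrate what a pattern such as "[X] acquired [Y]" could look like in practice, here is a minimal Python sketch using a regular expression. It assumes that X and Y are runs of capitalised words, which is a simplification; the names `NAME`, `ACQUIRED`, and `extract_acquisitions` are hypothetical and this is not the extraction system used in the experiment.

```python
import re

# Assumption: an entity name is one or more capitalised words.
NAME = r"[A-Z][\w&-]*(?:\s+[A-Z][\w&-]*)*"
# "[X] acquired [Y]" with X and Y captured as named groups.
ACQUIRED = re.compile(rf"(?P<x>{NAME})\s+acquired\s+(?P<y>{NAME})")


def extract_acquisitions(sentence: str) -> list[tuple[str, str]]:
    """Return (acquirer, acquired) pairs found in one sentence."""
    return [(m.group("x"), m.group("y")) for m in ACQUIRED.finditer(sentence)]
```

Applied to the Elsevier sentence quoted earlier, this returns the pair ("Elsevier", "Pergamon Press"). A capitalisation heuristic like this is exactly where name boundaries can go wrong, so real systems need more robust entity detection.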
17. Experiment
1. Data acquisition and pre-processing
2. Relation discovery
3. Information extraction
4. Validation of results
18. 1. Data acquisition and pre-processing
• Focused crawler with seed words (12,000 entries)
  • Publisher names
  • Journal titles
➡ 500,000 Web pages
• Pre-process with NLP tools
➡ 18 million sentences
➡ 8 GB
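The pre-processing step can be pictured with the rough Python sketch below: split page text into sentences and keep those that mention a seed word. The regular-expression splitter is a naive stand-in for the NLP tools mentioned on the slide (which are not named there), and `split_sentences` and `sentences_with_seeds` are hypothetical helper names.

```python
import re

# Naive stand-in for an NLP sentence splitter:
# break on whitespace that follows ".", "!" or "?".
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split page text into sentences, dropping empty fragments."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def sentences_with_seeds(text: str, seeds: set[str]) -> list[str]:
    """Keep only sentences that mention at least one seed word (case-insensitive)."""
    lowered = {seed.lower() for seed in seeds}
    return [
        s for s in split_sentences(text)
        if any(seed in s.lower() for seed in lowered)
    ]
```

In this sketch the seed set would hold publisher names and journal titles; substring matching is deliberately simple and would over-match short seeds.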
19. 2. Relation discovery
Prominent pattern                     Rank
[X] journal of [Y]                    1
[X] published by [Y]                  2
[X] journal on [Y]                    3
[X] journal published by [Y]          4
[X] available as [Y] journal          5
PubMed [X] [Y]                        9
[X] science proceedings of [Y]        25
[X] subscription available to [Y]     30
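One simple way to arrive at a ranked list of patterns is to count how often each piece of connecting text appears between two known entities. The Python sketch below does that under strong assumptions: both entity mentions are already known, X precedes Y, and rank is plain frequency. The slides do not describe the ranking method actually used, and `pattern_for` and `rank_patterns` are hypothetical names.

```python
from collections import Counter
from typing import Optional


def pattern_for(sentence: str, x: str, y: str) -> Optional[str]:
    """Turn 'X ... Y' into '[X] ... [Y]'; None if X is not followed by Y."""
    i = sentence.find(x)
    if i < 0:
        return None
    j = sentence.find(y, i + len(x))
    if j < 0:
        return None
    # Words between the two mentions become the body of the pattern.
    middle = sentence[i + len(x):j].split()
    return " ".join(["[X]", *middle, "[Y]"])


def rank_patterns(examples: list[tuple[str, str, str]]) -> list[tuple[str, int]]:
    """examples are (sentence, x, y) triples; result is (pattern, rank), most frequent first."""
    counts: Counter = Counter()
    for sentence, x, y in examples:
        pattern = pattern_for(sentence, x, y)
        if pattern is not None:
            counts[pattern] += 1
    return [(p, rank) for rank, (p, _) in enumerate(counts.most_common(), start=1)]
```

A human can then review the top-ranked patterns and keep those that express the relation of interest, such as a journal being published by a publisher.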
21. 3. Information extraction
2,000 journal titles
500 journal-publisher attributions
22. 4. Validation of results
[Pie chart] Journal titles in eDepot: 4%, 10%, 86%
[Pie chart] Title-publisher in the Keepers registry: 15%, 50%, 35%
Legend: Should add, Existing, False-positives
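The three legend categories can be computed with simple set operations once a reviewer has marked the wrong extractions. The sketch below is an assumption about how such a validation could be organised, not a description of how the reported figures were produced; `validate` and its parameters are hypothetical names.

```python
def validate(extracted: set[str], registry: set[str], rejected: set[str]) -> dict[str, float]:
    """Percentage of extracted items per category.

    extracted: items produced by information extraction
    registry:  items already present in the reference registry
    rejected:  items a human reviewer marked as wrong (assumed manual input)
    """
    if not extracted:
        return {"existing": 0.0, "should_add": 0.0, "false_positives": 0.0}
    false_positives = extracted & rejected
    accepted = extracted - rejected
    existing = accepted & registry      # correct and already known
    should_add = accepted - registry    # correct but missing from the registry
    total = len(extracted)
    return {
        "existing": 100 * len(existing) / total,
        "should_add": 100 * len(should_add) / total,
        "false_positives": 100 * len(false_positives) / total,
    }
```

The "should add" share is the interesting one for a repository, since it points at titles or attributions the registry does not yet hold.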
23. False-positives
• Detecting boundaries of titles and publisher names
• Use of abbreviations in titles and publisher names
• Technical problems like encoding
“European Journal of Nuclear Medicine and Molecular Imaging”
IAAE - “International Association of Agricultural Economists”
“├ó╦å┼buda University”
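Encoding problems like the last example can often be flagged automatically before they reach validation. The Python sketch below marks a name as suspicious when it contains characters from a Unicode symbol category, which covers box-drawing glyphs like those above. This is a heuristic chosen for illustration (it would also flag a legitimate "+" or "$" in a name), and `looks_garbled` is a hypothetical helper.

```python
import unicodedata


def looks_garbled(name: str) -> bool:
    """True if the name contains symbol characters such as box-drawing glyphs."""
    # Unicode general categories starting with "S" are symbols (Sm, Sc, Sk, So).
    return any(unicodedata.category(ch).startswith("S") for ch in name)
```

Names flagged this way can be routed to manual review or re-decoded from the original page, while accented letters in correctly decoded names pass unflagged.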
24. Conclusions
• We need data to support digital preservation
• Explicitly and formally specified, for automation
• Registries tend to be incomplete and outdated
• Information extraction technologies can help
• Still, some supervision may be needed
25. Send us your use cases!
Alan Akbik
alan.akbik@tu-berlin.de
Luis Faria
lfaria@keep.pt
Preservation Watch
What risks to monitor?
Information Extraction
What to extract from the web?
26. Thank you, questions?
• Scout - a preservation watch system
• Site: http://openplanets.github.io/scout/
• Demo: http://scout.scape.keep.pt
• SCAPE Planning and Watch suite iPRES poster
• http://bit.ly/scape-pw
• SCAPE
• http://www.scape-project.eu