This presentation originates from a webinar presented by Luís Faria. The webinar presents Scout and C3PO, two tools developed by the SCAPE project, and demonstrates how to identify preservation risks in your content while sharing your content profile information with others to open new opportunities.
Scout, the preservation watch system, centralizes all the necessary knowledge on the same platform, cross-referencing this knowledge to uncover all preservation risks. Scout automatically fetches information from several sources to populate its knowledge base. For example, Scout integrates with C3PO to get large-scale characterization profiles of content. Furthermore, Scout aims to be a knowledge exchange platform, to allow the community to bring together all the necessary information into the system. The sharing of information opens new opportunities for joining forces against common problems.
The webinar was held on 26 June 2014.
Tools for Uncovering Preservation Risks in Large Repositories
1. Luis Faria
lfaria@keep.pt
KEEP SOLUTIONS
www.keep-solutions.com
SCAPE webinar
July 26, 2014
Tools for uncovering preservation risks in your large repositories
2. Why do we need monitoring?
Repository
• Format obsolescence
• Emerging technology
• Consumer trends
• New standards
• Organisation mission
• Bit rot
• Resource capability
• System availability
• Security breach
• Economical limitations
• Social and political factors
• Producer trends
• Organisation policies
3. Why do we need monitoring?
Repository, with the same surrounding factors as on the previous slide and two added labels:
• Risks
• Opportunities
4. Preservation monitoring survey
181 valid participants
What descriptions fit your organization?
• Other: 5.41%
• Data intensive industry: 0.77%
• Non affiliated: 1.54%
• Big data science: 1.93%
• Digital preservation vendor: 2.32%
• Research funder: 2.70%
• Large enterprise: 2.70%
• Publisher or content producer: 5.02%
• Small or medium enterprise: 7.34%
• Local government institution: 9.27%
• National government institution: 15.83%
• Memory institution or content holder: 26.64%
• University: 28.57%
This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
5. Preservation monitoring survey
Risk | Importance (normalized mean) | Monitoring | Not monitoring | Uncertain or no answer
File corruption | 92% | 41% | 18% | 41%
Backup failure | 89% | 51% | 9% | 40%
Staff not enough or adequate | 78% | 41% | 18% | 41%
Software platform obsolescence | 77% | 40% | 13% | 46%
Hardware platform obsolescence | 76% | 44% | 12% | 44%
Lack of context information | 76% | 23% | 24% | 53%
Incorrect action results | 75% | 27% | 22% | 51%
Consumers misalignment | 74% | 17% | 25% | 58%
Outdated preservation plans | 69% | 28% | 25% | 47%
Producers misalignment | 68% | 25% | 19% | 55%
Content not aligned with policies | 64% | 30% | 23% | 46%
6. Tools for uncovering preservation risks
Content -> FITS -> C3PO -> Scout
• FITS: FITS output (XML)
• C3PO: file characteristics distribution (graphs and drill-down analysis)
• Scout: file and world properties throughout time, and notifications
7. FITS - File Information Tool Set
• http://fitstool.org
• Characterization
  • Identification
  • Feature extraction
  • Validation
• Support for:
  • DROID
  • JHove
  • Apache Tika
  • ADL Tool
  • Exiftool
  • FFIdent
  • File Utility (windows port)
  • NLNZ Metadata Extractor
  • OIS Audio, File and XML Information
• https://github.com/keeps/fits/tree/keeps
• Developed by KEEPS
• Added support for:
  • FIDO
  • Microsoft Office
  • Adobe Illustrator
  • Corel Draw
  • Email (EML)
  • Autocad (DWG)
  • Shapefile
  • RTF, TXT
  • Databases (DBML)
8. FITS - File Information Tool Set
• Demonstration
• Download from http://fitstool.org
• Execute for a file
    ./fits.sh -i file.png
• Execute for a directory
    ./fits.sh -r -i source_directory/ -o output_directory/
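The two invocations above can also be driven from a script when many collections need to be characterized. The following is a minimal sketch, assuming `fits.sh` is reachable at the given path and that only the `-i`, `-o` and `-r` flags shown on the slide are used. The function names `build_fits_command` and `run_fits` are illustrative and not part of FITS.

```python
import subprocess


def build_fits_command(fits_script, source, output=None, recursive=False):
    """Build the argument list for a FITS run from the flags shown on the slide."""
    command = [str(fits_script)]
    if recursive:
        command.append("-r")  # recurse into the source directory
    command += ["-i", str(source)]  # input file or directory
    if output is not None:
        command += ["-o", str(output)]  # output file or directory
    return command


def run_fits(fits_script, source, output=None, recursive=False):
    """Run FITS and return whatever it printed on standard output.

    check=True raises CalledProcessError if FITS exits with a non-zero status.
    """
    command = build_fits_command(fits_script, source, output, recursive)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `build_fits_command("./fits.sh", "source_directory/", "output_directory/", recursive=True)` reproduces the directory command from the slide as an argument list, which avoids shell quoting problems with unusual file names.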
9. FITS performance
• https://github.com/keeps/fits-testing
• 3 to 6 seconds per file
• 12 TB - a year
  • http://www.openplanetsfoundation.org/blogs/2013-01-09-year-fits
• Other options for scalability:
  • Fido
  • Apache Tika
  • Nanite
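To see why 3 to 6 seconds per file becomes a scalability concern, the per-file cost can be multiplied out for a collection. The sketch below is a back-of-the-envelope estimate only: the collection size of one million files and the worker counts are hypothetical, and it ignores start-up overhead and I/O contention.

```python
SECONDS_PER_DAY = 86400


def sequential_days(file_count, seconds_per_file):
    """Days needed to characterize every file one after another."""
    return file_count * seconds_per_file / SECONDS_PER_DAY


def parallel_days(file_count, seconds_per_file, workers):
    """Days needed if the work splits evenly over independent workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return sequential_days(file_count, seconds_per_file) / workers


if __name__ == "__main__":
    files = 1_000_000  # hypothetical collection size
    for cost in (3, 6):  # seconds per file, the range quoted on the slide
        print(f"{cost} s/file: {sequential_days(files, cost):.1f} days sequential, "
              f"{parallel_days(files, cost, 16):.1f} days with 16 workers")
```

Under these assumptions a single process needs roughly 35 to 69 days for one million files, which is why the slide lists lighter-weight alternatives for large collections.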
10. C3PO - Clever, Crafty Content Profile of Objects
• http://ifs.tuwien.ac.at/imp/c3po
• Web application
• Content characteristics aggregation
• Drill-down analysis
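The idea of content characteristics aggregation can be illustrated with a small script that counts identified formats across FITS XML documents. This is a sketch of the concept, not C3PO's implementation. It assumes the FITS output uses the namespace `http://hul.harvard.edu/ois/xml/ns/fits/fits_output` and an `identification/identity` element with a `format` attribute; the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumed FITS output namespace; adjust if your FITS version differs.
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Invented, minimal FITS-like document used only for illustration.
SAMPLE = """<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">
  <identification>
    <identity format="Portable Network Graphics" mimetype="image/png"/>
  </identification>
</fits>"""


def identified_formats(fits_xml):
    """Return the format names found in one FITS XML document."""
    root = ET.fromstring(fits_xml)
    identities = root.findall("fits:identification/fits:identity", FITS_NS)
    return [identity.get("format", "unknown") for identity in identities]


def format_profile(fits_documents):
    """Count documents per format; several identities count as a conflict."""
    counts = Counter()
    for document in fits_documents:
        formats = identified_formats(document)
        if not formats:
            counts["unknown"] += 1
        elif len(formats) == 1:
            counts[formats[0]] += 1
        else:
            counts["conflict"] += 1
    return counts
```

A distribution such as the one returned by `format_profile` is the kind of summary that a profile's graphs and drill-down views are built on.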
11. C3PO install
• Download binaries at:
  • http://dl.bintray.com/peshkira/c3po/
• Install mongodb:
  • http://www.mongodb.org/
• Install Apache Tomcat
  • http://tomcat.apache.org/
• Put the C3PO web app in Apache Tomcat
  • Remove the ROOT directory from webapps and rename the C3PO web app to ROOT.war
• Start Apache Tomcat and connect to:
  • http://localhost:8080/
• Usage guide:
  • https://github.com/peshkira/c3po/wiki/Usage-Guide
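The Tomcat deployment step above (remove the ROOT directory, then place the web app as ROOT.war) could be scripted roughly as follows. This is a sketch under assumptions: the function name `deploy_as_root` is illustrative, the paths to the downloaded WAR file and to Tomcat's webapps directory must be supplied by the user, and Tomcat should be stopped while the files are changed.

```python
import shutil
from pathlib import Path


def deploy_as_root(war_file, webapps_dir):
    """Replace Tomcat's default ROOT application with the given WAR file."""
    webapps = Path(webapps_dir)

    # Remove the unpacked default ROOT application, if present.
    root_dir = webapps / "ROOT"
    if root_dir.is_dir():
        shutil.rmtree(root_dir)

    # Copy the web app under the name ROOT.war so it is served at "/".
    target = webapps / "ROOT.war"
    shutil.copyfile(war_file, target)
    return target
```

After starting Tomcat again, the application should then answer at the root URL given on the slide rather than under a context path named after the WAR file.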
12. C3PO performance
Dataset: Statsbiblioteket (Denmark)
• Size: 440M files (12 TB)
• Process time: 388h (16 days) / 50h for XML report
• Average time: 2.5s per 1000 files
• The web application has a limit of 2.5 million FITS files
13. Scout: a preservation watch system
Monitors aspects of the world to detect preservation risks and opportunities
Diagram: Content, Policies, Web, Human knowledge and Registries feed into Scout, which issues risk notifications.
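The monitoring idea on this slide, watching properties of content and of the world and notifying when a risk condition holds, can be sketched in a few lines. The names below (`Measurement`, `Trigger`, `evaluate`) and the example property are hypothetical and chosen for illustration; they are not Scout's actual data model or API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Measurement:
    entity: str   # what was measured, e.g. a collection or a format
    prop: str     # name of the measured property
    value: float  # measured value


@dataclass
class Trigger:
    description: str                    # text used in the notification
    prop: str                           # property this trigger watches
    condition: Callable[[float], bool]  # returns True when a risk is present


def evaluate(triggers: List[Trigger], measurements: List[Measurement]) -> List[str]:
    """Return one notification for every measurement that satisfies a trigger."""
    notifications = []
    for measurement in measurements:
        for trigger in triggers:
            if trigger.prop == measurement.prop and trigger.condition(measurement.value):
                notifications.append(
                    f"{trigger.description}: {measurement.entity} has "
                    f"{measurement.prop} = {measurement.value}"
                )
    return notifications
```

For example, a trigger with the condition `lambda v: v > 0.05` on a hypothetical `invalid_file_ratio` property would produce a notification only for collections whose measured ratio exceeds five percent.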
14. Information Sources
• Format registries & software catalogues
• Digital repositories & web archives
• Organizational objectives
• Experiments
• Simulation
• Human knowledge
15. Current information sources
• Repository content and events
• SCAPE Policy model
• PRONOM
• Web semantic extraction
• Web page renderability experiments
20. How to be a part of Scout
• Check out
  • Site: http://openplanets.github.io/scout/
  • Report: http://www.scape-project.eu/deliverable/d12-2-final-version-of-the-preservation-watch-component
  • Demo: http://scout.scape.keep.pt
• Integrate your content
• Contribute with information (soon)
• Use Scout form for manual input of knowledge
21. Roadmap
• User support
• More trigger templates
• More adaptors
• KrakeN / Propminer
• Software catalogues
• Other format registries
• Other experiments information sources
• Manual input (human knowledge)
• Simulation
22. Luis Faria
lfaria@keep.pt
KEEP SOLUTIONS
www.keep-solutions.com
SCAPE webinar
July 26, 2014
Tools for uncovering preservation risks in large repositories