CLEAR+: a Credible Live
Evaluation Method of
Website Archivability
Vangelis Banos, Yannis Manolopoulos
3 JUNE 2015
NATIONAL DIGITAL INFORMATION INFRASTRUCTURE AND PRESERVATION PROGRAM
LIBRARY OF CONGRESS
Data Engineering Lab
Department of Informatics, Aristotle University, Thessaloniki, Greece
ARCHIVEREADY.COM
Table of Contents
1. Motivation and problem definition, related work
2. Website Archivability
3. CLEAR+: A Credible Live Method to Evaluate
Website Archivability
4. Demonstration: http://archiveready.com/
5. Experimental Evaluation
6. Use Cases
7. Web Content Management Systems Archivability
8. Discussion and conclusions
1. Motivation
• Web developer: I’m building a website. Is it
going to be archived correctly by a web archive?
I don’t know until I see the archived snapshot…
• Web archivist: Can I archive that website?
I don’t know, let’s crawl it and we’ll see the results…
• Professor: How can I teach my students about
web archiving?
Hundreds of standards, but not many relevant apps online…
3
Problem definition
• Web content acquisition is a critical step in the
process of web archiving;
• If the initial Submission Information Package lacks
completeness and accuracy for any reason (e.g.
missing or invalid web content), the rest of the
preservation processes are rendered useless;
• There is no guarantee that web bots dedicated to
retrieving website content can access and retrieve
it successfully;
• Web bots face increasing difficulties in harvesting
websites.
4
5
Problem definition
• Web harvesting is automated while Quality Assurance
(QA) is mostly manual.
• Web archives perform test crawls.
• Humans review the results, so resources are spent.
• After web harvesting, administrators manually review
the content and endorse or reject the harvested
material.
• Efforts to deploy crowdsourced techniques to
manage QA provide an indication of how significant the
bottleneck is (IIPC GA 2012 Crowdsourcing Workshop).
6
2. Our Contributions
1. The introduction of the notion of Website Archivability.
2. The Credible Live Evaluation of Archive Readiness Plus
(CLEAR+) method to measure Website Archivability.
3. ArchiveReady.com, a web application which implements
the proposed method.
Publications:
• Banos V., Manolopoulos Y.: A quantitative approach to
evaluate Website Archivability using the CLEAR+ method,
International Journal on Digital Libraries (IJDL), 2015.
• Banos V., Kim Y., Ross S., Manolopoulos Y.: CLEAR: a
credible method to evaluate website archivability,
iPRES’2013, Lisbon, 2013.
7
Our Aims
1. Provide a mechanism to improve the quality of web archives.
2. Expand and optimize the knowledge and practices of
web archivists, supporting them in their decision
making and risk management.
3. Standardize the web aggregation practices of web
archives, especially QA.
4. Foster good practices in web development and make
sites more amenable to harvesting, ingesting, and
preserving.
5. Raise awareness among web professionals regarding
preservation.
6. Support web archiving training.
What is Website Archivability?
Website Archivability (WA) captures the core
aspects of a website crucial in diagnosing
whether it has the potential to be archived
with completeness and accuracy.
Attention! It must not be confused with website dependability,
reliability, availability, safety, security, survivability or maintainability.
CLEAR+: A Credible Live Method to
Evaluate Website Archivability
• An approach to producing on-the-fly measurement
of Website Archivability,
• Web archives communicate with target websites via
standard HTTP,
• Information such as file types, content and transfer
errors could be used to support archival decisions,
• We combine this kind of information with an
evaluation of the website's compliance with
recognized practices in digital curation,
• We generate a credible score representing the
archivability of target websites.
9
The main components of CLEAR+
1. WA Facets: the factors that come into play and
need to be taken into account to calculate total WA.
2. Website Attributes: the website homepage
elements analysed to assess the WA Facets (e.g. the
HTML markup code).
3. Evaluations: the tests executed on the website
attributes (e.g. HTML code validation against W3C
HTML standards) and the approach used to combine
the test results to calculate the WA metrics.
10
11
CLEAR+: A Credible Live Method to Evaluate
Website Archivability
[Figure: the four WA Facets: Accessibility, Cohesion, Metadata and Standards Compliance]
12
Website attributes evaluated using CLEAR+
13
CLEAR+ Evaluations
1. Perform specific Evaluations on Website Attributes.
2. Combine the Evaluation results to calculate each
Archivability Facet’s score:
• Scores range from 0 to 100%.
• The significance of each Evaluation varies:
– High: critical issues which prevent web crawling or may
cause highly problematic web archiving results.
– Medium: issues which are not critical but may affect the
quality of web archiving results.
– Low: minor details which do not cause any issues when
they are missing but help web archiving when
available.
3. Website Archivability is the average of all Facets’ scores.
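The scoring steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide names the significance levels but not their numeric weights, so the 4/2/1 weights, the function names and the example evaluation results below are all hypothetical.

```python
# Hypothetical significance weights: the slides name the levels
# (High, Medium, Low) but do not give numeric values.
WEIGHTS = {"high": 4, "medium": 2, "low": 1}


def facet_score(evaluations):
    """Weighted average of evaluation results, each a percentage in 0-100.

    `evaluations` is a list of (significance, score) pairs.
    """
    total_weight = sum(WEIGHTS[sig] for sig, _ in evaluations)
    weighted_sum = sum(WEIGHTS[sig] * score for sig, score in evaluations)
    return weighted_sum / total_weight


def website_archivability(facets):
    """Plain average of the facet scores, as stated on the slide."""
    scores = [facet_score(evals) for evals in facets.values()]
    return sum(scores) / len(scores)


# Made-up evaluation results, for illustration only.
example = {
    "accessibility": [("high", 100), ("medium", 50)],
    "cohesion": [("medium", 100), ("low", 0)],
    "metadata": [("medium", 100), ("low", 100)],
    "standards_compliance": [("high", 100), ("medium", 0), ("low", 100)],
}
wa = website_archivability(example)
print(f"WA = {wa:.2f}%")
```

With weights like these, a failed High evaluation pulls a facet score down much more than a failed Low one, which matches the intent of the three significance levels.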
Accessibility
14
Accessibility
• A website is considered accessible only if web
crawlers are able to visit its home page, traverse its
content and retrieve it via standard HTTP requests.
• Performance is also an important aspect of web
archiving. Faster performance means faster web
content ingestion.
15
Accessibility Evaluations
16
Cohesion
17
Cohesion
• Relevant to:
• Efficient operation of web crawlers,
• Management of dependencies in digital
curation.
• If the files constituting a single website are dispersed
across different web locations, acquisition and
ingest are likely to suffer if one or more web
locations fail.
• A website that does not use third-party resources
is not affected by changes that occur outside
it.
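As a rough illustration of the cohesion idea, the share of a page's resources served from its own host can be computed as below. This is a sketch, not the actual CLEAR+ Cohesion evaluation: the function name, the host-equality criterion and the example URLs are assumptions made for this example.

```python
from urllib.parse import urljoin, urlparse


def local_resource_ratio(page_url, resource_urls):
    """Percentage of resources hosted on the same host as the page.

    Relative URLs are resolved against `page_url` first, so they
    count as local resources.
    """
    if not resource_urls:
        return 100.0  # nothing to fetch elsewhere
    page_host = urlparse(page_url).hostname
    local = sum(
        1
        for url in resource_urls
        if urlparse(urljoin(page_url, url)).hostname == page_host
    )
    return 100.0 * local / len(resource_urls)


# Hypothetical page with two local and two third-party resources.
resources = [
    "/style.css",
    "http://example.org/logo.png",
    "http://cdn.example.net/lib.js",
    "http://fonts.example.com/font.woff",
]
ratio = local_resource_ratio("http://example.org/", resources)
print(f"{ratio:.0f}% of resources are local")
```

A lower ratio means more third-party locations whose failure or change could affect the archived copy.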
18
Cohesion Evaluations
19
Metadata
20
Metadata
• The adequate provision of metadata has been a
continuing concern within digital curation.
• The lack of metadata impairs the archive’s ability to
manage, organise, retrieve and interact with content
effectively.
• Metadata may include descriptive or technical
information.
• Metadata increases the probability of successful
information extraction and reuse in web archives
after ingestion.
21
Metadata Evaluations
22
Standards
Compliance
23
Standards Compliance
• Compliance with standards is a recurring theme in
digital curation practices. Digital resources that are
to be preserved should be represented in known and
transparent standards.
24
Standards Compliance Evaluations
25
ArchiveReady.com
4. Demonstration
- Web application implementing CLEAR+,
- Web interface and a Web API returning JSON,
- Running on Linux, Python, Nginx, Redis, MySQL and the
PhantomJS headless browser.
26
archiveready.com DEMO
27
5. Experimental
Evaluation
28
5. Experimental evaluation
• Questions:
– How can we prove the validity of the Website
Archivability metric?
– Is it possible to calculate the WA of a website by
evaluating a single webpage?
29
Experiment 1: Evaluation using datasets
30
Experiment 1: Evaluation using datasets
31
Experiment 2: Evaluation by experts
• Experts rank 200 websites according to the quality
of their snapshots at the Internet Archive.
• We evaluate the same websites with
archiveready.com.
• We calculate Pearson’s correlation coefficient
for our variables and find correlations.
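For reference, Pearson's correlation coefficient for two equally long series x and y is the covariance of x and y divided by the product of their standard deviations. A minimal sketch follows; the function name is an assumption for this example and the input data is made up, not taken from the experiment.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up values: expert quality ratings vs. WA scores for five sites.
expert_ratings = [1, 2, 3, 4, 5]
wa_scores = [55.0, 60.0, 72.0, 80.0, 91.0]
print(f"r = {pearson(expert_ratings, wa_scores):.3f}")
```

Values close to 1 would indicate that sites rated highly by experts also tend to receive high WA scores.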
32
Experiment 3: WA variance in the pages
of the same website
• We evaluate only a single webpage to
calculate website archivability. Is this correct?
• Is the homepage WA representative of the
whole website WA?
• We use a sample of 800 websites and
calculate the WA of 10 different webpages from
each website to find out.
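One simple way to look at this question is to compare the homepage WA with the mean and spread of the WA of the sampled pages of the same site. The sketch below is only an illustration: the function name, the chosen statistics and the sample values are assumptions, not the analysis used in the experiment.

```python
from statistics import mean, pstdev


def wa_spread(homepage_wa, page_was):
    """Summarise how the WA of sampled pages relates to the homepage WA."""
    avg = mean(page_was)
    return {
        "mean": avg,
        "stdev": pstdev(page_was),  # population standard deviation
        "homepage_offset": homepage_wa - avg,
    }


# Made-up WA scores (in %) for a homepage and ten pages of one website.
summary = wa_spread(80.0, [78.0, 82.0, 79.0, 81.0, 80.0,
                           77.0, 83.0, 80.0, 79.0, 81.0])
print(summary)
```

A small standard deviation and a small homepage offset would suggest that the homepage is a reasonable proxy for the whole website.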
33
Experiment 3: WA variation in the
pages of the same website
34
6. Use Cases
35
Use Case 1: Deutsches Literatur Archiv,
Marbach, Germany
• German literature web archiving project,
• http://www.dla-marbach.de/dla/bibliothek/literatur_im_netz/netzliteratur/
• ~3,000 websites are preserved,
• An evaluation of the archivability
characteristics of these websites was
necessary before crawling,
• The archiveready.com API was used to gain
insight into their properties:
http://archiveready.com/docs/api.html
36
Use Case 2: Academia
• Used by digital curation units, researchers and
teachers.
– University of Newcastle, UK,
– Columbia University Libraries,
– Stanford University Libraries,
– University of Michigan, Bentley Historical Library,
– Old Dominion University.
37
7. Web CMS
Archivability
38
Web CMS Archivability
• CMS dominate the web
– (WordPress, Drupal, Joomla, MovableType and others)
• CMS constitute a common technical
framework for web publishing.
• If a CMS is ‘incompatible’ with some web
archiving aspect, millions of websites are
affected and web archives suffer.
39
Web CMS Archivability
• Our contribution:
– We study 12 prominent web CMS.
– We conduct experiments with a sample of ~5,800
websites based on these CMS.
– We make specific observations on the Website
Archivability characteristics of each CMS.
• Paper (under review):
– Web Content Management Systems Archivability,
Banos V., Manolopoulos Y., ADBIS 2015
40
Web CMS Archivability
41
Web CMS Archivability
• Indicative results:
– Drupal has the third highest WA score (82.08%). It has
good overall performance and the only issue is the
existence of too many inline scripts per instance
(15.09).
– DotNetNuke has the second worst WA score in our
evaluation (77.2%). We suggest that its developers look
into its RSS feeds (13% correct) and its lack of HTTP
caching support (5%).
– The Typo3 WA score is average (79%). It has the largest
number of invalid URLs per instance (12%).
42
8. Discussion &
Conclusions
43
Discussion and conclusions
44
• Introducing a new metric to quantify the previously
unquantifiable notion of WA is not an easy task.
• CLEAR+ and Website Archivability capture the core
aspects of a website crucial in diagnosing whether it
has the potential to be archived with correctness
and accuracy.
• Archiveready.com is a reference implementation of
the CLEAR+ method.
• Archiveready.com provides a REST API for 3rd parties.
Discussion and conclusions
45
1. Web professionals
- evaluate the archivability of their websites
in an easy but thorough way,
- become aware of web preservation concepts,
- embrace preservation-friendly practices.
2. Web archive operators
- make informed decisions on archiving websites,
- perform large scale website evaluations with ease,
- automate web archiving Quality Assurance,
- minimise wasted resources on problematic websites.
3. Academics
- teach students about web archiving.
THANK YOU
Visit: http://archiveready.com
Contact: vbanos@gmail.com
Learn More:
• Banos V., Manolopoulos Y.: A quantitative approach to
evaluate Website Archivability using the CLEAR+ method,
International Journal on Digital Libraries, 2015.
• Banos V., Kim Y., Ross S., Manolopoulos Y.: CLEAR: a credible
method to evaluate website archivability, 10th International
Conference on Preservation of Digital Objects (iPRES’2013),
Lisbon, 2013.
ANY QUESTIONS?
46

Website Archivability - Library of Congress NDIIPP Presentation 2015/06/03

  • 1. CLEAR+: a Credible Live Evaluation Method of Website Archivability Vangelis Banos, Yannis Manolopoulos 3 JUNE 2015 NATIONAL DIGITAL INFORMATION INFRASTRUCTURE AND PRESERVATION PROGRAM LIBRARY OF CONGRESS Data Engineering Lab Department of Informatics, Aristotle University, Thessaloniki , Greece ARCHIVEREADY.COM
  • 2. Website Archivability 2 Table of Contents 1. Motivation and problem definition, related work, 2. Website Archivability, 3. CLEAR+: A Credible Live method to Evaluate Website Archivability, 4. Demonstration: http://archiveready.com/, 5. Experimental Evaluation, 6. Use Cases, 7. Web Content Management Systems Archivability 8. Discussion – conclusions.
  • 3. 1. Motivation • Web developer: I’m building a website. Is it going to be archived correctly by a web archive? I don’t know until I see the archived snapshot… • Web archivist: Can I archive that website? I don’t know, let’s crawl it and we’ll see the results… • Professor: How can I teach my students about web archiving? 100’s of standards but not many relevant apps online… 3
  • 4. Problem definition • Web content acquisition is a critical step in the process of web archiving; • If the initial Submission Information Package lacks completeness and accuracy for any reason (e.g. missing or invalid web content), the rest of the preservation processes are rendered useless; • There is no guarantee that web bots dedicated to retrieving website content can access and retrieve it successfully; • Web bots face increasing difficulties in harvesting websites. 4
  • 5. 5 • Web harvesting is automated while Quality Assurance (QA) is mostly manual. • Web archives perform test crawls. • Humans review the results, resources are spent. • After web harvesting, administrators review manually the content and endorse or reject the harvested material. • Efforts to deploy crowdsourced techniques to manage QA provide an indication of how significant the bottleneck is. • (IIPC GA 2012 Crowdsourcing Workshop) Problem definition
  • 6. 2. Our Contributions: 1. the introduction of the notion of Website Archivability, 2. the Credible Live Evaluation of Archive Readiness Plus (CLEAR+) method to measure Website Archivability, 3. ArchiveReady.com, a web application which implements the proposed method. Publications: • Banos V., Manolopoulos Y.: A quantitative approach to evaluate Website Archivability using the CLEAR+ method, International Journal on Digital Libraries (IJDL), 2015. • Banos V., Kim Y., Ross S., Manolopoulos Y.: CLEAR: a credible method to evaluate website archivability, iPRES’2013, Lisbon, 2013.
  • 7. Our Aims 1. Provide a mechanism to improve the quality of web archives. 2. Expand and optimize the knowledge and practices of web archivists, supporting them in their decision making and risk management. 3. Standardize the web aggregation practices of web archives, especially QA. 4. Foster good practices in web development and make sites more amenable to harvesting, ingesting, and preserving. 5. Raise awareness among web professionals regarding preservation. 6. Support web archiving training.
  • 8. What is Website Archivability? Website Archivability (WA) captures the core aspects of a website crucial in diagnosing whether it has the potential to be archived with completeness and accuracy. Attention! It must not be confused with website dependability, reliability, availability, safety, security, survivability, or maintainability.
  • 9. CLEAR+: A Credible Live Method to Evaluate Website Archivability • An approach to producing on-the-fly measurement of Website Archivability, • Web archives communicate with target websites via standard HTTP, • Information such as file types, content and transfer errors could be used to support archival decisions, • We combine this kind of information with an evaluation of the website's compliance with recognized practices in digital curation, • We generate a credible score representing the archivability of target websites. 9
  • 10. The main components of CLEAR+ 1. WA Facets: the factors that come into play and need to be taken into account to calculate total WA. 2. Website Attributes: the website homepage elements analysed to assess the WA Facets (e.g. the HTML markup code). 3. Evaluations: the tests executed on the website attributes (e.g. HTML code validation against W3C HTML standards) and approach used to combine the test results to calculate the WA metrics. 10
  • 11. CLEAR+: A Credible Live Method to Evaluate Website Archivability. Facets: Accessibility, Cohesion, Standards Compliance, Metadata.
  • 13. CLEAR+ Evaluations 1. Perform specific Evaluations on Website Attributes, 2. In order to calculate each Archivability Facet’s score: • Scores range from 0 to 100%, • Evaluation significance varies: • High: critical issues which prevent web crawling or may cause highly problematic web archiving results. • Medium: issues which are not critical but may affect the quality of web archiving results. • Low: minor details which do not cause any issues when they are missing but will help web archiving when available. 3. Website Archivability is the average of all Facets’ scores.
  • 15. Accessibility • A website is considered accessible only if web crawlers are able to visit its home page, traverse its content and retrieve it via standard HTTP requests. • Performance is also an important aspect of web archiving. Faster performance means faster web content ingestion. 15
  • 18. Cohesion • Relevant to: • Efficient operation of web crawlers, • Management of dependencies with digital curation. • If files constituting a single website are dispersed across different web locations, acquisition and ingest are likely to suffer if one or more of those locations fail. • Changes that occur outside the website will not affect it if it does not use third-party resources.
  • 21. Metadata • The adequate provision of metadata has been a continuing concern within digital curation. • The lack of metadata impairs the archive’s ability to manage, organise, retrieve and interact with content effectively. • Metadata may include descriptive or technical information. • Metadata increases the probability of successful information extraction and reuse in web archives after ingestion. 21
  • 24. Standards Compliance • Compliance with standards is a recurring theme in digital curation practices. It is recommended that for digital resources to be preserved they need to be represented in known and transparent standards. 24
  • 26. 4. Demonstration: ArchiveReady.com - Web application implementing CLEAR+, - Web interface and a Web API returning JSON, - Running on Linux, Python, Nginx, Redis, MySQL, and the PhantomJS headless browser.
  • 29. 5. Experimental evaluation • Questions: – How can we prove the validity of the Website Archivability metric? – Is it possible to calculate the WA of a website by evaluating a single webpage? 29
  • 30. Experiment 1: Evaluation using datasets 30
  • 31. Experiment 1: Evaluation using datasets 31
  • 32. Experiment 2: Evaluation by experts • Experts rank 200 websites according to the quality of their snapshots at the Internet Archive • We evaluate the same websites with archiveready.com • We calculate the Pearson’s Correlation Coefficient of our variables and find correlations. 32
  • 33. Experiment 3: WA variance in the pages of the same website • We evaluate only a single webpage to calculate website archivability. Is this correct? • Is the homepage WA representative of the whole website WA? • We use a set of 800 websites and calculate the WA of 10 different webpages for each website to find out.
  • 34. Experiment 3: WA variation in the pages of the same website 34
  • 36. Use Case 1: Deutsches Literatur Archiv, Marbach, Germany • German literature web archiving project, • http://www.dla-marbach.de/dla/bibliothek/literatur_im_netz/netzliteratur/ • ~3,000 websites are preserved, • An evaluation of the archivability characteristics of these websites was necessary before crawling, • The archiveready.com API was used to gain insight into their properties: http://archiveready.com/docs/api.html
  • 37. Use Case 2: Academia • Used by digital curation units, researchers and teachers. – University of Newcastle, UK, – Columbia University Libraries, – Stanford University Libraries, – University of Michigan, Bentley Historical Library, – Old Dominion University. 37
  • 39. Web CMS Archivability • CMS dominate the web – (WordPress, Drupal, Joomla, Movable Type, and others) • CMS constitute a common technical framework for web publishing. • If a CMS is ‘incompatible’ with some web archiving aspect, millions of websites are affected and web archives suffer.
  • 40. Web CMS Archivability • Our contribution: – We study 12 prominent web CMS. – We conduct experiments with a sample of ~5,800 websites based on these CMS. – We make specific observations on the Website Archivability characteristics of each CMS. • Paper (under review): – Web Content Management Systems Archivability, Banos V., Manolopoulos Y., ADBIS 2015
  • 42. Web CMS Archivability • Indicative results: – Drupal has the third highest WA score (82.08%). It has good overall performance and the only issue is the existence of too many inline scripts per instance (15.09). – DotNetNuke has the second worst WA score in our evaluation (77.2%). We suggest that they look into their RSS feeds (13% correct) and their lack of HTTP caching support (5%). – Typo3's WA score is average (79%). It has the largest number of invalid URLs per instance (12%).
  • 44. Discussion and conclusions 44 • Introducing a new metric to quantify the previously unquantifiable notion of WA is not an easy task. • CLEAR+ and Website Archivability capture the core aspects of a website crucial in diagnosing whether it has the potential to be archived with correctness and accuracy. • Archiveready.com is a reference implementation of the CLEAR+ method. • Archiveready.com provides a REST API for 3rd parties.
  • 45. Discussion and conclusions 45 1. Web professionals - evaluate the archivability of their websites in an easy but thorough way, - become aware of web preservation concepts, - embrace preservation-friendly practices. 2. Web archive operators - make informed decisions on archiving websites, - perform large scale website evaluations with ease, - automate web archiving Quality Assurance, - minimise wasted resources on problematic websites. 3. Academics - teach students about web archiving.
  • 46. THANK YOU Visit: http://archiveready.com Contact: vbanos@gmail.com Learn More: • Banos V., Manolopoulos Y.: A quantitative approach to evaluate Website Archivability using the CLEAR+ method, International Journal on Digital Libraries, 2015. • Banos V., Kim Y., Ross S., Manolopoulos Y.: CLEAR: a credible method to evaluate website archivability, 10th International Conference on Preservation of Digital Objects (iPRES’2013), Lisbon, 2013. ANY QUESTIONS? 46
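The scoring scheme outlined on slide 13 can be sketched in Python as follows. This is a minimal illustration, not the ArchiveReady implementation: the slides state only that evaluations carry High, Medium or Low significance, that facet scores range from 0 to 100%, and that WA is the average of the facet scores. The numeric weights in `SIGNIFICANCE_WEIGHTS`, the weighted-mean aggregation inside a facet, the sample evaluation results, and the names `facet_score` and `website_archivability` are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical weights: the slides name the three levels but not their values.
SIGNIFICANCE_WEIGHTS = {"high": 4.0, "medium": 2.0, "low": 1.0}


def facet_score(evaluations):
    """Weighted mean of the evaluation results of one facet, in percent.

    `evaluations` is a list of (result_percent, significance) pairs,
    with result_percent in the range 0..100.
    """
    if not evaluations:
        raise ValueError("a facet needs at least one evaluation")
    weighted_sum = 0.0
    total_weight = 0.0
    for result, significance in evaluations:
        if not 0 <= result <= 100:
            raise ValueError("evaluation results must be within 0..100")
        weight = SIGNIFICANCE_WEIGHTS[significance]
        weighted_sum += weight * result
        total_weight += weight
    return weighted_sum / total_weight


def website_archivability(facets):
    """Plain average of the facet scores, as stated on the slide."""
    return mean(facet_score(evals) for evals in facets.values())


# Made-up evaluation results for the four facets named in the deck.
facets = {
    "accessibility": [(100, "high"), (50, "medium")],
    "cohesion": [(80, "medium")],
    "metadata": [(100, "medium"), (0, "low")],
    "standards_compliance": [(90, "high"), (100, "low")],
}
wa = website_archivability(facets)
print(f"Website Archivability: {wa:.1f}%")
```

Keeping the weights in a single dictionary makes it easy to try other weightings without touching the aggregation logic.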

Editor's Notes

  1. Abstract: Web archiving is crucial to ensure that cultural, scientific and social heritage on the web remains accessible and usable over time. A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for such reasons as website complexity, the plethora of underlying technologies and ultimately the open-ended nature of the web. The purpose of this work is to establish the notion of Website Archivability (WA) and to introduce the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure WA for any website. Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potential to be archived with completeness and accuracy. An appreciation of the archivability of a website should provide archivists with a valuable tool when assessing the possibilities of archiving material and influence web design professionals to consider the implications of their design decisions on the likelihood that a website could be archived. A prototype application, archiveready.com, has been established to demonstrate the viability of the proposed method for assessing Website Archivability.
  2. Dirty data -> useless system. As websites become more sophisticated and complex, the difficulties that web bots face in harvesting them increase. For instance, some web bots have limited abilities to process GIS files, dynamic web content, or streaming media [16]. To overcome these obstacles, standards have been developed to make websites more amenable to harvesting by web bots. Two examples are the Sitemaps.xml and Robots.txt protocols. Such protocols are not used universally.
  3. According to the web archiving process followed by the National Library of New Zealand, after performing the harvests, the operators review and endorse or reject the harvested material; accepted material is then deposited in the repository. The Web Curator Tool (WCT) supports such web archiving processes as permissions, job scheduling, harvesting, quality review, and the collection of descriptive metadata. Focusing on quality review, when a harvest is complete, the harvest result is saved in the digital asset store, and the Target Instance is saved in the Harvested state. The next step is for the Target Instance Owner to Quality Review the harvest. WCT operators perform this task manually. For example, the IIPC has organized a Crowdsourcing workshop which included a QA task.
  4. Website archivability must not be confused with website dependability: the former refers to the ability to archive a website, while the latter is a system property that integrates such attributes as reliability, availability, safety, security, survivability and maintainability [1].
  5. The concept of CLEAR emerged from our current research in web preservation in the context of the BlogForever project which involves weblog harvesting and archiving. Our work revealed the need for a method to assess website archive readiness in order to support web archiving workflows.
  6. Cohesion is tested on three levels: • examining how many hosts are employed in relation to the location of referenced media content, • examining how many hosts are employed in relation to supporting resources (e.g. robots.txt, sitemap.xml, and javascripts), • examining the number of times proprietary software or plugins are referenced.
  7. Already contacted by the following institutions The Internet Archive, University of Manchester, Columbia University Libraries, Society of California Archivists General Assembly, Old Dominion University, Virginia, USA, Digital Archivists in Netherlands.
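The first two cohesion levels in note 6 compare the hosts of referenced resources with the host of the website itself. A minimal Python sketch of that idea follows. The function name `cohesion_ratio`, the input format (a page URL plus a list of already-extracted resource references), the example URLs, and the use of a simple same-host percentage are assumptions for illustration, not the ArchiveReady implementation.

```python
from urllib.parse import urljoin, urlsplit


def cohesion_ratio(page_url, resource_urls):
    """Return the percentage of resources served from the page's own host."""
    page_host = urlsplit(page_url).hostname
    if page_host is None:
        raise ValueError("page_url must be an absolute URL")
    if not resource_urls:
        return 100.0  # no referenced resources, so nothing lives elsewhere
    same_host = 0
    for ref in resource_urls:
        absolute = urljoin(page_url, ref)  # resolve relative references
        if urlsplit(absolute).hostname == page_host:
            same_host += 1
    return 100.0 * same_host / len(resource_urls)


# Hypothetical references extracted from a page (images, CSS, scripts, ...).
resources = [
    "/static/logo.png",
    "css/site.css",
    "http://cdn.example.net/lib.js",
    "http://example.org/robots.txt",
]
ratio = cohesion_ratio("http://example.org/index.html", resources)
print(f"{ratio:.0f}% of referenced resources share the website's host")
```

Comparing exact hostnames is the strictest choice: a subdomain such as a CDN host counts as a separate location, which matches the concern that any additional location is an additional point of failure.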