Web Archiving Profile - WADL 2013

Ahmed AlSum

Why
• To optimize query routing for the Memento Aggregator.
• To determine the missing parts of the web.

Who
Archive                       Full text   URI-lookup
Internet Archive                          x
Library of Congress                       x
Icelandic Web Archive                     x
Library and Archives Canada   x           x
British Library               x           x
UK National Library           x           x
Portuguese Web Archive        x           x
Web Archive of Catalonia      x           x
Croatian Web Archive          x           x
Archive of the Czech Web      x           x
National Taiwan University    x           x
Archive-It                    x           x

How
• Sampling from different sources
• Retrieve the TimeMap from each archive
• Analyze the TimeMaps
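
The last two steps above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the TimeMap has already been fetched as text in the application/link-format serialization defined by the Memento protocol (RFC 7089), and the archive host and URIs in SAMPLE are hypothetical.

```python
import re
from email.utils import parsedate_to_datetime

# One link-value: <URI> followed by zero or more ;name="value" attributes.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_timemap(text):
    """Return a sorted list of (datetime, memento URI) pairs from a TimeMap."""
    mementos = []
    for uri, attr_text in LINK_RE.findall(text):
        attrs = dict(ATTR_RE.findall(attr_text))
        rels = attrs.get("rel", "").split()
        # Keep only links typed as mementos ("memento", "first memento", ...).
        if "memento" in rels and "datetime" in attrs:
            mementos.append((parsedate_to_datetime(attrs["datetime"]), uri))
    return sorted(mementos)


def summarize(mementos):
    """Count the mementos and report the earliest and latest capture times."""
    if not mementos:
        return {"count": 0, "first": None, "last": None}
    return {
        "count": len(mementos),
        "first": mementos[0][0],
        "last": mementos[-1][0],
    }


# Hypothetical TimeMap for illustration only.
SAMPLE = (
    '<http://example.com/>;rel="original",\n'
    '<http://archive.example.org/timemap/link/http://example.com/>'
    ';rel="self";type="application/link-format",\n'
    '<http://archive.example.org/web/20000620180259/http://example.com/>'
    ';rel="first memento";datetime="Tue, 20 Jun 2000 18:02:59 GMT",\n'
    '<http://archive.example.org/web/20130601000000/http://example.com/>'
    ';rel="last memento";datetime="Sat, 01 Jun 2013 00:00:00 GMT"\n'
)

profile = summarize(parse_timemap(SAMPLE))
print(profile["count"], profile["first"].year, profile["last"].year)
```

Running the same summary over every sampled URI for one archive would give per-archive counts and date ranges, which is one plausible input for the age and growth-rate figures.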

URI Sample Sources
Web
1. DMOZ – Random sample
2. DMOZ – TLD: 2% of each TLD from DMOZ (.com, .org, .jp, etc.; 52 TLDs)
3. DMOZ – Languages: 100 URIs for each language (24 languages)
Web Archives
4. Top 1-grams from Bing
5. Top 1,000 query terms from Yahoo in 9 languages
User requests
6. IA Wayback Machine log files
7. Memento Aggregator log files
* We used hostnames only
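
The per-TLD sample (source 2) combined with the hostnames-only rule could look roughly like this. It is a sketch under stated assumptions: the function names, the fixed seed, and the choice to keep at least one hostname per TLD are illustrative and do not come from the slides, and the input URIs are made up.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def hostname(uri):
    """Reduce a URI to its lowercase hostname ("" if it has none)."""
    return (urlsplit(uri).hostname or "").lower()


def sample_per_tld(uris, fraction=0.02, seed=0):
    """Group unique hostnames by TLD, then sample `fraction` of each group."""
    by_tld = defaultdict(set)
    for uri in uris:
        host = hostname(uri)
        if "." in host:
            by_tld[host.rsplit(".", 1)[-1]].add(host)

    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = {}
    for tld, hosts in sorted(by_tld.items()):
        # Assumption: keep at least one hostname even for very small TLDs.
        k = max(1, round(len(hosts) * fraction))
        sample[tld] = sorted(rng.sample(sorted(hosts), k))
    return sample


# Made-up input: 100 .com hosts, 50 .org hosts, and one .jp host.
uris = [f"http://site{i}.com/page" for i in range(100)]
uris += [f"http://site{i}.org/" for i in range(50)]
uris += ["http://Example.JP/a/b?x=1"]

sample = sample_per_tld(uris)
print({tld: len(hosts) for tld, hosts in sample.items()})
```

Deduplicating on hostname before sampling keeps a site with many listed pages from being overrepresented in its TLD group.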

1. Web Archiving Profile Overview
Ahmed AlSum, PhD Candidate, Old Dominion University
Web Archiving and Digital Libraries (WADL 2013), a workshop at JCDL 2013
July 25-26, 2013, Indianapolis, Indiana, USA

2. What is the problem?
• Web archives are black boxes: they are accessible only through a search box (full-text or URI-lookup).
• We need to profile (characterize) web archives around the world by attributes such as:
o Age
o Top-level domains
o Languages
o Growth rate

7. General Coverage

8. Web Archive Growth Rate

9. TLD Sample Coverage

10. TLD per archive (TLD Sample)

11. TLD per archive (Fulltext search)

12. TLD across archives

13. Languages distribution per archive

14. Query Routing Evaluation