SlideShare une entreprise Scribd logo
1  sur  64
An Extensible Framework for
      Creating Personal Web Archives of
       Content Behind Authentication
                        Mat Kelly
               Director:       Michele C. Weigle
               Committee:      Michael L. Nelson
                               Yaohang Li




8/3/2012              MS Thesis - August 2012
Background
• Internet Archive crawls and preserves
  webpages creating web archives
• Only public sites
  are preserved




  8/3/2012            MS Thesis - August 2012   2
Problems
• A lot of content on web is not preserved
      – e.g., Social media content
• As more people document lives on social
  media, importance of preserving becomes
  greater
• Content not preserved = heritage lost



8/3/2012                 MS Thesis - August 2012   3
Problems:
           Unsuitability of Institutional Tools

• Overhead and
  learning curve
  is steep
• Institutional
  tools meant for
  larger scale



8/3/2012             MS Thesis - August 2012      4
Problems:
           Complete Lack of Preservation




8/3/2012          MS Thesis - August 2012   5
State of the Art in
              Personal Web Archiving
• Personal web archiving tools
      – Break when target sites’ hierarchy changes
      – Produce sub-optimal archives
• Some conventional web archiving practices
  not easily translatable to personal web
  archiving



8/3/2012                 MS Thesis - August 2012     6
Goals of Thesis
• Show social media content can be preserved
      – With output more optimal than current offerings
• Remedy the tools’ breaking problem
      – Remotely specify target sites’ hierarchies
      – Show spec is easily adaptable to tools
• Identify and consider solutions to domain-
  specific nuances
• Establish section commonality between social
  media websites
8/3/2012                 MS Thesis - August 2012          7
Extent of the Unpreserved




8/3/2012           MS Thesis - August 2012   8
Ways to Capture Missing Content:
            Supply crawler with auth credentials

• Unsuitable for institutional crawlers
• Other Personal Web Archiving problems
  remain




 8/3/2012              MS Thesis - August 2012     9
Ways to Capture Missing Content:
                     “Save As” Desired Pages

• Miss metadata
• Doesn’t produce interoperable output




 8/3/2012                MS Thesis - August 2012   10
Ways to Capture Missing Content:
                     Utilize Fetching Tools

   – Lose look & feel
   – Difficult capturing
     all content desired
   – Frequently sub-
     optimal output
     format




8/3/2012                   MS Thesis - August 2012   11
Tools Utilized In Thesis:
             Archive Facebook
• Firefox add-on
• Creates navigable
  “web archives”
• Outputs files w/
  original file type
• Sequential Archiving




8/3/2012             MS Thesis - August 2012   12
Tools Utilized In Thesis:
                 WARCreate
• Google Chrome extension
• Creates Wayback-
  Compatible Web ARChive
  (WARC) files
• Allows page manipulation
  prior to generating archive




8/3/2012              MS Thesis - August 2012   13
Integration with Other Tools
• Wayback (WARC replay system)
      – Allows WARCreate output to be re-experienced
      – Provides content for Memento
• Memento
      – Allows temporal traversal of archived pages
      – Timegate serves as relay only to local
        wayback instance
• XAMPP (Client-Side Server Suite)
      – Overcome Javascript inadequacies
      – Provide foundation for replay system
8/3/2012                 MS Thesis - August 2012       14
Institutional vs. Personal Web Archiving




 8/3/2012       MS Thesis - August 2012   15
Institutional vs. Personal Web Archiving




 8/3/2012       MS Thesis - August 2012   16
Institutional vs. Personal Web Archiving



                                          Crawls
                                          WWW




 8/3/2012       MS Thesis - August 2012            17
Institutional vs. Personal Web Archiving



                                          Crawls
                                          WWW




 8/3/2012       MS Thesis - August 2012            18
Institutional vs. Personal Web Archiving

                                          Crawls
                                          WWW



                                               outputs


                                          WARC




 8/3/2012       MS Thesis - August 2012              19
Institutional vs. Personal Web Archiving

                                          Crawls
                                          WWW



                                               outputs


                                          WARC




 8/3/2012       MS Thesis - August 2012              20
Institutional vs. Personal Web Archiving

                                          Crawls
                                          WWW



                                               outputs


                                          WARC




 8/3/2012       MS Thesis - August 2012              21
Institutional vs. Personal Web Archiving

                                              Crawls
                                              WWW



                                                   outputs



Publicly viewable                             WARC
 Archive replay




    8/3/2012        MS Thesis - August 2012              22
Institutional vs. Personal Web Archiving




 8/3/2012       MS Thesis - August 2012   23
Institutional vs. Personal Web Archiving




 8/3/2012       MS Thesis - August 2012   24
Institutional vs. Personal Web Archiving




 8/3/2012       MS Thesis - August 2012   25
Institutional vs. Personal Web Archiving




                                          WARC




 8/3/2012       MS Thesis - August 2012          26
Institutional vs. Personal Web Archiving




                                          WARC




 8/3/2012       MS Thesis - August 2012          27
Institutional vs. Personal Web Archiving




                                          WARC




 8/3/2012       MS Thesis - August 2012          28
Institutional vs. Personal Web Archiving




                                          WARC




 8/3/2012       MS Thesis - August 2012          29
Problems Specific to
              Personal Web Archiving
• Personalization/Authentication
      – Different users, facebook.com, different content
• Context
      – Different browsing tools, different site experience
• Output Format
      – Ad hoc approaches are often used that lose
        metadata, context, content, etc.


8/3/2012                 MS Thesis - August 2012              30
Personalization/Authentication
• Two users, same URI, vastly different content
• One user, same URI, authentication vs. no
  authentication, different content
      – As shown in IA’s archive of FB




8/3/2012                 MS Thesis - August 2012   31
Context
• Same URI+diff devices
    = diff content served
• Mobile vs. PC
• Firefox vs. Chrome

<!--[if lt IE 5]>
Your browser is too old and cannot
render this content.
<![endif]-->
<!--[if gte IE 9]>
...features not supported by version
of IE prior to 9...
<![endif]-->


8/3/2012                       MS Thesis - August 2012   32
Output Format




8/3/2012      MS Thesis - August 2012   33
Output Format
• Saving only HTML is not
  enough
• Local references need
  manipulation
• Browser alone is
  insufficient replay system




8/3/2012                 MS Thesis - August 2012   34
Output Format
• Misses HTTP headers          REQUEST
                               GET / HTTP/1.1 Host: www.facebook.com User-Agent:
   • Request & Response




                                                                                          NOT CAPTURED BY BACKUP TOOLS/METHODS
                               Mozilla/5.0 (Windows NT 5.1; rv:14.0)
                               Gecko/20100101 Firefox/14.0.1 Accept:

   • e.g., Auth                text/html,application/xhtml+xml,application/xml;q=0
                               .9,*/*;q=0.8 Accept-Language: en-us,en;q=0.5
                               Accept-Encoding: gzip, deflate Connection: keep-
• If headers included,         alive Cookie: datr=KMo6T3jicPEdEl4pY2yFnr6F;
                               lu=TgU4dhoSBG0ZmEnThtLeyqIA;
  inputs for personalization   c_user=100003509861423; fr=0KMqEWNPPgver2SIx.AWXf-
                               6Ww_7iQFPPP9sFtiiMPaV0; s=Aa4dL41H8UGZ-4Lf.BQGryl;
                               xs=1%3Am7APtmN9-ev4Vg%3A0%3A1343929509;
  can be viewed                act=1343929622029%2F3%3A2; p=1;
                               presence=EM343929627EuserFA21B03509861423A2EstateFD
                               sb2F0Et2F_5b_5dElm2FnullEuct2F1343929017BEtrFnullEt
                               wF3302582290EatF1343929627063EutF0EsndF1EnotF0CEchF
                               RESPONSE
                               Dp_5f1B03509861423F1CC
                               HTTP/1.1 200 OK Cache-Control: private, no-
                               cache, no-store, must-revalidate Expires: Sat, 01
                               Jan 2000 00:00:00 GMT P3P: CP="Facebook does not
                               have a P3P policy. Learn why here:
                               http://fb.me/p3p" Pragma: no-cache X-Content-Type-
                               Options: nosniff x-frame-options: DENY X-XSS-
                               Protection: 1; mode=block Content-Encoding: gzip
                               Content-Type: text/html; charset=utf-8 X-FB-Debug:
                               uMXm8343NOn0OOIeDna2teVECApUiEqj6s7GTwNx+Ss= Date:
                               Thu, 02 Aug 2012 19:26:12 GMT Transfer-Encoding:
                               chunked Connection: keep-alive
 8/3/2012                  MS Thesis - August 2012                                   35
Specification and OOP
• Sites’ hierarchies resemble OOP concepts
  (polymorphism, inheritance)
• Sites’ sections can be represented as classes
• Classes converted to XML specification
• Personal Web Archiving tools utilize this
  specification to become adaptive



8/3/2012            MS Thesis - August 2012       36
Commonality of “Sections”
             Between Social Media Websites
Abstracted media
      type
 personal stream        wall                          posts        my tweets
   global stream      news feed                     streams     followees’ tweets
multimedia - photos    photos                        photos
multimedia - videos    videos                         videos
 photo collection      albums
           posts        notes
       friends         friends                        circles




8/3/2012                        MS Thesis - August 2012                         37
Example: Facebook Section Objects
           SocialMediaWebsite facebook =
             new SocialMediaWebsite(homepage => "http://www.facebook.com")
           facebook->decorate([
            new SocialMediaWebsiteSectionPersonalStream(
             name => "Wall",
             url => "http://www.facebook.com/profile.php?sk=wall",
             preprocessor => new SocialMediaScrollPrepreprocessor(
               timeBetweenFirings => 0,
               maxFirings = 0,
               conditionBeforeSubsequentFirings = null
             )
            ),
            new SocialMediaWebsiteSectionUserInfo(
             name => "Info",
             url => "http://www.facebook.com/profile.php?sk=info"
            ),
            new SocialMediaWebsiteSectionMultimediaCollection(
             name => "Photos",
             url => "http://www.facebook.com/profile.php?sk=photos",
             proprocessor => new SocialMediaScrollPreprocessor(
               timeBetweenFirings => 0,
               maxFirings => 0,
               conditionBeforeSubsequentFirings = null
             )
            ),
            ...

8/3/2012               MS Thesis - August 2012                           38
Example: Facebook Section Objects
           SocialMediaWebsite facebook =
             new SocialMediaWebsite(homepage => "http://www.facebook.com")
           facebook->decorate([
            new SocialMediaWebsiteSectionPersonalStream(
             name => "Wall",
             url => "http://www.facebook.com/profile.php?sk=wall",
             preprocessor => new SocialMediaScrollPrepreprocessor(
               timeBetweenFirings => 0,
               maxFirings = 0,
               conditionBeforeSubsequentFirings = null
             )
            ),
            new SocialMediaWebsiteSectionUserInfo(
             name => "Info",
             url => "http://www.facebook.com/profile.php?sk=info"
            ),
            new SocialMediaWebsiteSectionMultimediaCollection(
             name => "Photos",
             url => "http://www.facebook.com/profile.php?sk=photos",
             proprocessor => new SocialMediaScrollPreprocessor(
               timeBetweenFirings => 0,
               maxFirings => 0,
               conditionBeforeSubsequentFirings = null
             )
            ),
            ...

8/3/2012               MS Thesis - August 2012                           39
Example: Facebook Section Objects
           SocialMediaWebsite facebook =
             new SocialMediaWebsite(homepage => "http://www.facebook.com")
           facebook->decorate([
            new SocialMediaWebsiteSectionPersonalStream(
             name => "Wall",
             url => "http://www.facebook.com/profile.php?sk=wall",
             preprocessor => new SocialMediaScrollPrepreprocessor(
               timeBetweenFirings => 0,
               maxFirings = 0,
               conditionBeforeSubsequentFirings = null
             )
            ),
            new SocialMediaWebsiteSectionUserInfo(
             name => "Info",
             url => "http://www.facebook.com/profile.php?sk=info"
            ),
            new SocialMediaWebsiteSectionMultimediaCollection(
             name => "Photos",
             url => "http://www.facebook.com/profile.php?sk=photos",
             proprocessor => new SocialMediaScrollPreprocessor(
               timeBetweenFirings => 0,
               maxFirings => 0,
               conditionBeforeSubsequentFirings = null
             )
            ),
            ...

8/3/2012               MS Thesis - August 2012                           40
Example: Hierarchical Similarities
            SocialMediaWebsite facebook =
              new SocialMediaWebsite(homepage => "http://www.facebook.com")
            facebook->decorate([
             new SocialMediaWebsiteSectionPersonalStream(
              name => "Wall",
              url => "http://www.facebook.com/profile.php?sk=wall",
              preprocessor => new SocialMediaScrollPrepreprocessor(
                timeBetweenFirings => 0,
                maxFirings = 0,
                conditionBeforeSubsequentFirings = null
              )
             ),
             new SocialMediaWebsiteSectionUserInfo(
              name => "Info",
              url => "http://www.facebook.com/profile.php?sk=info"
             ),
             new SocialMediaWebsiteSectionMultimediaCollection(
              name => "Photos",
              url => "http://www.facebook.com/profile.php?sk=photos",
              proprocessor => new SocialMediaScrollPreprocessor(
                timeBetweenFirings => 0,
                maxFirings => 0,
                conditionBeforeSubsequentFirings = null
              )
             ),
             ...

8/3/2012                MS Thesis - August 2012                           41
Spec Retrieval Process
1. Tool accesses root spec
                                             Root                     Site
   w/ URI parameter                          Spec                     Spec

2. Spec returns with
   reference to site-                           (spec)/facebook.xml


   specific hierarchy spec
3. Tool fetches site spec
4. Updated site hierarchy
   returned

8/3/2012           MS Thesis - August 2012                                   42
Concrete Usage – Tool Adaptation
• Archive Facebook
      – Map current URIs to remotely fetched URIs
      – Perform pre-processing defined in FB spec
• WARCreate
      – Implement sequential/cohesive archiving




8/3/2012                MS Thesis - August 2012     43
Evaluation 1:Tool Adaptability
1.    Setup synthetic social media website
2.    Define site’s remote spec
3.    Change AFB to preserve synthetic site
4.    Change hierarchy of synthetic site
5.    Show AFB breaking
6.    Change synthetic site spec
7.    Show AFB functionality restored

8/3/2012              MS Thesis - August 2012   44
Evaluation 1: Tool Adaptability
                Step 1: Synthetic Site Creation

• Simple hierarchy
  for base case testing
• Requires Auth
• Utilizes CDN
• Can be manipulated
• Recursive Sections



8/3/2012                 MS Thesis - August 2012   45
Evaluation 1: Tool Adaptability
               Step 2: Define Site Remove Spec
                                        <?xml version="1.0" ?>
                                        <socialMediaWebsite>
                                         <homepage>http://test.socialstandard.org</homepage>
                                         <sections>
                                          <socialMediaWebsiteSection type="SocialMediaWebsiteSectionPersonalStream">
                                           <name>Personal Stream</name>
                                           <url>http://test.socialstandard.org/personal</url>
                                           <preprocessor type="SocialMediaScrollPreprocessor">
                                            <timeBetweenFirings>0</timeBetweenFirings>
                                            <maxFirings>0</maxFirings>
                                            <conditionBeforeSubsequentFiring>?</conditionBeforeSubsequentFiring>
                                           </preprocessor>
                                          </socialMediaWebsiteSection>
                                          <socialMediaWebsiteSection type="SocialMediaWebsiteSectionMultimediaCollection">
                                           <name>Photo Albums</name>
                                           <url>http://test.socialstandard.org/albums</url>
                                           <preprocessor type="SocialMediaScrollPreprocessor">
                                            <timeBetweenFirings>0</timeBetweenFirings>
                                            <maxFirings>0</maxFirings>
                                            <conditionBeforeSubsequentFiring>?</conditionBeforeSubsequentFiring>
                                           </preprocessor>
                                           <children>
                                            <regex>&lt;div class="album.*&lt;ashref="(.*)"</regex>
                                            <type>SocialMediaWebsiteSectionMultimediaCollection</type>
                                            <name>Photo Album</name>
                                           </children>
                                          </socialMediaWebsiteSection>
                                          <socialMediaWebsiteSection type="SocialMediaWebsiteSectionMultimediaCollection">
                                           <name>Photo Album</name>
                                           <url>http://test.socialstandard.org/album/[a-zA-Z0-9]+</url>
                                           <preprocessor type="SocialMediaScrollPreprocessor">
                                            <timeBetweenFirings>0</timeBetweenFirings>
                                            <maxFirings>0</maxFirings>
                                            <conditionBeforeSubsequentFiring>?</conditionBeforeSubsequentFiring>
8/3/2012                MS Thesis   - August 2012
                                           </preprocessor>                                                          46
                                           <children>
                                            <regex>&lt;div class="album.*&lt;ashref="(album/[a-zA-Z0-9]+)"</regex>
Evaluation 1: Tool Adaptability
           Step 3: Change AFB to preserve synthetic site
                                    getCurrentSiteSpec : function(step,urlIn,hostIn){


• Utilize existing capture
                                     switch(step){
                                      case 0:
                                      var xhr = new XMLHttpRequest();
                                      var siteSpec = "", uriOut = "";
  mechanisms                          $.ajax({
                                         url: urlIn,
                                         success: function(data){

• Exploit guaranteed                        var host = "www.facebook.com"; //hostIn n/a here
                                            var parser = new DOMParser();
                                            var socialMediaWebsites = $(data.childNodes[0]).children

  attributes (e.g., host)                   for(var i=0; i<socialMediaWebsites.length; i++){
                                              var smw = socialMediaWebsites[i];
                                              if($(smw).find("homepage").text().indexOf(host) != -1)
                                                siteSpec = $(smw).find("specification").text();

• Make code general                             getCurrentSiteSpec(1,siteSpec,host);
                                              } //fi
                                            } //rof

  enough to be widely                    }, error: function(){}
                                        }); //xaja
                                        break;

  applicable to sections                case 1:
                                          $.ajax({
                                          url: urlIn,
                                          success: function(data){
                                             var ls = window.content.localStorage;
                                             ls.setItem("spec", (new XMLSerializer()).serializeToStr
                                             archivefbBrowserOverlay.capture(ls.getItem("spec"));
                                            }, error : function(){}
                                         };
                                         break;
                                      }
                                    }
8/3/2012                   MS Thesis - August 2012                                        47
Evaluation 1: Tool Adaptability
            Step 4: Change hierarchy of synthetic site

• Simulate simply through mod_rewrite
• Previously:
 RewriteRule ^personal$
    index.php?section=personal [NC]

• Updated:
 RewriteRule ^myfeed$
    index.php?section=personal [NC]
• Disavow previous reference altogether to
  ensure 404

8/3/2012                  MS Thesis - August 2012        48
Evaluation 1: Tool Adaptability
                 Step 5: Show AFB breaking

• Run archiving procedure again, note failing of
  procedure or content not captured




8/3/2012               MS Thesis - August 2012     49
Evaluation 1: Tool Adaptability
                           Step 6: Change synthetic site spec
<?xml version="1.0" ?>                                <?xml version="1.0" ?>
<socialMediaWebsite>                                  <socialMediaWebsite>
 <homepage>http://test.socialstandard.org</homepage> <homepage>http://test.socialstandard.org</homepage>
 <sections>                                            <sections>
  <socialMediaWebsiteSection                            <socialMediaWebsiteSection
    type="SocialMediaWebsiteSectionPersonalStream">       type="SocialMediaWebsiteSectionPersonalStream">
   <name>Personal Stream</name>                          <name>Personal Stream</name>
   <url>http://test.socialstandard.org/personal</url>    <url>http://test.socialstandard.org/myfeed</url>
   <preprocessor                                         <preprocessor
    type="SocialMediaScrollPreprocessor">                 type="SocialMediaScrollPreprocessor">
    <timeBetweenFirings>0</timeBetweenFirings>            <timeBetweenFirings>0</timeBetweenFirings>
    <maxFirings>0</maxFirings>                            <maxFirings>0</maxFirings>
    <conditionBeforeSubsequentFiring>?</conditionBefo     <conditionBeforeSubsequentFiring>?</conditionBefo
    reSubsequentFiring>                                   reSubsequentFiring>
   </preprocessor>                                       </preprocessor>
  </socialMediaWebsiteSection>                          </socialMediaWebsiteSection>
  …                                                     …




     8/3/2012                              MS Thesis - August 2012                                50
Evaluation 1: Tool Adaptability
             Step 7: Show AFB functionality restored

• Execute archiving
  procedure of tool
  w/o modifying code
• Show that result
  matches step 1




8/3/2012                  MS Thesis - August 2012      51
Evaluation 2: Preservation of
               Content Behind Authentication

1. Create tool (WARCreate) to store to
   WARC format
2. Setup easy-to-use Replay system
   (local wayback)
3. Execute Tool’s Archiving Procedure
4. Verify replayability in wayback



8/3/2012               MS Thesis - August 2012   52
Existing Tools’ Shortcoming:
                 Facebook Data Dump

• Lose look & feel
• FB decides what is
  preserved
• Unreliable
  (requests not
  always answered)



8/3/2012            MS Thesis - August 2012   53
Existing Tools’ Shortcoming:
                  “Save Webpage As”

• Metadata is Lost
• Archive is not
  Self-Contained
• Archive is not
  interoperable with
  Archive Replay Systems (e.g. wayback)



8/3/2012             MS Thesis - August 2012   54
Existing Tools’ Shortcoming:
                      warc-tools

• No archive creation facility
• Relies on incomplete WARC
  spec (like WARCreate)
• Only command-line access:
  suitable for sysadmins and
  power users



8/3/2012            MS Thesis - August 2012   55
Existing Tools’ Shortcoming:
                                wget &wget-warc

• No content manipulation
• Require CLI interaction
    – Issue for Ajax driven
      content (no JS support)
• wget-warc
    – Ext. of wget w/ WARC I/O
• No look & feel
  preservation




 8/3/2012                         MS Thesis - August 2012   56
Existing Tools’ Shortcoming:
                   Archive Facebook

•   Output is not compatible w/ Wayback
•   Prone to breaking when FB hierarchy changed
•   Limited to Firefox web browser
•   Cannot escape browser sandbox for portable
    archives




8/3/2012             MS Thesis - August 2012   57
Existing Tools’ Shortcoming:
                     WARCreate

• No built-in sequential
  archiving
• Relies on subset of
  WARC spec
• Limited to Chrome




8/3/2012            MS Thesis - August 2012   58
Shortcoming of Spec
• Relies on accessible URIs of sites’ sections
      – If base page content does not have a URI
        mapping, no reference exists to direct the browser
• Not comprehensive of Social Media sites
• Likely doesn’t account for some section types




8/3/2012                 MS Thesis - August 2012         59
Future Work
• Expand spec website coverage
• Account for sites w/o clearly accessible URIs
• WARCreate to implement whole official WARC
  standard
• Other SocialMediaWebsitePreprocessor types
• Address perspective issues
      – Personalization/Auth, context, archive vs. backup


8/3/2012                 MS Thesis - August 2012            60
Contributions
1. Highlight Personal Web Archiving difficulties
      –    ways they can be addressed
2. Provide remote spec for PWA tools to use to be
   more robust to sites’ hierarchy changes
3. Create tool (WARCreate)
     – allows content behind auth to be preserved to
       standard format
4. Leverage client-side server to exec scripts in
   support of personal web preservation
5. Establish section commonality between social
   media websites
8/3/2012                  MS Thesis - August 2012      61
Conclusions
• Personal web archiving has unique problems
  not exhibited in conventional web archiving
• Tools become more adaptive by utilizing
  proposed spec
• Browsers can be used as medium for
  preservation of personal web content
• With little work, server technologies can help
  to ease the task of personal web archiving

8/3/2012            MS Thesis - August 2012        62
WARCreate-Related Presentations



    ACM/IEEE Joint Conference on Digital Libraries   Digital Preservation 2012    Innovation Award by NDSA/Library of Congress
                     JCDL ‘12                                                                   For WARCreate

Mat Kelly (Old Dominion University, Norfolk, VA), Michele C. Weigle (Old Dominion University, Norfolk, VA), Michael Nelson (Old Dominion
     University, Norfolk, VA). "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," Digital Preservation 2012, Tools
     Demo Session: Web Archiving; 2012 Jul 25; Washington, DC.
Mat Kelly (Old Dominion University, Norfolk, VA) and Michele C. Weigle (Old Dominion University, Norfolk, VA), "WARCreate - Create
     Wayback-Consumable WARC Files from Any Webpage (demo)," In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries
     (JCDL). Washington, DC, June 2012


                          For more information on:
                                   WARCreate: http://warcreate.com
                                   Archive Facebook: http://bit.ly/archivefb

       8/3/2012                                            MS Thesis - August 2012                                                  63
Example: Implicit Recursion




8/3/2012            MS Thesis - August 2012   65

Contenu connexe

En vedette

Avoiding Spoilers On MediaWiki Fan Sites Using Memento
Avoiding Spoilers On MediaWiki Fan Sites Using MementoAvoiding Spoilers On MediaWiki Fan Sites Using Memento
Avoiding Spoilers On MediaWiki Fan Sites Using MementoShawn Jones
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionSawood Alam
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingUsing Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingYasmin AlNoamany, PhD
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Michael Nelson
 
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...Tobias Kuhn
 
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesSawood Alam
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingSawood Alam
 
Web Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchWeb Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchSawood Alam
 

En vedette (8)

Avoiding Spoilers On MediaWiki Fan Sites Using Memento
Avoiding Spoilers On MediaWiki Fan Sites Using MementoAvoiding Spoilers On MediaWiki Fan Sites Using Memento
Avoiding Spoilers On MediaWiki Fan Sites Using Memento
 
Web Archiving: A Brief Introduction
Web Archiving: A Brief IntroductionWeb Archiving: A Brief Introduction
Web Archiving: A Brief Introduction
 
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through StorytellingUsing Web Archives to Enrich the Live Web Experience Through Storytelling
Using Web Archives to Enrich the Live Web Experience Through Storytelling
 
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
Web Archiving Activities of ODU’s Web Science and Digital Library Research G...
 
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linke...
 
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web ArchivesInterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
InterPlanetary Wayback: Peer-To-Peer Permanence of Web Archives
 
TPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive ProfilingTPDL 2016 Doctoral Consortium - Web Archive Profiling
TPDL 2016 Doctoral Consortium - Web Archive Profiling
 
Web Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext SearchWeb Archive Profiling Through Fulltext Search
Web Archive Profiling Through Fulltext Search
 

Similaire à An Extensible Framework for Creating Personal Web Archives of Content Behind Authentication

BlogForever eChallenges 2012
BlogForever eChallenges 2012BlogForever eChallenges 2012
BlogForever eChallenges 2012BlogForever
 
SharePoint Web Content Management - Lessons Learnt/top 5 tips
SharePoint Web Content Management - Lessons Learnt/top 5 tipsSharePoint Web Content Management - Lessons Learnt/top 5 tips
SharePoint Web Content Management - Lessons Learnt/top 5 tipsChris O'Brien
 
JIO and WebViewers: interoperability for Javascript and Web Applications
JIO and WebViewers: interoperability  for Javascript and Web ApplicationsJIO and WebViewers: interoperability  for Javascript and Web Applications
JIO and WebViewers: interoperability for Javascript and Web ApplicationsXWiki
 
JavaOne2013 Leveraging Linked Data and OSLC
JavaOne2013 Leveraging Linked Data and OSLCJavaOne2013 Leveraging Linked Data and OSLC
JavaOne2013 Leveraging Linked Data and OSLCSteve Speicher
 
Open standards and open source mean open for business cms expo session mc-k...
Open standards and open source mean open for business   cms expo session mc-k...Open standards and open source mean open for business   cms expo session mc-k...
Open standards and open source mean open for business cms expo session mc-k...Cheryl McKinnon
 
Alabfi em-20120624
Alabfi em-20120624Alabfi em-20120624
Alabfi em-20120624zepheiraorg
 
BIBFRAME Transisition Update
BIBFRAME Transisition UpdateBIBFRAME Transisition Update
BIBFRAME Transisition Updatezepheiraorg
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...hannonhill
 
Introduction to Visual studio 2012
Introduction to Visual studio 2012 Introduction to Visual studio 2012
Introduction to Visual studio 2012 Prashant Chaudhary
 
Time -Travel on the Internet
Time -Travel on the InternetTime -Travel on the Internet
Time -Travel on the InternetIRJET Journal
 
Client side storage on the modern web
Client side storage on the modern webClient side storage on the modern web
Client side storage on the modern webRajasekharan Vengalil
 

Similaire à An Extensible Framework for Creating Personal Web Archives of Content Behind Authentication (20)

BlogForever eChallenges 2012
BlogForever eChallenges 2012BlogForever eChallenges 2012
BlogForever eChallenges 2012
 
SharePoint Web Content Management - Lessons Learnt/top 5 tips
SharePoint Web Content Management - Lessons Learnt/top 5 tipsSharePoint Web Content Management - Lessons Learnt/top 5 tips
SharePoint Web Content Management - Lessons Learnt/top 5 tips
 
JIO and WebViewers: interoperability for Javascript and Web Applications
JIO and WebViewers: interoperability  for Javascript and Web ApplicationsJIO and WebViewers: interoperability  for Javascript and Web Applications
JIO and WebViewers: interoperability for Javascript and Web Applications
 
JavaOne2013 Leveraging Linked Data and OSLC
JavaOne2013 Leveraging Linked Data and OSLCJavaOne2013 Leveraging Linked Data and OSLC
JavaOne2013 Leveraging Linked Data and OSLC
 
gupea_2077_38605_1
gupea_2077_38605_1gupea_2077_38605_1
gupea_2077_38605_1
 
dmBridge & dmMonocle
dmBridge & dmMonocledmBridge & dmMonocle
dmBridge & dmMonocle
 
Architecting for failure
Architecting for failureArchitecting for failure
Architecting for failure
 
NoTube: Models & Semantics
NoTube: Models & SemanticsNoTube: Models & Semantics
NoTube: Models & Semantics
 
Web storage
Web storage Web storage
Web storage
 
Web 1.0, 3.0. 3.0 School of Business
Web 1.0, 3.0. 3.0 School of BusinessWeb 1.0, 3.0. 3.0 School of Business
Web 1.0, 3.0. 3.0 School of Business
 
Web 1.0, 2.0 & 3.0
Web 1.0, 2.0 & 3.0Web 1.0, 2.0 & 3.0
Web 1.0, 2.0 & 3.0
 
Open standards and open source mean open for business cms expo session mc-k...
Open standards and open source mean open for business   cms expo session mc-k...Open standards and open source mean open for business   cms expo session mc-k...
Open standards and open source mean open for business cms expo session mc-k...
 
Alabfi em-20120624
Alabfi em-20120624Alabfi em-20120624
Alabfi em-20120624
 
BIBFRAME Transisition Update
BIBFRAME Transisition UpdateBIBFRAME Transisition Update
BIBFRAME Transisition Update
 
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
Faster! Optimize Your Cascade Server Experience, by Justin Klingman, Beacon T...
 
Introduction to Visual studio 2012
Introduction to Visual studio 2012 Introduction to Visual studio 2012
Introduction to Visual studio 2012
 
Drupal 8
Drupal 8Drupal 8
Drupal 8
 
No More SQL
No More SQLNo More SQL
No More SQL
 
Time -Travel on the Internet
Time -Travel on the InternetTime -Travel on the Internet
Time -Travel on the Internet
 
Client side storage on the modern web
Client side storage on the modern webClient side storage on the modern web
Client side storage on the modern web
 

Plus de Mat Kelly

Aggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity FrameworkAggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity FrameworkMat Kelly
 
Client-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer HeaderClient-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer HeaderMat Kelly
 
A Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web ArchivesA Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web ArchivesMat Kelly
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Mat Kelly
 
Exploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web ArchivesExploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web ArchivesMat Kelly
 
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...
JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...Mat Kelly
 
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...Mat Kelly
 
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite ImageryFacilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite ImageryMat Kelly
 
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...Mat Kelly
 
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014Mat Kelly
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital PreservationMat Kelly
 
Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013Mat Kelly
 
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction SystemIEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction SystemMat Kelly
 
Digital Preservation 2013
Digital Preservation 2013Digital Preservation 2013
Digital Preservation 2013Mat Kelly
 
The Revolution Will Not Be Archived
The Revolution Will Not Be ArchivedThe Revolution Will Not Be Archived
The Revolution Will Not Be ArchivedMat Kelly
 

Plus de Mat Kelly (16)

Aggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity FrameworkAggregating Private and Public Web Archives Using the Mementity Framework
Aggregating Private and Public Web Archives Using the Mementity Framework
 
Client-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer HeaderClient-Assisted Memento Aggregation Using the Prefer Header
Client-Assisted Memento Aggregation Using the Prefer Header
 
A Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web ArchivesA Framework for Aggregating Public and Private Web Archives
A Framework for Aggregating Public and Private Web Archives
 
Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count Impact of URI Canonicalization on Memento Count
Impact of URI Canonicalization on Memento Count
 
Exploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web ArchivesExploring Aggregation of Personal, Private, and Institutional Web Archives
Exploring Aggregation of Personal, Private, and Institutional Web Archives
 
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...
JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...JCDL 2015 Doctoral Consortium - A Framework for AggregatingPrivate and Publi...
JCDL 2015 Doctoral Consortium - A Framework for Aggregating Private and Publi...
 
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
Visualizing Digital Collections of Web Archives from Columbia Web Archiving C...
 
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite ImageryFacilitation of the A Posteriori Replication of Web Published Satellite Imagery
Facilitation of the A Posteriori Replication of Web Published Satellite Imagery
 
Slides
SlidesSlides
Slides
 
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
Mink: Integrating the Live and Archived Web Viewing Experience Using Web Brow...
 
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
Efficient Thumbnail Generation for Web Archives at Digital Preservation 2014
 
Browser-Based Digital Preservation
Browser-Based Digital PreservationBrowser-Based Digital Preservation
Browser-Based Digital Preservation
 
Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013Archive What I See Now - Archive-It Partner Meeting 2013 2013
Archive What I See Now - Archive-It Partner Meeting 2013 2013
 
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction SystemIEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
IEEE VIS 2013 Graph-Based Navigation of a Box Office Prediction System
 
Digital Preservation 2013
Digital Preservation 2013Digital Preservation 2013
Digital Preservation 2013
 
The Revolution Will Not Be Archived
The Revolution Will Not Be ArchivedThe Revolution Will Not Be Archived
The Revolution Will Not Be Archived
 

Dernier

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 

Dernier (20)

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

An Extensible Framework for Creating Personal Web Archives of Content Behind Authentication

  • 1. An Extensible Framework for Creating Personal Web Archives of Content Behind Authentication Mat Kelly Director: Michele C. Weigle Committee: Michael L. Nelson Yaohang Li 8/3/2012 MS Thesis - August 2012
  • 2. Background • Internet Archive crawls and preserves webpages creating web archives • Only public sites are preserved 8/3/2012 MS Thesis - August 2012 2
  • 3. Problems • A lot of content on web is not preserved – e.g., Social media content • As more people document lives on social media, importance of preserving becomes greater • Content not preserved = heritage lost 8/3/2012 MS Thesis - August 2012 3
  • 4. Problems: Unsuitability of Institutional Tools • Overhead and learning curve is steep • Institutional tools meant for larger scale 8/3/2012 MS Thesis - August 2012 4
  • 5. Problems: Complete Lack of Preservation 8/3/2012 MS Thesis - August 2012 5
  • 6. State of the Art in Personal Web Archiving • Personal web archiving tools – Break when target sites’ hierarchy changes – Produce sub-optimal archives • Some conventional web archiving practices not easily translatable to personal web archiving 8/3/2012 MS Thesis - August 2012 6
  • 7. Goals of Thesis • Show social media content can be preserved – With output more optimal than current offerings • Remedy the tools’ breaking problem – Remotely specify target sites’ hierarchies – Show spec is easily adaptable to tools • Identify and consider solutions to domain- specific nuances • Establish section commonality between social media websites 8/3/2012 MS Thesis - August 2012 7
  • 8. Extent of the Unpreserved 8/3/2012 MS Thesis - August 2012 8
  • 9. Ways to Capture Missing Content: Supply crawler with auth credentials • Unsuitable for institutional crawlers • Other Personal Web Archiving problems remain 8/3/2012 MS Thesis - August 2012 9
  • 10. Ways to Capture Missing Content: “Save As” Desired Pages • Miss metadata • Doesn’t produce interoperable output 8/3/2012 MS Thesis - August 2012 10
  • 11. Ways to Capture Missing Content: Utilize Fetching Tools – Lose look & feel – Difficult capturing all content desired – Frequently sub- optimal output format 8/3/2012 MS Thesis - August 2012 11
  • 12. Tools Utilized In Thesis: Archive Facebook • Firefox add-on • Creates navigable “web archives” • Outputs files w/ original file type • Sequential Archiving 8/3/2012 MS Thesis - August 2012 12
  • 13. Tools Utilized In Thesis: WARCreate • Google Chrome extension • Creates Wayback- Compatible Web ARChive (WARC) files • Allows page manipulation prior to generating archive 8/3/2012 MS Thesis - August 2012 13
  • 14. Integration with Other Tools • Wayback (WARC replay system) – Allows WARCreate output to be re-experienced – Provides content for Memento • Memento – Allows temporal traversal of archived pages – Timegate serves as relay only to local wayback instance • XAMPP (Client-Side Server Suite) – Overcome Javascript inadequacies – Provide foundation for replay system 8/3/2012 MS Thesis - August 2012 14
  • 15. Institutional vs. Personal Web Archiving 8/3/2012 MS Thesis - August 2012 15
  • 16. Institutional vs. Personal Web Archiving 8/3/2012 MS Thesis - August 2012 16
  • 17. Institutional vs. Personal Web Archiving Crawls WWW 8/3/2012 MS Thesis - August 2012 17
  • 18. Institutional vs. Personal Web Archiving Crawls WWW 8/3/2012 MS Thesis - August 2012 18
  • 19. Institutional vs. Personal Web Archiving Crawls WWW outputs WARC 8/3/2012 MS Thesis - August 2012 19
  • 20. Institutional vs. Personal Web Archiving Crawls WWW outputs WARC 8/3/2012 MS Thesis - August 2012 20
  • 21. Institutional vs. Personal Web Archiving Crawls WWW outputs WARC 8/3/2012 MS Thesis - August 2012 21
  • 22. Institutional vs. Personal Web Archiving Crawls WWW outputs Publicly viewable WARC Archive replay 8/3/2012 MS Thesis - August 2012 22
  • 23. Institutional vs. Personal Web Archiving 8/3/2012 MS Thesis - August 2012 23
  • 24. Institutional vs. Personal Web Archiving 8/3/2012 MS Thesis - August 2012 24
  • 25. Institutional vs. Personal Web Archiving 8/3/2012 MS Thesis - August 2012 25
  • 26. Institutional vs. Personal Web Archiving WARC 8/3/2012 MS Thesis - August 2012 26
  • 27. Institutional vs. Personal Web Archiving WARC 8/3/2012 MS Thesis - August 2012 27
  • 28. Institutional vs. Personal Web Archiving WARC 8/3/2012 MS Thesis - August 2012 28
  • 29. Institutional vs. Personal Web Archiving WARC 8/3/2012 MS Thesis - August 2012 29
  • 30. Problems Specific to Personal Web Archiving • Personalization/Authentication – Different users, facebook.com, different content • Context – Different browsing tools, different site experience • Output Format – Ad hoc approaches are often used that lose metadata, context, content, etc. 8/3/2012 MS Thesis - August 2012 30
  • 31. Personalization/Authentication • Two users, same URI, vastly different content • One user, same URI, authentication vs. no authentication, different content – As shown in IA’s archive of FB 8/3/2012 MS Thesis - August 2012 31
  • 32. Context • Same URI+diff devices = diff content served • Mobile vs. PC • Firefox vs. Chrome <!--[if lt IE 5]> Your browser is too old and cannot render this content. <![endif]--> <!--[if gte IE 9]> ...features not supported by version of IE prior to 9... <![endif]--> 8/3/2012 MS Thesis - August 2012 32
  • 33. Output Format 8/3/2012 MS Thesis - August 2012 33
  • 34. Output Format • Saving only HTML is not enough • Local references need manipulation • Browser alone is insufficient replay system 8/3/2012 MS Thesis - August 2012 34
  • 35. Output Format • Misses HTTP headers REQUEST GET / HTTP/1.1 Host: www.facebook.com User-Agent: • Request & Response NOT CAPTURED BY BACKUP TOOLS/METHODS Mozilla/5.0 (Windows NT 5.1; rv:14.0) Gecko/20100101 Firefox/14.0.1 Accept: • e.g., Auth text/html,application/xhtml+xml,application/xml;q=0 .9,*/*;q=0.8 Accept-Language: en-us,en;q=0.5 Accept-Encoding: gzip, deflate Connection: keep- • If headers included, alive Cookie: datr=KMo6T3jicPEdEl4pY2yFnr6F; lu=TgU4dhoSBG0ZmEnThtLeyqIA; inputs for personalization c_user=100003509861423; fr=0KMqEWNPPgver2SIx.AWXf- 6Ww_7iQFPPP9sFtiiMPaV0; s=Aa4dL41H8UGZ-4Lf.BQGryl; xs=1%3Am7APtmN9-ev4Vg%3A0%3A1343929509; can be viewed act=1343929622029%2F3%3A2; p=1; presence=EM343929627EuserFA21B03509861423A2EstateFD sb2F0Et2F_5b_5dElm2FnullEuct2F1343929017BEtrFnullEt wF3302582290EatF1343929627063EutF0EsndF1EnotF0CEchF RESPONSE Dp_5f1B03509861423F1CC HTTP/1.1 200 OK Cache-Control: private, no- cache, no-store, must-revalidate Expires: Sat, 01 Jan 2000 00:00:00 GMT P3P: CP="Facebook does not have a P3P policy. Learn why here: http://fb.me/p3p" Pragma: no-cache X-Content-Type- Options: nosniff x-frame-options: DENY X-XSS- Protection: 1; mode=block Content-Encoding: gzip Content-Type: text/html; charset=utf-8 X-FB-Debug: uMXm8343NOn0OOIeDna2teVECApUiEqj6s7GTwNx+Ss= Date: Thu, 02 Aug 2012 19:26:12 GMT Transfer-Encoding: chunked Connection: keep-alive 8/3/2012 MS Thesis - August 2012 35
  • 36. Specification and OOP • Sites’ hierarchies resemble OOP concepts (polymorphism, inheritance) • Sites’ sections can be represented as classes • Classes converted to XML specification • Personal Web Archiving tools utilize this specification to become adaptive 8/3/2012 MS Thesis - August 2012 36
  • 37. Commonality of “Sections” Between Social Media Websites Abstracted media type personal stream wall posts my tweets global stream news feed streams followees’ tweets multimedia - photos photos photos multimedia - videos videos videos photo collection albums posts notes friends friends circles 8/3/2012 MS Thesis - August 2012 37
  • 38. Example: Facebook Section Objects SocialMediaWebsite facebook = new SocialMediaWebsite(homepage => "http://www.facebook.com") facebook->decorate([ new SocialMediaWebsiteSectionPersonalStream( name => "Wall", url => "http://www.facebook.com/profile.php?sk=wall", preprocessor => new SocialMediaScrollPrepreprocessor( timeBetweenFirings => 0, maxFirings = 0, conditionBeforeSubsequentFirings = null ) ), new SocialMediaWebsiteSectionUserInfo( name => "Info", url => "http://www.facebook.com/profile.php?sk=info" ), new SocialMediaWebsiteSectionMultimediaCollection( name => "Photos", url => "http://www.facebook.com/profile.php?sk=photos", proprocessor => new SocialMediaScrollPreprocessor( timeBetweenFirings => 0, maxFirings => 0, conditionBeforeSubsequentFirings = null ) ), ... 8/3/2012 MS Thesis - August 2012 38
  • 39. Example: Facebook Section Objects SocialMediaWebsite facebook = new SocialMediaWebsite(homepage => "http://www.facebook.com") facebook->decorate([ new SocialMediaWebsiteSectionPersonalStream( name => "Wall", url => "http://www.facebook.com/profile.php?sk=wall", preprocessor => new SocialMediaScrollPrepreprocessor( timeBetweenFirings => 0, maxFirings = 0, conditionBeforeSubsequentFirings = null ) ), new SocialMediaWebsiteSectionUserInfo( name => "Info", url => "http://www.facebook.com/profile.php?sk=info" ), new SocialMediaWebsiteSectionMultimediaCollection( name => "Photos", url => "http://www.facebook.com/profile.php?sk=photos", proprocessor => new SocialMediaScrollPreprocessor( timeBetweenFirings => 0, maxFirings => 0, conditionBeforeSubsequentFirings = null ) ), ... 8/3/2012 MS Thesis - August 2012 39
  • 40. Example: Facebook Section Objects SocialMediaWebsite facebook = new SocialMediaWebsite(homepage => "http://www.facebook.com") facebook->decorate([ new SocialMediaWebsiteSectionPersonalStream( name => "Wall", url => "http://www.facebook.com/profile.php?sk=wall", preprocessor => new SocialMediaScrollPrepreprocessor( timeBetweenFirings => 0, maxFirings = 0, conditionBeforeSubsequentFirings = null ) ), new SocialMediaWebsiteSectionUserInfo( name => "Info", url => "http://www.facebook.com/profile.php?sk=info" ), new SocialMediaWebsiteSectionMultimediaCollection( name => "Photos", url => "http://www.facebook.com/profile.php?sk=photos", proprocessor => new SocialMediaScrollPreprocessor( timeBetweenFirings => 0, maxFirings => 0, conditionBeforeSubsequentFirings = null ) ), ... 8/3/2012 MS Thesis - August 2012 40
  • 41. Example: Hierarchical Similarities SocialMediaWebsite facebook = new SocialMediaWebsite(homepage => "http://www.facebook.com") facebook->decorate([ new SocialMediaWebsiteSectionPersonalStream( name => "Wall", url => "http://www.facebook.com/profile.php?sk=wall", preprocessor => new SocialMediaScrollPrepreprocessor( timeBetweenFirings => 0, maxFirings = 0, conditionBeforeSubsequentFirings = null ) ), new SocialMediaWebsiteSectionUserInfo( name => "Info", url => "http://www.facebook.com/profile.php?sk=info" ), new SocialMediaWebsiteSectionMultimediaCollection( name => "Photos", url => "http://www.facebook.com/profile.php?sk=photos", proprocessor => new SocialMediaScrollPreprocessor( timeBetweenFirings => 0, maxFirings => 0, conditionBeforeSubsequentFirings = null ) ), ... 8/3/2012 MS Thesis - August 2012 41
  • 42. Spec Retrieval Process 1. Tool accesses root spec Root Site w/ URI parameter Spec Spec 2. Spec returns with reference to site- (spec)/facebook.xml specific hierarchy spec 3. Tool fetches site spec 4. Updated site hierarchy returned 8/3/2012 MS Thesis - August 2012 42
  • 43. Concrete Usage – Tool Adaptation • Archive Facebook – Map current URIs to remotely fetched URIs – Perform pre-processing defined in FB spec • WARCreate – Implement sequential/cohesive archiving 8/3/2012 MS Thesis - August 2012 43
  • 44. Evaluation 1:Tool Adaptability 1. Setup synthetic social media website 2. Define site’s remote spec 3. Change AFB to preserve synthetic site 4. Change hierarchy of synthetic site 5. Show AFB breaking 6. Change synthetic site spec 7. Show AFB functionality restored 8/3/2012 MS Thesis - August 2012 44
  • 45. Evaluation 1: Tool Adaptability Step 1: Synthetic Site Creation • Simple hierarchy for base case testing • Requires Auth • Utilizes CDN • Can be manipulated • Recursive Sections 8/3/2012 MS Thesis - August 2012 45
  • 46. Evaluation 1: Tool Adaptability Step 2: Define Site Remove Spec <?xml version="1.0" ?> <socialMediaWebsite> <homepage>http://test.socialstandard.org</homepage> <sections> <socialMediaWebsiteSection type="SocialMediaWebsiteSectionPersonalStream"> <name>Personal Stream</name> <url>http://test.socialstandard.org/personal</url> <preprocessor type="SocialMediaScrollPreprocessor"> <timeBetweenFirings>0</timeBetweenFirings> <maxFirings>0</maxFirings> <conditionBeforeSubsequentFiring>?</conditionBeforeSubsequentFiring> </preprocessor> </socialMediaWebsiteSection> <socialMediaWebsiteSection type="SocialMediaWebsiteSectionMultimediaCollection"> <name>Photo Albums</name> <url>http://test.socialstandard.org/albums</url> <preprocessor type="SocialMediaScrollPreprocessor"> <timeBetweenFirings>0</timeBetweenFirings> <maxFirings>0</maxFirings> <conditionBeforeSubsequentFiring>?</conditionBeforeSubsequentFiring> </preprocessor> <children> <regex>&lt;div class="album.*&lt;ashref="(.*)"</regex> <type>SocialMediaWebsiteSectionMultimediaCollection</type> <name>Photo Album</name> </children> </socialMediaWebsiteSection> <socialMediaWebsiteSection type="SocialMediaWebsiteSectionMultimediaCollection"> <name>Photo Album</name> <url>http://test.socialstandard.org/album/[a-zA-Z0-9]+</url> <preprocessor type="SocialMediaScrollPreprocessor"> <timeBetweenFirings>0</timeBetweenFirings> <maxFirings>0</maxFirings> <conditionBeforeSubsequentFiring>?</conditionBeforeSubsequentFiring> 8/3/2012 MS Thesis - August 2012 </preprocessor> 46 <children> <regex>&lt;div class="album.*&lt;ashref="(album/[a-zA-Z0-9]+)"</regex>
  • 47. Evaluation 1: Tool Adaptability Step 3: Change AFB to preserve synthetic site getCurrentSiteSpec : function(step,urlIn,hostIn){ • Utilize existing capture switch(step){ case 0: var xhr = new XMLHttpRequest(); var siteSpec = "", uriOut = ""; mechanisms $.ajax({ url: urlIn, success: function(data){ • Exploit guaranteed var host = "www.facebook.com"; //hostIn n/a here var parser = new DOMParser(); var socialMediaWebsites = $(data.childNodes[0]).children attributes (e.g., host) for(var i=0; i<socialMediaWebsites.length; i++){ var smw = socialMediaWebsites[i]; if($(smw).find("homepage").text().indexOf(host) != -1) siteSpec = $(smw).find("specification").text(); • Make code general getCurrentSiteSpec(1,siteSpec,host); } //fi } //rof enough to be widely }, error: function(){} }); //xaja break; applicable to sections case 1: $.ajax({ url: urlIn, success: function(data){ var ls = window.content.localStorage; ls.setItem("spec", (new XMLSerializer()).serializeToStr archivefbBrowserOverlay.capture(ls.getItem("spec")); }, error : function(){} }; break; } } 8/3/2012 MS Thesis - August 2012 47
  • 48. Evaluation 1: Tool Adaptability Step 4: Change hierarchy of synthetic site • Simulate simply through mod_rewrite • Previously: RewriteRule ^personal$ index.php?section=personal [NC] • Updated: RewriteRule ^myfeed$ index.php?section=personal [NC] • Disavow previous reference altogether to ensure 404 8/3/2012 MS Thesis - August 2012 48
  • 49. Evaluation 1: Tool Adaptability Step 5: Show AFB breaking • Run archiving procedure again, note failing of procedure or content not captured 8/3/2012 MS Thesis - August 2012 49
  • 50. Evaluation 1: Tool Adaptability Step 6: Change synthetic site spec <?xml version="1.0" ?> <?xml version="1.0" ?> <socialMediaWebsite> <socialMediaWebsite> <homepage>http://test.socialstandard.org</homepage> <homepage>http://test.socialstandard.org</homepage> <sections> <sections> <socialMediaWebsiteSection <socialMediaWebsiteSection type="SocialMediaWebsiteSectionPersonalStream"> type="SocialMediaWebsiteSectionPersonalStream"> <name>Personal Stream</name> <name>Personal Stream</name> <url>http://test.socialstandard.org/personal</url> <url>http://test.socialstandard.org/myfeed</url> <preprocessor <preprocessor type="SocialMediaScrollPreprocessor"> type="SocialMediaScrollPreprocessor"> <timeBetweenFirings>0</timeBetweenFirings> <timeBetweenFirings>0</timeBetweenFirings> <maxFirings>0</maxFirings> <maxFirings>0</maxFirings> <conditionBeforeSubsequentFiring>?</conditionBefo <conditionBeforeSubsequentFiring>?</conditionBefo reSubsequentFiring> reSubsequentFiring> </preprocessor> </preprocessor> </socialMediaWebsiteSection> </socialMediaWebsiteSection> … … 8/3/2012 MS Thesis - August 2012 50
  • 51. Evaluation 1: Tool Adaptability Step 7: Show AFB functionality restored • Execute archiving procedure of tool w/o modifying code • Show that result matches step 1 8/3/2012 MS Thesis - August 2012 51
  • 52. Evaluation 2: Preservation of Content Behind Authentication 1. Create tool (WARCreate) to store to WARC format 2. Setup easy-to-use Replay system (local wayback) 3. Execute Tool’s Archiving Procedure 4. Verify replayability in wayback 8/3/2012 MS Thesis - August 2012 52
  • 53. Existing Tools’ Shortcoming: Facebook Data Dump • Lose look & feel • FB decides what is preserved • Unreliable (requests not always answered) 8/3/2012 MS Thesis - August 2012 53
  • 54. Existing Tools’ Shortcoming: “Save Webpage As” • Metadata is Lost • Archive is not Self-Contained • Archive is not interoperable with Archive Replay Systems (e.g. wayback) 8/3/2012 MS Thesis - August 2012 54
  • 55. Existing Tools’ Shortcoming: warc-tools • No archive creation facility • Relies on incomplete WARC spec (like WARCreate) • Only command-line access: suitable for sysadmins and power users 8/3/2012 MS Thesis - August 2012 55
  • 56. Existing Tools’ Shortcoming: wget &wget-warc • No content manipulation • Require CLI interaction – Issue for Ajax driven content (no JS support) • wget-warc – Ext. of wget w/ WARC I/O • No look & feel preservation 8/3/2012 MS Thesis - August 2012 56
  • 57. Existing Tools’ Shortcoming: Archive Facebook • Output is not compatible w/ Wayback • Prone to breaking when FB hierarchy changed • Limited to Firefox web browser • Cannot escape browser sandbox for portable archives 8/3/2012 MS Thesis - August 2012 57
  • 58. Existing Tools’ Shortcoming: WARCreate • No built-in sequential archiving • Relies on subset of WARC spec • Limited to Chrome 8/3/2012 MS Thesis - August 2012 58
  • 59. Shortcoming of Spec • Relies on accessible URIs of sites’ sections – If base page content does not have a URI mapping, no reference exists to direct the browser • Not comprehensive of Social Media sites • Likely doesn’t account for some section types 8/3/2012 MS Thesis - August 2012 59
  • 60. Future Work • Expand spec website coverage • Account for sites w/o clearly accessible URIs • WARCreate to implement whole official WARC standard • Other SocialMediaWebsitePreprocessor types • Address perspective issues – Personalization/Auth, context, archive vs. backup 8/3/2012 MS Thesis - August 2012 60
  • 61. Contributions 1. Highlight Personal Web Archiving difficulties – ways they can be addressed 2. Provide remote spec for PWA tools to use to be more robust to sites’ hierarchy changes 3. Create tool (WARCreate) – allows content behind auth to be preserved to standard format 4. Leverage client-side server to exec scripts in support of personal web preservation 5. Establish section commonality between social media websites 8/3/2012 MS Thesis - August 2012 61
  • 62. Conclusions • Personal web archiving has unique problems not exhibited in conventional web archiving • Tools become more adaptive by utilizing proposed spec • Browsers can be used as medium for preservation of personal web content • With little work, server technologies can help to ease the task of personal web archiving 8/3/2012 MS Thesis - August 2012 62
  • 63. WARCreate-Related Presentations ACM/IEEE Joint Conference on Digital Libraries Digital Preservation 2012 Innovation Award by NDSA/Library of Congress JCDL ‘12 For WARCreate Mat Kelly (Old Dominion University, Norfolk, VA), Michele C. Weigle (Old Dominion University, Norfolk, VA), Michael Nelson (Old Dominion University, Norfolk, VA). "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage," Digital Preservation 2012, Tools Demo Session: Web Archiving; 2012 Jul 25; Washington, DC. Mat Kelly (Old Dominion University, Norfolk, VA) and Michele C. Weigle (Old Dominion University, Norfolk, VA), "WARCreate - Create Wayback-Consumable WARC Files from Any Webpage (demo)," In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL). Washington, DC, June 2012 For more information on: WARCreate: http://warcreate.com Archive Facebook: http://bit.ly/archivefb 8/3/2012 MS Thesis - August 2012 63
  • 64. Example: Implicit Recursion 8/3/2012 MS Thesis - August 2012 65

Notes de l'éditeur

  1. Maintaining record of web is preserving digital heritageInternet Archive crawls and preserves webpages creating web archivesPreserved pages replayable at archive.orgOnly publicly accessible sites preserved
  2. Works well if you expend the energy to learn
  3. Internet Archive (IA) captured only public webCrawlers miss content behind authenticationQuantity of content behind auth &gt; public web∴ Large amount of content is not preserved
  4. Shows result from AFB / save webpage asFiles chaotically named
  5. State of resource subject to inputsBrowser sends but usually hides headers, not caught on capture by AFBREQUEST headers are important, allow overcome context/personalization issues
  6. Quick to fail, note whitespace issues