Between Dream and Reality: Open Data of the Czech Web Archive
1. Webarchiv
A memorial of the Czech internet, and more
OpenAlt 2016
http://www.slideshare.net/webarchivCZ/presentations
2. Why do we archive the web?
Who archives the web, and how?
Metadata
Rudolf.Kreibich@nkp.cz
Head of Application Support, National Library of the Czech Republic (NK ČR)
5. “… more than 70% of the URLs in the Harvard Law
Review and 50% of the URLs in opinions of the
Supreme Court of the United States link
to a web resource that no longer exists.”
Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations. Jonathan Zittrain,
Kendra Albert and Lawrence Lessig. Legal Information Management / Volume 14 / Issue 02 / June 2014, pp 88-99,
DOI: http://dx.doi.org/10.1017/S1472669614000255, Published online: 12 June 2014
7. 404 Not Found
The 404 (Not Found) status code indicates that the origin server did
not find a current representation for the target resource or is not
willing to disclose that one exists. A 404 status code does not
indicate whether this lack of representation is temporary or
permanent; the 410 (Gone) status code is preferred over 404 if the
origin server knows, presumably through some configurable means, that
the condition is likely to be permanent.
A 404 response is cacheable by default; i.e., unless otherwise
indicated by the method definition or explicit cache controls (see
Section 4.2.2 of [RFC7234]).
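The distinction the RFC text draws between 404 and 410 can be sketched as a small helper of the kind a link checker might use. This is only an illustration: the function name `describe_missing` and the wording of its return values are assumptions, not part of the talk or of any library.

```python
from http import HTTPStatus


def describe_missing(status: int) -> str:
    """Interpret a status code along the lines of the RFC text quoted above."""
    if status == HTTPStatus.GONE:  # 410
        return "gone: the server indicates the removal is likely permanent"
    if status == HTTPStatus.NOT_FOUND:  # 404
        return "not found: temporary or permanent, the server does not say"
    return "not a missing-resource status"


for code in (404, 410, 200):
    print(code, describe_missing(code))
```

For an archive the practical point is that a 404 alone never tells you whether a resource will come back, so it cannot be used to decide that a page is safe to stop crawling.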
9. “It is far easier to find an example of a film from
1924 than a website from 1994.”
M.S. Ankerson. “Writing web histories with an eye on the analog past.” 2012.
http://nms.sagepub.com/content/14/3/384.full.pdf+html
10. “Will it be possible to study our century without
web archives?”
Ian Milligan, Professor in the Department of History at the University of Waterloo.
18. IIPC | International Internet Preservation Consortium
Membership composition
2x regional libraries
32x national libraries (including the Czech Republic)
3x non-profit organizations
9x research organizations or universities
http://netpreserve.org/about-us/members
21. The dark age of JavaScript
“Brozzler is a distributed web crawler
(爬虫) that uses a real browser (chrome
or chromium) to fetch pages and
embedded urls and to extract links.”
https://github.com/internetarchive/brozzler
26. … since 2001
~221 TB
~6 billion digital objects / URLs
~1.2 million .cz domains
28. Less than 1% is freely accessible
=
~4,738 websites out of 1.2 million websites
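The share behind "less than 1%" can be checked directly from the two figures on this slide. The snippet below is plain arithmetic; the variable names are illustrative and the total is the rounded 1.2 million given on the slide, so the result is approximate.

```python
freely_accessible = 4738      # websites, figure from the slide
total_sites = 1_200_000       # roughly 1.2 million, figure from the slide

share = freely_accessible / total_sites
print(f"{share:.2%}")  # prints 0.39%
```

So the freely accessible part is roughly four websites in every thousand archived.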
31. Operations | gradual move to Infrastructure as Code
The light side of the Force
Ansible
Vagrant
Packer
Docker?
…
The dark and seductive side
VMware vCenter
IBM GPFS
34. “The Common Crawl corpus contains petabytes of data collected
over the last 7 years.
It contains raw web page data, extracted metadata and text
extractions.
The Common Crawl dataset lives on Amazon S3 as part of the
Amazon Public Datasets program.
From Public Data Sets, you can download the files entirely free using
HTTP or S3.
As the Common Crawl Foundation has evolved over the years, so has
the format and metadata that accompany the crawls themselves.”
http://commoncrawl.org/the-data/get-started/
36. “In my view, Google does not archive; it
caches.”
me, on more than one occasion
39. Wayback CDX Server API
plain text or JSON array of the CDX data
urlkey: org,archive
timestamp: 19970126045828
original: http://www.archive.org:80
mimetype: text/html
statuscode: 200
digest: Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY
length: 1415
https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md
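A minimal sketch of reading one plain-text CDX line into the seven fields listed on this slide. It assumes the fields are separated by single spaces and appear in the order shown above; the sample line is assembled from the slide's values, and the names `CDX_FIELDS` and `parse_cdx_line` are illustrative, not part of the Wayback code.

```python
from datetime import datetime

# Field order as listed on the slide (assumed to match the plain-text output).
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length"]


def parse_cdx_line(line: str) -> dict:
    """Split one space-separated CDX line into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(values)}")
    record = dict(zip(CDX_FIELDS, values))
    # The 14-digit timestamp is YYYYMMDDhhmmss.
    record["captured_at"] = datetime.strptime(record["timestamp"], "%Y%m%d%H%M%S")
    record["length"] = int(record["length"])
    return record


line = ("org,archive 19970126045828 http://www.archive.org:80 text/html 200 "
        "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415")
record = parse_cdx_line(line)
print(record["captured_at"].isoformat())  # prints 1997-01-26T04:58:28
```

The status code is deliberately left as a string, since an index may hold a placeholder rather than a number for some records.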
40. WAT | Metadata for archived objects | JSON
WARC-Header-Metadata:
  WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
  WARC-Type: response
  WARC-Date: 2014-08-02T09:52:13Z
  …
Payload-Metadata:
  HTTP-Response-Metadata:
    Headers:
      Content-Language:
      Content-Encoding:
      ...
    HTML-Metadata:
      Head:
        Title: BBC NEWS | Africa | Namibia braces for Nujoma exit
        …
        Metas:
          name: keywords
          content: BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service
          …
      Links:
        href: /css/screen/shared/styles.css
        path: STYLE/#text
        …
http://commoncrawl.org/the-data/get-started/
https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat
https://webarchive.jira.com/wiki/display/Iresearch/archive-metadata-extractor.jar
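A hedged sketch of pulling the title and link targets out of a WAT-style JSON record. The sample record is hand-built from the values on this slide; the top-level `Envelope` key and the exact nesting are assumptions about the format, and `summarize_wat` is a hypothetical helper name.

```python
import json

# Hand-built sample using the slide's values; the nesting is an assumption.
wat_json = """
{"Envelope": {
  "WARC-Header-Metadata": {
    "WARC-Target-URI": "http://news.bbc.co.uk/2/hi/africa/3414345.stm",
    "WARC-Type": "response",
    "WARC-Date": "2014-08-02T09:52:13Z"},
  "Payload-Metadata": {
    "HTTP-Response-Metadata": {
      "HTML-Metadata": {
        "Head": {"Title": "BBC NEWS | Africa | Namibia braces for Nujoma exit"},
        "Links": [{"href": "/css/screen/shared/styles.css",
                   "path": "STYLE/#text"}]}}}}}
"""


def summarize_wat(record: dict) -> dict:
    """Return URI, capture date, page title and link targets of one record."""
    envelope = record["Envelope"]
    header = envelope["WARC-Header-Metadata"]
    html = (envelope.get("Payload-Metadata", {})
            .get("HTTP-Response-Metadata", {})
            .get("HTML-Metadata", {}))
    return {
        "uri": header["WARC-Target-URI"],
        "date": header["WARC-Date"],
        "title": html.get("Head", {}).get("Title"),
        "links": [link["href"] for link in html.get("Links", []) if "href" in link],
    }


summary = summarize_wat(json.loads(wat_json))
print(summary["title"])
print(summary["links"])
```

The chained `.get` calls keep the helper usable for non-HTML records, which would carry no `HTML-Metadata` section at all.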
42. WET | Extracted full text
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID: <urn:uuid:007d632a-ab5a-4c4e-afc2-c455066a82de>
WARC-Refers-To: <urn:uuid:ffbfb0c0-6456-42b0-af03-3867be6fc09f>
WARC-Block-Digest: sha1:JROHLCS5SKMBR6XY46WXREW7RXM64EJC
Content-Type: text/plain
Content-Length: 6724
BBC NEWS | Africa | Namibia braces for Nujoma exit
[an error occurred while processing this directive]
…
Your news when you want it
News Front Page
Africa
…
HausaPortuguese Africa More Last Updated: Thursday, 22 January, 2004, 00:48 GMT
E-mail this to a friend
Printable version
…
Swapo has been careful to secure the Ovambo vote by ploughing a large slice of development funding into the region,
and the people there get more than their fair share of government positions.
For the moment, Mr Nujoma's biggest headache is land reform. Huge tracks of land are still owned by a few white
farmers and black Namibians are impatient at the slow pace of reform. White farmers say they are falling over
backwards to please the government, but Mr Pahamba says that they are only handing over poor quality land.
Meanwhile, the militant black farmer's union is threatening farm occupations similar to those in Zimbabwe. Guard dogs
…
http://commoncrawl.org/the-data/get-started/
https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat
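A small sketch of separating the header block of a WET record from its extracted text. It assumes the headers end at the first blank line and that lines are separated by "\n" (real files may use "\r\n"); the sample is a shortened copy of the record above, so its Content-Length does not match the truncated body. `parse_wet_record` is a hypothetical name.

```python
def parse_wet_record(raw: str) -> tuple[dict, str]:
    """Split a record into (headers, text) at the first blank line."""
    head, _, body = raw.partition("\n\n")
    lines = head.splitlines()
    headers = {"version": lines[0]}          # e.g. "WARC/1.0"
    for line in lines[1:]:
        name, _, value = line.partition(": ")
        headers[name] = value
    return headers, body


raw_record = (
    "WARC/1.0\n"
    "WARC-Type: conversion\n"
    "WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm\n"
    "WARC-Date: 2014-08-02T09:52:13Z\n"
    "Content-Type: text/plain\n"
    "Content-Length: 6724\n"
    "\n"
    "BBC NEWS | Africa | Namibia braces for Nujoma exit\n"
)
headers, text = parse_wet_record(raw_record)
print(headers["WARC-Type"], headers["WARC-Target-URI"])
print(text.strip())
```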
43. LGA | Metadata on relationships between URLs over time
ID-Map
url: https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en
surt_url: com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw
id: 294869
example
{"url":"https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en","surt_url":"com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw","id":294869}
ID-Graph
timestamp: 20150209052911
id: 294869
outlink_ids: 31, 31366, 62596, 91594, 91595, …
example
{"timestamp":"20150209052911","id":294869,"outlink_ids":[31,31366,62596,91594,91595,129599, …]}
https://webarchive.jira.com/wiki/display/ARS/LGA+Overview+and+Technical+Details
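The two LGA files are meant to be joined on the numeric id. A minimal sketch of that join follows, assuming one JSON object per line with the keys shown above. The first ID-Map line comes from the slide; the second entry (id 31, example.com) is invented purely so that one outlink resolves, and the outlink list is shortened. `resolve_outlinks` is a hypothetical helper.

```python
import json

id_map_lines = [
    '{"url":"https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en",'
    '"surt_url":"com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw","id":294869}',
    # Hypothetical entry, invented for illustration only.
    '{"url":"http://example.com/","surt_url":"com,example)/","id":31}',
]
id_graph_lines = [
    '{"timestamp":"20150209052911","id":294869,"outlink_ids":[31,31366,62596]}',
]

# Build the id -> URL lookup from the ID-Map.
id_to_url = {}
for line in id_map_lines:
    entry = json.loads(line)
    id_to_url[entry["id"]] = entry["url"]


def resolve_outlinks(graph_line: str):
    """Return (source URL, timestamp, target URLs); unknown ids become None."""
    node = json.loads(graph_line)
    targets = [id_to_url.get(i) for i in node["outlink_ids"]]
    return id_to_url.get(node["id"]), node["timestamp"], targets


source, timestamp, targets = resolve_outlinks(id_graph_lines[0])
print(source, timestamp, targets)
```

Keeping URLs out of the graph file and referring to them by id is what keeps the link graph compact enough to process for every capture time.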
44. WANE | Extracted named entities
url: http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/?like_comment=79&_wpnonce=0fc57aa499&replytocom=93
timestamp: 20141019212346
named_entities:
locations: North County, America, St. Louis County St. Louis County
Police St. Louis County, WordPress.com, Middle East, …
organizations: Twitter Facebook Google, Google, Facebook, Wal-Mart,
CNN, Bearcats, …
persons: Stell, Tom Jackson, Smith, Pamela Fillingim, Darren Wilson
Eric Fowler Eric Vickers Ferguson Ferguson, Ferguson, …
digest: sha1:747IKFWUCVQVXY7TX2NMYFL422T4TRQX
Extracted with the Stanford Named Entity Recognizer (NER)
http://nlp.stanford.edu/software/CRF-NER.shtml
https://webarchive.jira.com/wiki/display/ARS/WANE+Overview+and+Technical+Details
47. Open nsfw model
“This repo contains code for running Not Suitable for Work
(NSFW) classification deep neural network Caffe models.”
https://github.com/yahoo/open_nsfw/blob/master/
54. What can be done with the metadata?
evolution of formats on the web
evolution of interlinking between websites
evolution of NSFW websites on the domain
evolution of the graphics-to-text ratio on the web
evolution of web technologies
…
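As an illustration of the first question on this slide, how formats on the web change over time, capture metadata can be grouped by year and MIME type. The sketch below uses only the timestamp and mimetype fields of CDX-style records; the sample rows are invented for illustration and `mimetypes_by_year` is a hypothetical helper, not something shown in the talk.

```python
from collections import Counter, defaultdict

# Invented (timestamp, mimetype) pairs standing in for real index records.
captures = [
    ("19970126045828", "text/html"),
    ("19970301120000", "image/gif"),
    ("20150209052911", "text/html"),
    ("20150209052912", "application/javascript"),
    ("20150210101010", "text/html"),
]


def mimetypes_by_year(rows):
    """Count captures per MIME type for each capture year."""
    per_year = defaultdict(Counter)
    for timestamp, mimetype in rows:
        per_year[timestamp[:4]][mimetype] += 1   # first four digits = year
    return per_year


stats = mimetypes_by_year(captures)
for year in sorted(stats):
    print(year, dict(stats[year]))
```

Because this works on the index alone, such statistics could in principle be published as open data even where the archived content itself is not freely accessible.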
55. Web Archiving Department | ODIF | National Library of the Czech Republic (NK ČR)
Head: Jaroslav Kvasnica
Curators: Marie Haškovcová, Monika Holoubková, Markéta Hrdličková
IT Operations: Rudolf.Kreibich@nkp.cz
webarchiv.cz
facebook.com/webarchivcz
slideshare.net/webarchivCZ
github.com/webarchivcz