Webarchiv
A memorial of the Czech internet, and more
OpenAlt 2016
Between dream and reality.
Open data of the Czech web archive.
http://www.slideshare.net/webarchivCZ/presentations
Why do we archive the web?
Who archives the web, and how?
Metadata
Rudolf.Kreibich@nkp.cz
Head of Application Support, National Library of the Czech Republic (NK ČR)
Why do we archive the web?
“… more than 70% of the URLs in the Harvard Law
Review and 50% of the URLs in United States
Supreme Court opinions point to a web resource
that no longer exists.”
Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations. Jonathan Zittrain,
Kendra Albert and Lawrence Lessig. Legal Information Management / Volume 14 / Issue 02 / June 2014, pp 88-99,
DOI: http://dx.doi.org/10.1017/S1472669614000255, Published online: 12 June 2014
404 Not Found
The 404 (Not Found) status code indicates that the origin server did
not find a current representation for the target resource or is not
willing to disclose that one exists. A 404 status code does not
indicate whether this lack of representation is temporary or
permanent; the 410 (Gone) status code is preferred over 404 if the
origin server knows, presumably through some configurable means, that
the condition is likely to be permanent.
A 404 response is cacheable by default; i.e., unless otherwise
indicated by the method definition or explicit cache controls (see
Section 4.2.2 of [RFC7234]).
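The distinction the RFC text draws between 404 and 410 can be turned into a rough link-rot label. This is only a minimal sketch using the Python standard library; the function name `classify_link` and the label strings are illustrative assumptions, not something taken from the talk.

```python
from http import HTTPStatus


def classify_link(status_code: int) -> str:
    """Return a rough link-rot label for an HTTP status code.

    The label names are hypothetical; only the 404/410 semantics
    come from the RFC text quoted above.
    """
    if status_code == HTTPStatus.GONE:
        # 410: the server states the resource is likely gone for good.
        return "gone-permanently"
    if status_code == HTTPStatus.NOT_FOUND:
        # 404: missing, but whether temporarily or permanently is unknown.
        return "missing-unknown-duration"
    if 200 <= status_code < 300:
        return "alive"
    return "other"
```

A link checker could call `classify_link(response_status)` for every cited URL and treat only the first two labels as candidates for lookup in a web archive.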
✝
uri
“It is easier to find a copy of a film from
1924 than a website from 1994.”
M.S. Ankerson. “Writing web histories with an eye on the analog past.” 2012. 

http://nms.sagepub.com/content/14/3/384.full.pdf+html
“Will it be possible to study our century without
web archives?”
Ian Milligan, Professor in the Department of History at the University of Waterloo.
Who archives the web, and how?
“Universal access to all knowledge.”
Brewster Kahle
IIPC | International Internet Preservation Consortium
Membership
2x regional libraries
32x national libraries (including the Czech Republic)
3x non-profit organizations
9x research organizations or universities
http://netpreserve.org/about-us/members
Heritrix / OpenWayback
harvesting / access
Open-source software
International community
https://github.com/iipc/openwayback
https://github.com/internetarchive/heritrix3
The dark age of JavaScript
“Brozzler is a distributed web crawler
(爬⾍) that uses a real browser (chrome
or chromium) to fetch pages and
embedded urls and to extract links.”
https://github.com/internetarchive/brozzler
Heritrix harvests 2065 URL/s
PhantomJS harvests 172 URL/s
=>
scale out the JS interpreters
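The gap between the two throughput figures gives a feel for how far the browser-based crawl has to be scaled out. The sketch below is a back-of-the-envelope calculation that assumes throughput grows linearly with the number of PhantomJS workers, which real crawls rarely achieve; the function name `workers_needed` is illustrative.

```python
import math

HERITRIX_URLS_PER_S = 2065   # figure quoted on the slide
PHANTOMJS_URLS_PER_S = 172   # figure quoted on the slide


def workers_needed(target_rate: float, per_worker_rate: float) -> int:
    """Parallel workers required to reach target_rate (assumes linear scaling)."""
    return math.ceil(target_rate / per_worker_rate)


slowdown = HERITRIX_URLS_PER_S / PHANTOMJS_URLS_PER_S
workers = workers_needed(HERITRIX_URLS_PER_S, PHANTOMJS_URLS_PER_S)
print(f"PhantomJS is about {slowdown:.0f}x slower; {workers} workers would match Heritrix")
```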
Monthly selective harvests
Occasional topical harvests
Semi-annual harvests of the .cz domain
(in cooperation with nic.cz)
… since 2001
~ 221 TB
~ 6 billion digital objects / URLs
~ 1.2 million .cz domains
less than 1% is freely accessible
=
~ 4738 websites out of 1.2 million websites
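The figures above can be checked with a few lines of arithmetic. This sketch assumes decimal terabytes (10^12 bytes) and treats the approximate slide figures as exact, so the results are rough orders of magnitude only.

```python
ACCESSIBLE_SITES = 4738
TOTAL_CZ_DOMAINS = 1_200_000
ARCHIVE_BYTES = 221 * 10**12   # ~221 TB, assuming decimal terabytes
OBJECTS = 6 * 10**9            # ~6 billion digital objects / URLs

# Share of harvested websites that can be shown to the public.
share_percent = 100 * ACCESSIBLE_SITES / TOTAL_CZ_DOMAINS

# Average stored size of one archived object, in kilobytes.
avg_object_kb = ARCHIVE_BYTES / OBJECTS / 1000

print(f"{share_percent:.2f} % of websites are freely accessible")
print(f"{avg_object_kb:.1f} kB per archived object on average")
```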
Operations | gradual move to Infrastructure as Code
The good side of the Force
Ansible
Vagrant
Packer
Docker?
…
The dark and seductive side
VMware vCenter
IBM GPFS
http://arquivo.pt/search.jsp?l=en&query=prase
“The Common Crawl corpus contains petabytes of data collected
over the last 7 years.
It contains raw web page data, extracted metadata and text
extractions.
The Common Crawl dataset lives on Amazon S3 as part of the
Amazon Public Datasets program.
From Public Data Sets, you can download the files entirely free using
HTTP or S3.
As the Common Crawl Foundation has evolved over the years, so has
the format and metadata that accompany the crawls themselves.”
http://commoncrawl.org/the-data/get-started/
“In my opinion, Google does not archive;
it caches.”
me, on several occasions
metadata
WARC | ISO 28500:2009 | Under revision
WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID:
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID:
WARC-Concurrent-To:
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Payload-Digest:
sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
WARC-Block-Digest:
sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
WARC-Truncated: length
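A WARC record header is a version line followed by `Name: value` fields, so it can be read with plain string handling. The following is a minimal sketch, not a full WARC parser: it ignores the record body and continuation lines, and the sample reuses only the fields whose values are shown above. The function name `parse_warc_headers` is an assumption for illustration.

```python
SAMPLE_HEADER = """WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
WARC-Truncated: length
"""


def parse_warc_headers(block: str):
    """Split a WARC header block into (version, {field: value})."""
    lines = block.strip().splitlines()
    version = lines[0].strip()
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        # Split on the first colon only: timestamps, URLs and
        # "sha1:..." digests contain colons themselves.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return version, headers


version, headers = parse_warc_headers(SAMPLE_HEADER)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"])
```

For production use, an existing WARC library is a safer choice than hand-rolled parsing.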
Wayback CDX Server API
plain text or JSON array of the CDX data
urlkey: org,archive
timestamp: 19970126045828
original: http://www.archive.org:80
mimetype: text/html
statuscode: 200
digest: Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY
length: 1415
https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md
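In the plain-text form, each CDX record is one line of space-separated values in the field order listed above. A minimal sketch of reading such a line follows; it assumes exactly those seven fields, and the sample line is assembled from the values on the slide. The names `CDX_FIELDS` and `parse_cdx_line` are illustrative.

```python
from datetime import datetime

# Field order as listed on the slide.
CDX_FIELDS = ("urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length")


def parse_cdx_line(line: str) -> dict:
    """Turn one space-separated CDX line into a dict."""
    values = line.split()
    if len(values) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(values)}")
    record = dict(zip(CDX_FIELDS, values))
    # Timestamps are 14 digits: YYYYMMDDhhmmss.
    record["captured_at"] = datetime.strptime(record["timestamp"], "%Y%m%d%H%M%S")
    record["length"] = int(record["length"])
    return record


SAMPLE_LINE = ("org,archive 19970126045828 http://www.archive.org:80 "
               "text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415")
record = parse_cdx_line(SAMPLE_LINE)
print(record["original"], record["captured_at"].isoformat())
```

The status code is left as a string on purpose, since an index may hold a placeholder instead of a number for some captures.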
WAT | Metadata about archived objects | JSON
WARC-Header-Metadata:
    WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
    WARC-Type: response
    WARC-Date: 2014-08-02T09:52:13Z
    …
Payload-Metadata:
    HTTP-Response-Metadata:
        Headers:
            Content-Language:
            Content-Encoding:
            ...
        HTML-Metadata:
            Head:
                Title: BBC NEWS | Africa | Namibia braces for Nujoma exit
                …
                Metas:
                    name: keywords
                    content: BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service
                    …
            Links:
                href: /css/screen/shared/styles.css
                path: STYLE/#text
                …
http://commoncrawl.org/the-data/get-started/

https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat
https://webarchive.jira.com/wiki/display/Iresearch/archive-metadata-extractor.jar
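Because WAT records are JSON, the page title and outgoing links can be read without touching the archived HTML itself. The sketch below assumes the nesting outlined on the slide (`Head` and `Links` under `HTML-Metadata`); real WAT files may wrap this structure in further levels, and the sample record is a shortened, hand-written stand-in. The function name `summarize_html` is illustrative.

```python
import json

# Shortened stand-in for a WAT record, mirroring the slide's outline.
SAMPLE_WAT = json.loads("""
{
  "WARC-Header-Metadata": {
    "WARC-Target-URI": "http://news.bbc.co.uk/2/hi/africa/3414345.stm",
    "WARC-Type": "response"
  },
  "Payload-Metadata": {
    "HTTP-Response-Metadata": {
      "HTML-Metadata": {
        "Head": {
          "Title": "BBC NEWS | Africa | Namibia braces for Nujoma exit",
          "Metas": [{"name": "keywords", "content": "BBC, News"}]
        },
        "Links": [
          {"href": "/css/screen/shared/styles.css", "path": "STYLE/#text"}
        ]
      }
    }
  }
}
""")


def summarize_html(record: dict) -> dict:
    """Pull URI, title and link targets out of a WAT-like record."""
    html = (record.get("Payload-Metadata", {})
                  .get("HTTP-Response-Metadata", {})
                  .get("HTML-Metadata", {}))
    head = html.get("Head", {})
    return {
        "uri": record.get("WARC-Header-Metadata", {}).get("WARC-Target-URI"),
        "title": head.get("Title"),
        "hrefs": [link["href"] for link in html.get("Links", []) if "href" in link],
    }


summary = summarize_html(SAMPLE_WAT)
print(summary["title"], summary["hrefs"])
```

Using `.get()` with empty defaults keeps the function usable on non-HTML records, which simply yield `None` and an empty link list.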
WAT | Metadata about archived objects | JSON
Server response
"Headers" : {
"Date" : "Sat, 02 Aug 2014 09:52:13 GMT",
"Cache-Control" : "max-age=0",
"Connection" : "close",
"Expires" : "Sat, 02 Aug 2014 09:52:13 GMT",
"Content-Type" : "text/html",
"Server" : "Apache",
"Vary" : "X-CDN",
"Set-Cookie" : "BBC UID=15730d9c1b741c0d3942e2aca1317fbf39e57b90be68a329d375ba9d5a8964080CCBot%2f2%2e0%20%28http%3a%2f%2fcommoncrawl%2eorg%2ffaq%2f%29; expires=Sun, 02-Aug-15 09:52:13 GMT; path=/; domain=bbc.co.uk;"
WET | Extracted full text
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID: <urn:uuid:007d632a-ab5a-4c4e-afc2-c455066a82de>
WARC-Refers-To: <urn:uuid:ffbfb0c0-6456-42b0-af03-3867be6fc09f>
WARC-Block-Digest: sha1:JROHLCS5SKMBR6XY46WXREW7RXM64EJC
Content-Type: text/plain
Content-Length: 6724
BBC NEWS | Africa | Namibia braces for Nujoma exit
[an error occurred while processing this directive]
…
Your news when you want it
News Front Page
Africa
…
HausaPortuguese Africa More Last Updated: Thursday, 22 January, 2004, 00:48 GMT
E-mail this to a friend
Printable version
…
Swapo has been careful to secure the Ovambo vote by ploughing a large slice of development funding into the region,
and the people there get more than their fair share of government positions.
For the moment, Mr Nujoma's biggest headache is land reform. Huge tracks of land are still owned by a few white
farmers and black Namibians are impatient at the slow pace of reform. White farmers say they are falling over
backwards to please the government, but Mr Pahamba says that they are only handing over poor quality land.
Meanwhile, the militant black farmer's union is threatening farm occupations similar to those in Zimbabwe. Guard dogs
…
http://commoncrawl.org/the-data/get-started/

https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat
LGA | Metadata on relationships between URLs over time
ID-Map
url: https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en
surt_url: com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw
id: 294869

example

{"url":"https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en","surt_url":"com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw","id":294869}
ID-Graph
timestamp: 20150209052911
id: 294869
outlink_ids: 31, 31366, 62596, 91594, 91595, …

example
{"timestamp":"20150209052911","id":294869,"outlink_ids":[31,31366,62596,91594,91595,129599, …]}

https://webarchive.jira.com/wiki/display/ARS/LGA+Overview+and+Technical+Details
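The two LGA files are meant to be joined: ID-Map translates numeric ids to URLs, and ID-Graph lists, per capture, which ids a page links to. A minimal sketch of that join is given below, assuming one JSON object per line as in the examples above; the second ID-Map entry and the shortened outlink list are hypothetical, and the function names are illustrative.

```python
import json

ID_MAP_LINES = [
    '{"url":"https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en",'
    '"surt_url":"com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw","id":294869}',
    # Hypothetical entry so that one outlink can be resolved:
    '{"url":"http://example.org/","surt_url":"org,example)/","id":31}',
]
ID_GRAPH_LINES = [
    # Outlink list shortened for the example.
    '{"timestamp":"20150209052911","id":294869,"outlink_ids":[31,31366]}',
]


def load_id_map(lines):
    """Build an id -> url lookup table from ID-Map lines."""
    return {rec["id"]: rec["url"] for rec in map(json.loads, lines)}


def resolve_edges(graph_lines, id_to_url):
    """Yield (timestamp, source_url, target_url) for every outlink.

    Ids missing from the map resolve to None.
    """
    for line in graph_lines:
        rec = json.loads(line)
        source = id_to_url.get(rec["id"])
        for target_id in rec["outlink_ids"]:
            yield rec["timestamp"], source, id_to_url.get(target_id)


edges = list(resolve_edges(ID_GRAPH_LINES, load_id_map(ID_MAP_LINES)))
for edge in edges:
    print(edge)
```

For a full domain harvest the id map would not fit comfortably in a Python dict, so a key-value store or a dataframe join would be the more realistic choice.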
WANE | Extracted named entities
url: http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/?like_comment=79&_wpnonce=0fc57aa499&replytocom=93
timestamp: 20141019212346
named_entities:
locations: North County, America, St. Louis County St. Louis County
Police St. Louis County, WordPress.com, Middle East, …
organizations: Twitter Facebook Google, Google, Facebook, Wal-Mart,
CNN, Bearcats, …
persons: Stell, Tom Jackson, Smith, Pamela Fillingim, Darren Wilson
Eric Fowler Eric Vickers Ferguson Ferguson, Ferguson, …
digest: sha1:747IKFWUCVQVXY7TX2NMYFL422T4TRQX
Extracted with the Stanford Named Entity Recognizer (NER)
http://nlp.stanford.edu/software/CRF-NER.shtml
https://webarchive.jira.com/wiki/display/ARS/WANE+Overview+and+Technical+Details
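Once entities are extracted per URL, a simple aggregate is the number of captures that mention each entity. The sketch below assumes one JSON object per line with the keys shown above and entity lists stored as arrays; the entity lists are shortened, and the second record is hypothetical. The function name `count_entities` is illustrative.

```python
import json
from collections import Counter

WANE_LINES = [
    '{"url": "http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/'
    '?like_comment=79&_wpnonce=0fc57aa499&replytocom=93",'
    ' "timestamp": "20141019212346",'
    ' "named_entities": {"locations": ["America", "Middle East"],'
    ' "organizations": ["Google", "Facebook", "CNN"],'
    ' "persons": ["Tom Jackson", "Darren Wilson"]},'
    ' "digest": "sha1:747IKFWUCVQVXY7TX2NMYFL422T4TRQX"}',
    # Hypothetical second capture:
    '{"url": "http://example.org/", "timestamp": "20141020000000",'
    ' "named_entities": {"locations": [], "organizations": ["Google", "CNN", "CNN"],'
    ' "persons": []}, "digest": "sha1:EXAMPLE"}',
]


def count_entities(lines, kind):
    """Count in how many captures each entity of the given kind appears."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        entities = record.get("named_entities", {}).get(kind, [])
        # set() so that repeats within one capture count only once.
        counts.update(set(entities))
    return counts


org_counts = count_entities(WANE_LINES, "organizations")
print(org_counts)
```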
NameTag / CNEC 2.0 | WANE?
http://ufal.mff.cuni.cz/nametag

https://ufal.mff.cuni.cz/cnec/cnec2.0
Open nsfw model
“This repo contains code for running Not Suitable for Work
(NSFW) classification deep neural network Caffe models.”
https://github.com/yahoo/open_nsfw/blob/master/
audio2text
How to make the metadata accessible?
bulk data
bulk data in S3
API
web service
What to do with the metadata?
evolution of formats on the web
evolution of interlinking between websites
evolution of NSFW websites within the domain
evolution of the graphics-to-text ratio on the web
evolution of web technologies
…
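The first item, the evolution of formats on the web, could be approximated from a CDX index alone by counting MIME types per capture year. The following is a minimal sketch under that assumption; the CDX lines are hypothetical sample data, and the function name `mimetypes_by_year` is illustrative.

```python
from collections import Counter, defaultdict

# Hypothetical CDX lines: urlkey timestamp original mimetype status digest length
CDX_LINES = [
    "org,example)/ 20010101000000 http://example.org/ text/html 200 AAAA 100",
    "org,example)/a.gif 20010202000000 http://example.org/a.gif image/gif 200 BBBB 200",
    "org,example)/ 20160101000000 http://example.org/ text/html 200 CCCC 300",
    "org,example)/a.png 20160101000000 http://example.org/a.png image/png 200 DDDD 400",
]


def mimetypes_by_year(cdx_lines):
    """Return {year: Counter({mimetype: captures})} from CDX lines."""
    per_year = defaultdict(Counter)
    for line in cdx_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip malformed lines
        timestamp, mimetype = fields[1], fields[3]
        per_year[timestamp[:4]][mimetype] += 1
    return per_year


stats = mimetypes_by_year(CDX_LINES)
for year in sorted(stats):
    print(year, dict(stats[year]))
```

Counting captures rather than bytes is a design choice here; weighting by the `length` column would instead show how storage is split between formats.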
Web Archiving Department | ODIF | NK ČR
Head: Jaroslav Kvasnica
Curators: Marie Haškovcová, Monika Holoubková, Markéta Hrdličková
IT Operations: Rudolf.Kreibich@nkp.cz
webarchiv.cz
facebook.com/webarchivcz
slideshare.net/webarchivCZ
github.com/webarchivcz

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

Between Dream and Reality. Open Data of the Czech Web Archive.

  • 1. Webarchiv: Memorial of the Czech Internet, and more. OpenAlt 2016. Between Dream and Reality. Open Data of the Czech Web Archive. http://www.slideshare.net/webarchivCZ/presentations
  • 2. Why do we archive the web? Who archives the web, and how? Metadata. Rudolf.Kreibich@nkp.cz, head of application support, National Library of the Czech Republic (NK ČR)
  • 4.
  • 5. “… more than 70% of the URLs in the Harvard Law Review and 50% of the URLs in United States Supreme Court opinions point to a web resource that no longer exists.” Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations. Jonathan Zittrain, Kendra Albert and Lawrence Lessig. Legal Information Management / Volume 14 / Issue 02 / June 2014, pp 88-99, DOI: http://dx.doi.org/10.1017/S1472669614000255, Published online: 12 June 2014
  • 6.
  • 7. 404 Not Found The 404 (Not Found) status code indicates that the origin server did not find a current representation for the target resource or is not willing to disclose that one exists. A 404 status code does not indicate whether this lack of representation is temporary or permanent; the 410 (Gone) status code is preferred over 404 if the origin server knows, presumably through some configurable means, that the condition is likely to be permanent. A 404 response is cacheable by default; i.e., unless otherwise indicated by the method definition or explicit cache controls (see Section 4.2.2 of [RFC7234]).
  • 9. “It is easier to find a copy of a film from 1924 than a website from 1994.” M.S. Ankerson. “Writing web histories with an eye on the analog past.” 2012.
 http://nms.sagepub.com/content/14/3/384.full.pdf+html
  • 10. “Will it be possible to study our century without web archives?” Ian Milligan, Professor in the Department of History at the University of Waterloo.
  • 11. Who archives the web, and how?
  • 12.
  • 13. “Universal access to all knowledge.” Brewster Kahle
  • 14.
  • 15.
  • 16.
  • 17.
  • 18. IIPC | International Internet Preservation Consortium. Membership: 2 regional libraries, 32 national libraries (including the Czech Republic), 3 non-profit organizations, 9 research organizations or universities. http://netpreserve.org/about-us/members
  • 19. Heritrix / OpenWayback: harvesting / access. Open-source software. International community. https://github.com/iipc/openwayback https://github.com/internetarchive/heritrix3
  • 20.
  • 21. The Dark Age of JavaScript. “Brozzler is a distributed web crawler (爬虫) that uses a real browser (chrome or chromium) to fetch pages and embedded urls and to extract links.” https://github.com/internetarchive/brozzler
  • 22. Heritrix harvests 2065 URL/s; PhantomJS harvests 172 URL/s => the JavaScript interpreters have to be scaled out
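The gap between the two harvesting rates can be turned into a rough capacity estimate. The sketch below uses only the two figures from the slide; the assumption that throughput grows linearly with the number of parallel browser-based workers is mine and is optimistic, and `workers_needed` is a hypothetical helper name.

```python
import math

# Figures quoted on the slide (URLs harvested per second).
HERITRIX_URLS_PER_SEC = 2065
PHANTOMJS_URLS_PER_SEC = 172


def workers_needed(target_rate: float, per_worker_rate: float) -> int:
    """Smallest number of parallel workers whose combined rate reaches target_rate.

    Assumes perfectly linear scaling, which real crawls rarely achieve.
    """
    return math.ceil(target_rate / per_worker_rate)


slowdown = HERITRIX_URLS_PER_SEC / PHANTOMJS_URLS_PER_SEC
workers = workers_needed(HERITRIX_URLS_PER_SEC, PHANTOMJS_URLS_PER_SEC)
print(f"slowdown: {slowdown:.1f}x, workers needed: {workers}")
```

Under that assumption, browser-based harvesting is about twelve times slower per instance, so matching a single Heritrix crawler takes on the order of a dozen JavaScript-capable workers.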
  • 23.
  • 24.
  • 25. Monthly selective harvests. Occasional thematic harvests. Semiannual harvests of the .cz domain (in cooperation with nic.cz)
  • 26. … since 2001: ~221 TB, ~6 billion digital objects / URLs, ~1.2 million .cz domains
  • 27.
  • 28. Less than 1% is freely accessible = ~4738 websites out of 1.2 million (about 0.4%)
  • 29.
  • 30.
  • 31. Operations | gradual move to Infrastructure as Code. The good side of the Force: Ansible, Vagrant, Packer, Docker? … The dark and seductive side: VMware vCenter, IBM GPFS
  • 33.
  • 34. “The Common Crawl corpus contains petabytes of data collected over the last 7 years. It contains raw web page data, extracted metadata and text extractions. The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. From Public Data Sets, you can download the files entirely free using HTTP or S3. As the Common Crawl Foundation has evolved over the years, so has the format and metadata that accompany the crawls themselves.” http://commoncrawl.org/the-data/get-started/
  • 35.
  • 36. “In my opinion, Google does not archive, it caches.” Me, on several occasions
  • 38. WARC | ISO 28500:2009 | currently under revision
      WARC/1.0
      WARC-Type: response
      WARC-Date: 2014-08-02T09:52:13Z
      WARC-Record-ID:
      Content-Length: 43428
      Content-Type: application/http; msgtype=response
      WARC-Warcinfo-ID:
      WARC-Concurrent-To:
      WARC-IP-Address: 212.58.244.61
      WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
      WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
      WARC-Block-Digest: sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
      WARC-Truncated: length
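A WARC record header is a version line followed by `Name: value` fields, so it can be read with a few lines of code. The following is a minimal sketch, not a full WARC reader: it assumes an uncompressed header block already held in a string, ignores folded continuation lines, and uses only the fields from the slide whose values survived. `parse_warc_headers` is a hypothetical helper name; for real archives a dedicated WARC library is the safer choice.

```python
from datetime import datetime, timezone

# Header block taken from the slide (fields with missing values omitted).
SAMPLE_HEADER = """\
WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
"""


def parse_warc_headers(block: str) -> tuple[str, dict[str, str]]:
    """Split a WARC header block into (version line, {field name: value})."""
    lines = block.strip().splitlines()
    version = lines[0].strip()
    headers: dict[str, str] = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header block
        # Split on the first colon only, so URLs and timestamps stay intact.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers


version, headers = parse_warc_headers(SAMPLE_HEADER)
captured_at = datetime.strptime(
    headers["WARC-Date"], "%Y-%m-%dT%H:%M:%SZ"
).replace(tzinfo=timezone.utc)
print(version, headers["WARC-Type"], headers["WARC-Target-URI"], captured_at.isoformat())
```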
  • 39. Wayback CDX Server API: plain text or JSON array of the CDX data
      urlkey: org,archive
      timestamp: 19970126045828
      original: http://www.archive.org:80
      mimetype: text/html
      statuscode: 200
      digest: Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY
      length: 1415
    https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md
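As a sketch of how the JSON output of a CDX server might be consumed: the response is an array of rows whose first row carries the field names. The endpoint URL and the `url`, `output` and `limit` query parameters below are assumptions based on the Internet Archive's public CDX server and should be checked against the README linked on the slide; the sample row reuses the slide's values, and nothing here contacts the network. `build_query`, `rows_to_dicts` and `capture_time` are hypothetical helper names.

```python
import json
from datetime import datetime
from urllib.parse import urlencode

# Assumed endpoint of the Internet Archive's CDX server; other installations differ.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_query(url: str, limit: int = 1) -> str:
    """Build a CDX query URL that asks for JSON output."""
    return CDX_ENDPOINT + "?" + urlencode({"url": url, "output": "json", "limit": limit})


def rows_to_dicts(payload: str) -> list[dict[str, str]]:
    """Turn the JSON array-of-arrays (header row first) into a list of dicts."""
    rows = json.loads(payload)
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def capture_time(record: dict[str, str]) -> datetime:
    """Parse the 14-digit CDX timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(record["timestamp"], "%Y%m%d%H%M%S")


# Offline sample built from the values shown on the slide.
SAMPLE_RESPONSE = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["org,archive", "19970126045828", "http://www.archive.org:80", "text/html",
     "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415"],
])

records = rows_to_dicts(SAMPLE_RESPONSE)
print(build_query("archive.org"))
print(records[0]["original"], capture_time(records[0]).isoformat())
```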
  • 40. WAT | Metadata about archived objects | JSON
      WARC-Header-Metadata:
        WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
        WARC-Type: response
        WARC-Date: 2014-08-02T09:52:13Z
        …
      Payload-Metadata:
        HTTP-Response-Metadata:
          Headers:
            Content-Language:
            Content-Encoding:
            ...
          HTML-Metadata:
            Head:
              Title: BBC NEWS | Africa | Namibia braces for Nujoma exit
              …
              Metas:
                name: keywords
                content: BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service
                …
            Links:
              href: /css/screen/shared/styles.css
              path: STYLE/#text
              …
    http://commoncrawl.org/the-data/get-started/
    https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat
    https://webarchive.jira.com/wiki/display/Iresearch/archive-metadata-extractor.jar
  • 41. WAT | Metadata about archived objects | JSON | Server response
      "Headers" : {
        "Date" : "Sat, 02 Aug 2014 09:52:13 GMT",
        "Cache-Control" : "max-age=0",
        "Connection" : "close",
        "Expires" : "Sat, 02 Aug 2014 09:52:13 GMT",
        "Content-Type" : "text/html",
        "Server" : "Apache",
        "Vary" : "X-CDN",
        "Set-Cookie" : "BBC UID=15730d9c1b741c0d3942e2aca1317fbf39e57b90be68a329d375ba9d5a8964080CCBot%2f2%2e0%20%28http%3a%2f%2fcommoncrawl%2eorg%2ffaq%2f%29; expires=Sun, 02-Aug-15 09:52:13 GMT; path=/; domain=bbc.co.uk;"
    http://commoncrawl.org/the-data/get-started/
    https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat
    https://webarchive.jira.com/wiki/display/Iresearch/archive-metadata-extractor.jar
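Because WAT metadata is nested JSON, most of the work in using it is walking a key path safely. The sketch below assumes a simplified record that mirrors the nesting shown on the two slides; real WAT files may wrap these keys in an outer object, in which case the paths need adjusting. `dig` and `summarize` are hypothetical helper names.

```python
import json

# Simplified record following the structure on the slides (values abbreviated).
SAMPLE_WAT = json.dumps({
    "WARC-Header-Metadata": {
        "WARC-Target-URI": "http://news.bbc.co.uk/2/hi/africa/3414345.stm",
        "WARC-Type": "response",
        "WARC-Date": "2014-08-02T09:52:13Z",
    },
    "Payload-Metadata": {
        "HTTP-Response-Metadata": {
            "Headers": {"Content-Type": "text/html", "Server": "Apache"},
            "HTML-Metadata": {
                "Head": {
                    "Title": "BBC NEWS | Africa | Namibia braces for Nujoma exit",
                    "Metas": [{"name": "keywords", "content": "BBC, News, BBC News"}],
                },
                "Links": [{"href": "/css/screen/shared/styles.css", "path": "STYLE/#text"}],
            },
        }
    },
})


def dig(obj, *keys, default=None):
    """Follow a chain of dict keys, returning default as soon as one is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return default
        obj = obj[key]
    return obj


def summarize(record: dict) -> dict:
    """Pull a few commonly analysed fields out of one WAT-style record."""
    response = dig(record, "Payload-Metadata", "HTTP-Response-Metadata", default={})
    html = dig(response, "HTML-Metadata", default={})
    return {
        "uri": dig(record, "WARC-Header-Metadata", "WARC-Target-URI"),
        "server": dig(response, "Headers", "Server"),
        "title": dig(html, "Head", "Title"),
        "link_count": len(html.get("Links", [])),
    }


summary = summarize(json.loads(SAMPLE_WAT))
print(summary)
```

Returning a default instead of raising keeps a batch job running when a record lacks HTML metadata, which is the case for images and other non-HTML payloads.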
  • 42. WET | Extracted full text
      WARC/1.0
      WARC-Type: conversion
      WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
      WARC-Date: 2014-08-02T09:52:13Z
      WARC-Record-ID: <urn:uuid:007d632a-ab5a-4c4e-afc2-c455066a82de>
      WARC-Refers-To: <urn:uuid:ffbfb0c0-6456-42b0-af03-3867be6fc09f>
      WARC-Block-Digest: sha1:JROHLCS5SKMBR6XY46WXREW7RXM64EJC
      Content-Type: text/plain
      Content-Length: 6724
    BBC NEWS | Africa | Namibia braces for Nujoma exit [an error occurred while processing this directive] … Your news when you want it News Front Page Africa … HausaPortuguese Africa More Last Updated: Thursday, 22 January, 2004, 00:48 GMT E-mail this to a friend Printable version … Swapo has been careful to secure the Ovambo vote by ploughing a large slice of development funding into the region, and the people there get more than their fair share of government positions. For the moment, Mr Nujoma's biggest headache is land reform. Huge tracks of land are still owned by a few white farmers and black Namibians are impatient at the slow pace of reform. White farmers say they are falling over backwards to please the government, but Mr Pahamba says that they are only handing over poor quality land. Meanwhile, the militant black farmer's union is threatening farm occupations similar to those in Zimbabwe. Guard dogs …
    http://commoncrawl.org/the-data/get-started/
    https://gist.github.com/Smerity/e750f0ef0ab9aa366558#file-bbc-pretty-wat
  • 43. LGA | Metadata on relationships between URLs over time
    ID-Map
      url: https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en
      surt_url: com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw
      id: 294869
      example: {"url":"https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en","surt_url":"com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw","id":294869}
    ID-Graph
      timestamp: 20150209052911
      id: 294869
      outlink_ids: 31, 31366, 62596, 91594, 91595, …
      example: {"timestamp":"20150209052911","id":294869,"outlink_ids":[31,31366,62596,91594,91595,129599, …]}
    https://webarchive.jira.com/wiki/display/ARS/LGA+Overview+and+Technical+Details
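The two LGA files only become useful once joined: the ID-Map translates numeric ids to URLs, and the ID-Graph lists, per capture, which ids a page links to. A minimal sketch of that join follows, assuming one JSON object per line as in the examples on the slide. The second ID-Map entry is invented for illustration, and the outlink list is shortened; `load_id_map` and `resolve_edges` are hypothetical helper names.

```python
import json

ID_MAP_LINES = [
    # From the slide.
    '{"url":"https://www.youtube.com/watch?v=--FDzShdFjw&gl=US&hl=en",'
    '"surt_url":"com,youtube)/watch?gl=us&hl=en&v=--fdzshdfjw","id":294869}',
    # Hypothetical entry, added so that one outlink can be resolved.
    '{"url":"http://example.org/","surt_url":"org,example)/","id":31}',
]

ID_GRAPH_LINES = [
    # From the slide, with the outlink list shortened to two ids.
    '{"timestamp":"20150209052911","id":294869,"outlink_ids":[31,31366]}',
]


def load_id_map(lines):
    """Build {id: url} from ID-Map lines."""
    return {rec["id"]: rec["url"] for rec in map(json.loads, lines)}


def resolve_edges(graph_lines, id_map):
    """Yield (timestamp, source_url, target_url) for every outlink.

    Ids missing from the map resolve to None rather than raising.
    """
    for line in graph_lines:
        rec = json.loads(line)
        source = id_map.get(rec["id"])
        for target_id in rec["outlink_ids"]:
            yield rec["timestamp"], source, id_map.get(target_id)


id_map = load_id_map(ID_MAP_LINES)
edges = list(resolve_edges(ID_GRAPH_LINES, id_map))
for edge in edges:
    print(edge)
```

Holding the whole ID-Map in a dictionary is fine for a sample; for a full domain harvest a disk-backed key-value store would likely be needed.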
  • 46. WANE | Extracted named entities
      url: http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/?like_comment=79&_wpnonce=0fc57aa499&replytocom=93
      timestamp: 20141019212346
      named_entities:
        locations: North County, America, St. Louis County St. Louis County Police St. Louis County, WordPress.com, Middle East, …
        organizations: Twitter Facebook Google, Google, Facebook, Wal-Mart, CNN, Bearcats, …
        persons: Stell, Tom Jackson, Smith, Pamela Fillingim, Darren Wilson Eric Fowler Eric Vickers Ferguson Ferguson, Ferguson, …
      digest: sha1:747IKFWUCVQVXY7TX2NMYFL422T4TRQX
    Extracted with the Stanford Named Entity Recognizer (NER)
    http://nlp.stanford.edu/software/CRF-NER.shtml
    https://webarchive.jira.com/wiki/display/ARS/WANE+Overview+and+Technical+Details
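One simple use of WANE data is counting how often an entity appears across captures. The sketch below assumes one JSON object per line, with a `named_entities` object holding `locations`, `organizations` and `persons` lists, which is how the slide presents a record. The first sample record is abbreviated from the slide, the second is invented for illustration, and `count_entities` is a hypothetical helper name.

```python
import json
from collections import Counter

WANE_LINES = [
    # Abbreviated from the slide.
    json.dumps({
        "url": "http://dissonantwinstonsmith.wordpress.com/2014/08/24/im-sick-of/",
        "timestamp": "20141019212346",
        "named_entities": {
            "locations": ["North County", "America", "Middle East"],
            "organizations": ["Twitter Facebook Google", "Google", "Facebook",
                              "Wal-Mart", "CNN", "Bearcats"],
            "persons": ["Stell", "Tom Jackson", "Smith", "Pamela Fillingim"],
        },
        "digest": "sha1:747IKFWUCVQVXY7TX2NMYFL422T4TRQX",
    }),
    # Hypothetical second capture, added to make the counts non-trivial.
    json.dumps({
        "url": "http://example.org/",
        "timestamp": "20141020000000",
        "named_entities": {"locations": [], "organizations": ["Google", "CNN"], "persons": []},
    }),
]


def count_entities(lines, kind: str) -> Counter:
    """Count entities of one kind ('persons', 'organizations', 'locations')."""
    counts: Counter = Counter()
    for line in lines:
        record = json.loads(line)
        counts.update(record.get("named_entities", {}).get(kind, []))
    return counts


organizations = count_entities(WANE_LINES, "organizations")
print(organizations.most_common(3))
```

Entries such as "Twitter Facebook Google" show that the recognizer sometimes merges neighbouring names, so counts like these are best read as rough signals.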
  • 45. NameTag / CNEC 2.0 | WANE? http://ufal.mff.cuni.cz/nametag
 https://ufal.mff.cuni.cz/cnec/cnec2.0
  • 46.
  • 47. Open nsfw model: “This repo contains code for running Not Suitable for Work (NSFW) classification deep neural network Caffe models.” https://github.com/yahoo/open_nsfw/blob/master/
  • 48.
  • 50. NameTag / CNEC 2.0 | WANE? http://ufal.mff.cuni.cz/nametag
 https://ufal.mff.cuni.cz/cnec/cnec2.0
  • 51.
  • 52.
  • 53. How should the metadata be made available? bulk data; bulk data in S3; API; web service
  • 54. What can be done with the metadata? evolution of formats on the web; evolution of interlinking between websites; evolution of NSFW websites in the domain; evolution of the graphics-to-text ratio on the web; evolution of web technologies; …
  • 55. Web Archiving Department | ODIF | NK ČR. Head: Jaroslav Kvasnica. Curators: Marie Haškovcová, Monika Holoubková, Markéta Hrdličková. IT Operations: Rudolf.Kreibich@nkp.cz. webarchiv.cz facebook.com/webarchivcz slideshare.net/webarchivCZ github.com/webarchivcz