Slide 1
International Internet Preservation Consortium
General Assembly 2014, Paris
Mining a Large Web Corpus
Robert Meusel
Christian Bizer
Slide 2
The Common Crawl
Slide 3
Hyperlink Graphs
Knowledge about the structure of the Web can be used to
improve crawling strategies, to help SEO experts or to
understand social phenomena.
Slide 4
HTML-embedded Data on the Web
Several million websites semantically mark up the content of
their HTML pages.
Markup Syntaxes
• Microformats
• RDFa
• Microdata
Data snippets
within info boxes
Slide 5
Relational HTML Tables
HTML tables offer semi-structured data which can be used to
build up or extend knowledge bases such as DBpedia.
• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
• In a corpus of 14B raw tables, 154M are "good" relations (1.1%)
Slide 6
The Web Data Commons Project
• Has developed an Amazon-based framework for extracting data from large web crawls
  • Capable of running on any cloud infrastructure
• Has applied this framework to the Common Crawl data
  • Adaptable to other crawls
• Results and framework are publicly available
  • http://webdatacommons.org
Goal: Offer an easy-to-use, cost-efficient, distributed
extraction framework for large web crawls, as well as
datasets extracted from the crawls.
Slide 7
Extraction Framework
[Diagram: Master, AWS SQS queue, AWS EC2 instances, AWS S3; the legend distinguishes automated and manual steps]
1: Fill queue
2: Launch instances
3: Request file-reference
4: Download file
5: Extract & Upload
6: Collect results
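The six steps of the diagram can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory `queue.Queue` stands in for AWS SQS, a plain dictionary stands in for AWS S3, and the function names (`fill_queue`, `run_worker`, `collect_results`) are hypothetical, not taken from the WDC code base, which is written in Java.

```python
import queue


def fill_queue(file_refs):
    """Step 1: the master puts one message per crawl file into the queue."""
    q = queue.Queue()
    for ref in file_refs:
        q.put(ref)
    return q


def run_worker(q, storage, extract):
    """Steps 3-5: request a file reference, download, extract, upload.

    `storage` is a dict standing in for an S3 bucket (assumption).
    Returns the number of files this worker processed.
    """
    processed = 0
    while True:
        try:
            ref = q.get_nowait()               # step 3: request file reference
        except queue.Empty:
            return processed                   # queue drained, worker stops
        raw = storage[ref]                     # step 4: download file
        storage["out/" + ref] = extract(raw)   # step 5: extract and upload
        processed += 1


def collect_results(storage):
    """Step 6: gather every uploaded output file."""
    return {key: value for key, value in storage.items() if key.startswith("out/")}
```

Step 2 (launching instances) would correspond to starting several `run_worker` calls in parallel. Because each worker only pulls file references from the shared queue, adding workers needs no further coordination.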
Slide 8
Extraction Worker
[Diagram: AWS S3 → download file → WDC Extractor (.(w)arc → Filter → Worker → output) → upload output file → AWS S3]
Worker:
• Written in Java
• Processes one page at a time
• Independent of other files and workers
Filter:
• Reduces runtime
• MIME-type filter
• Regex detection of content or meta-information
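A pre-filter of the kind described on this slide might look as follows. This is a minimal sketch, not the WDC filter itself: the accepted MIME types and the regular expression (a few attribute names typical of Microdata, RDFa and Microformats) are assumptions chosen for illustration.

```python
import re

# Assumption: only HTML-like responses are worth parsing.
HTML_MIME_TYPES = {"text/html", "application/xhtml+xml"}

# Assumption: cheap textual hints that a page may carry embedded markup.
MARKUP_HINT = re.compile(
    r'itemscope|itemprop=|property=|typeof=|class="[^"]*\bvcard\b',
    re.IGNORECASE,
)


def passes_filter(mime_type, body):
    """Return True if the page should be handed to the (expensive) worker."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    base_type = mime_type.split(";", 1)[0].strip().lower()
    if base_type not in HTML_MIME_TYPES:
        return False
    # A single regex scan is far cheaper than a full parse.
    return MARKUP_HINT.search(body) is not None
```

Pages that fail either check are skipped before any parsing happens, which is how a filter of this kind reduces runtime.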
Slide 9
Web Data Commons – Extraction Framework
• Written in Java
• Mainly tailored for Amazon Web Services
• Fault tolerant and cheap
  • 300 USD to extract 17 billion RDF statements from 44 TB
• Easily customizable
  • Only the worker has to be adapted
  • The worker is a single process method that processes one file at a time
  • Scaling is automated by the framework
• Access Open Source Code:
  • https://www.assembla.com/code/commondata/
Alternative: Hadoop version, which can run on any Hadoop
cluster without Amazon Web Services.
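To put the cost figure from this slide into per-unit terms, the arithmetic can be written out as below. The three input numbers come from the slide; the variable names are illustrative only.

```python
total_cost_usd = 300     # reported extraction cost
corpus_size_tb = 44      # size of the processed crawl in terabytes
rdf_statements = 17e9    # extracted RDF statements

# Cost per terabyte of crawl data processed.
cost_per_tb = total_cost_usd / corpus_size_tb

# Cost per billion extracted RDF statements.
cost_per_billion = total_cost_usd / (rdf_statements / 1e9)

print(f"{cost_per_tb:.2f} USD per TB")
print(f"{cost_per_billion:.2f} USD per billion statements")
```

This works out to roughly 7 USD per terabyte processed and under 18 USD per billion extracted statements.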
Slide 10
Extracted Datasets
• Hyperlink Graph
• HTML-embedded Data
• Relational HTML Tables
Slide 11
Hyperlink Graph
• Extracted from the Common Crawl 2012 dataset
• Over 3.5 billion pages connected by over 128 billion links
• Graph files: 386 GB
http://webdatacommons.org/hyperlinkgraph/
http://wwwranking.webdatacommons.org/
Slide 12
Hyperlink Graph
• Degrees do not follow a power law
• Detection of spam pages
• Further insights:
  • WWW'14: Graph Structure in the Web – Revisited (Meusel et al.)
  • WebSci'14: The Graph Structure of the Web aggregated by Pay-Level Domain (Lehmberg et al.)
Discovery of evolutions in the global structure of the World
Wide Web.
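A statement about the degree distribution rests on counting, for each degree value, how many pages have that degree. The sketch below shows this for in-degrees on a toy edge list. It is an illustration only: the function name is hypothetical, and the published graph is far too large for an in-memory approach like this one.

```python
from collections import Counter


def in_degree_distribution(edges):
    """Map each in-degree value to the number of nodes having it.

    `edges` is an iterable of (source, target) pairs.
    """
    edges = list(edges)
    # Count incoming links per node.
    in_degree = Counter(target for _, target in edges)
    # Nodes that only appear as sources have in-degree 0.
    nodes = {node for edge in edges for node in edge}
    return Counter(in_degree[node] for node in nodes)
```

Plotting the resulting counts against the degree values on log-log axes is the usual first check: a power law would appear as an approximately straight line.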
Slide 13
Hyperlink Graph
Discovery of important and interesting sites using different
popularity rankings or website categorization libraries
Websites connected by at least ½ Million Links
Slide 14
HTML-embedded Data
More and more websites semantically
mark up the content of their HTML pages.
Markup Syntaxes
RDFa
Microformats
Microdata
Slide 15
Websites containing Structured Data (2013)
1.8 million websites (PLDs) out of 12.8 million
provide Microformat, Microdata or RDFa data (13.9%)
585 million of the 2.2 billion pages contain
Microformat, Microdata or RDFa data (26.3%).
Web Data Commons - Microformat, Microdata, RDFa Corpus
• 17 billion RDF triples from Common Crawl 2013
• Next release will be in winter 2014
http://webdatacommons.org/structureddata/
Slide 16
Top Classes Microdata (2013)
• schema = Schema.org
• dv = Google's Rich Snippet Vocabulary
Slide 17
HTML Tables
• Cafarella, et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
• Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.
In a corpus of 14B raw tables, 154M are "good" relations (1.1%).
Cafarella (2008)
Classification Precision: 70-80%
Slide 18
WDC - Web Tables Corpus
• Large corpus of relational Web tables for public download
• Extracted from Common Crawl 2012 (3.3 billion pages)
• 147 million relational tables
  • selected out of 11.2 B raw tables (1.3%)
  • download includes the HTML pages of the tables (1 TB zipped)
• Table Statistics
  • Heterogeneity: Very high.

            Min  Max     Average  Median
Attributes  2    2,368   3.49     3
Data Rows   1    70,068  12.41    6

http://webdatacommons.org/webtables/
Slide 19
WDC - Web Tables Corpus
• Attribute Statistics
  28,000,000 different attribute labels

Attribute     #Tables
name          4,600,000
price         3,700,000
date          2,700,000
artist        2,100,000
location      1,200,000
year          1,000,000
manufacturer  375,000
country       340,000
isbn          99,000
area          95,000
population    86,000

• Subject Attribute Values
  1.74 billion rows
  253,000,000 different subject labels

Value             #Rows
usa               135,000
germany           91,000
greece            42,000
new york          59,000
london            37,000
athens            11,000
david beckham     3,000
ronaldinho        1,200
oliver kahn       710
twist shout       2,000
yellow submarine  1,400
Slide 20
Conclusion
Three factors are necessary to work with web-scale data:
Availability of crawls
• Thanks to Common Crawl, this data is available
Availability of cheap, easy-to-use infrastructures
• Like Amazon or other on-demand cloud services
Easy-to-adopt scalable extraction frameworks
• The Web Data Commons Framework, or standard tools like Pig
• Cost has to be evaluated on a per-task basis, but the WDC framework has turned out to be cheaper
Slide 21
Questions
• Please visit our website: www.webdatacommons.org
• Data and framework are available as free download
• Web Data Commons is supported by:
Mining a Large Web Corpus

  • 1. Slide 1 International Internet Preservation Consortium, General Assembly 2014, Paris. Mining a Large Web Corpus. Robert Meusel, Christian Bizer
  • 3. Slide 3 Hyperlink Graphs: Knowledge about the structure of the Web can be used to improve crawling strategies, to help SEO experts, or to understand social phenomena.
  • 4. Slide 4 HTML-embedded Data on the Web: Several million websites semantically mark up the content of their HTML pages. Markup syntaxes: Microformats, RDFa, Microdata. Data snippets within info boxes.
  • 5. Slide 5 Relational HTML Tables: HTML tables contain semi-structured data that can be used to build up or extend knowledge bases such as DBpedia. In a corpus of 14B raw tables, 154M are "good" relations (1.1%). Source: Cafarella et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008.
  • 6. Slide 6 The Web Data Commons Project: Has developed an Amazon-based framework for extracting data from large web crawls (capable of running on any cloud infrastructure). Has applied this framework to the Common Crawl data (adaptable to other crawls). Results and framework are publicly available at http://webdatacommons.org. Goal: Offer an easy-to-use, cost-efficient, distributed extraction framework for large web crawls, as well as datasets extracted from the crawls.
  • 7. Slide 7 Extraction Framework (architecture diagram with a master, AWS SQS, several AWS EC2 instances, and AWS S3). Steps: 1: Fill queue. 2: Launch instances. 3: Request file reference. 4: Download file. 5: Extract and upload. 6: Collect results. The diagram legend marks steps as either automated or manual.
  • 8. Slide 8 Extraction Worker (diagram: a .(w)arc file is downloaded from AWS S3, passes through the filter and the WDC Extractor inside the worker, and the output file is uploaded to AWS S3). Worker: written in Java; processes one page at a time; independent of other files and workers. Filter: reduces runtime; MIME-type filter; regex detection of content or meta-information.
  • 9. Slide 9 Web Data Commons – Extraction Framework: Written in Java. Mainly tailored for Amazon Web Services. Fault tolerant and cheap: 300 USD to extract 17 billion RDF statements from 44 TB. Easily customizable: only the worker has to be adapted; the worker is a single process method that handles one file at a time; scaling is automated by the framework. Open source code: https://www.assembla.com/code/commondata/ Alternative: a Hadoop version, which can run on any Hadoop cluster without Amazon Web Services.
  • 10. Slide 10 Extracted Datasets: Hyperlink Graph, HTML-embedded Data, Relational HTML Tables
  • 11. Slide 11 Hyperlink Graph: Extracted from the Common Crawl 2012 dataset. Over 3.5 billion pages connected by over 128 billion links. Graph files: 386 GB. http://webdatacommons.org/hyperlinkgraph/ http://wwwranking.webdatacommons.org/
  • 12. Slide 12 Hyperlink Graph: Discovery of evolutions in the global structure of the World Wide Web. Degrees do not follow a power law. Detection of spam pages. Further insights: WWW'14: Graph Structure in the Web – Revisited (Meusel et al.); WebSci'14: The Graph Structure of the Web aggregated by Pay-Level Domain (Lehmberg et al.).
  • 13. Slide 13 Hyperlink Graph: Discovery of important and interesting sites using different popularity rankings or website categorization libraries. Figure: websites connected by at least half a million links.
  • 14. Slide 14 HTML-embedded Data: More and more websites semantically mark up the content of their HTML pages. Markup syntaxes: RDFa, Microformats, Microdata.
  • 15. Slide 15 Websites containing Structured Data (2013): 1.8 million websites (PLDs) out of 12.8 million provide Microformat, Microdata or RDFa data (13.9%). 585 million of the 2.2 billion pages contain Microformat, Microdata or RDFa data (26.3%). Web Data Commons - Microformat, Microdata, RDFa Corpus: 17 billion RDF triples from Common Crawl 2013. Next release will be in winter 2014. http://webdatacommons.org/structureddata/
  • 16. Slide 16 Top Classes Microdata (2013). Chart legend: schema = Schema.org; dv = Google's Rich Snippet Vocabulary.
  • 17. Slide 17 HTML Tables: In a corpus of 14B raw tables, 154M are "good" relations (1.1%), Cafarella (2008). Classification precision: 70-80%. Sources: Cafarella et al.: WebTables: Exploring the Power of Tables on the Web. VLDB 2008. Crestan, Pantel: Web-Scale Table Census and Classification. WSDM 2011.
  • 18. Slide 18 WDC - Web Tables Corpus: Large corpus of relational Web tables for public download. Extracted from Common Crawl 2012 (3.3 billion pages). 147 million relational tables, selected out of 11.2B raw tables (1.3%). The download includes the HTML pages of the tables (1 TB zipped). Heterogeneity: very high. Table statistics: Attributes: min 2, max 2,368, average 3.49, median 3. Data rows: min 1, max 70,068, average 12.41, median 6. http://webdatacommons.org/webtables/
  • 19. Slide 19 WDC - Web Tables Corpus. Attribute statistics: 28,000,000 different attribute labels. Attribute (#tables): name 4,600,000; price 3,700,000; date 2,700,000; artist 2,100,000; location 1,200,000; year 1,000,000; manufacturer 375,000; country 340,000; isbn 99,000; area 95,000; population 86,000. Subject attribute values: 1.74 billion rows, 253,000,000 different subject labels. Value (#rows): usa 135,000; germany 91,000; greece 42,000; new york 59,000; london 37,000; athens 11,000; david beckham 3,000; ronaldinho 1,200; oliver kahn 710; twist shout 2,000; yellow submarine 1,400.
  • 20. Slide 20 Conclusion: Three factors are necessary to work with web-scale data. (1) Availability of crawls: thanks to Common Crawl, this data is available. (2) Availability of cheap, easy-to-use infrastructures, such as Amazon or other on-demand cloud services. (3) Easy-to-adopt scalable extraction frameworks: the Web Data Commons framework, or standard tools like Pig. Costs have to be evaluated per task, but the WDC framework has turned out to be cheaper.
  • 21. Slide 21 Questions: Please visit our website: www.webdatacommons.org. Data and framework are available as free download. Web Data Commons is supported by: (sponsor logos not included in the transcript)
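
The queue-driven flow from Slides 7 and 8 (fill a queue with file references, let independent workers request a reference, download the file, filter and extract, then upload the output) can be illustrated with a small sketch. This is not the WDC framework's code, which is written in Java and runs on AWS. Everything here is an assumption made for illustration: the in-memory dictionaries standing in for S3, the `queue.Queue` standing in for SQS, the function names, and the deliberately crude regex used as a markup hint.

```python
import queue
import re

# Hypothetical in-memory stand-in for the input storage (S3 in the slides).
# Each "file" is a list of crawled pages.
INPUT_STORE = {
    "crawl/file-1": [
        {"url": "http://a.example/", "mime": "text/html",
         "body": '<div itemscope itemtype="http://schema.org/Product"></div>'},
        {"url": "http://a.example/logo.png", "mime": "image/png", "body": ""},
    ],
    "crawl/file-2": [
        {"url": "http://b.example/", "mime": "text/html",
         "body": "<p>plain page</p>"},
    ],
}

# Hypothetical stand-in for the output storage.
OUTPUT_STORE = {}

# Assumed, very rough hint that a page may carry Microdata, RDFa or
# Microformats markup. A real filter would be more careful.
MARKUP_HINT = re.compile(r'itemscope|typeof=|class="vcard"')


def passes_filter(page):
    """Cheap pre-check: MIME type first, then a regex over the content."""
    return page["mime"] == "text/html" and MARKUP_HINT.search(page["body"]) is not None


def process_file(file_ref):
    """Handle one file independently: 'download', filter, 'extract', 'upload'."""
    pages = INPUT_STORE[file_ref]
    # The "extraction" here only records which pages passed the filter.
    hits = [page["url"] for page in pages if passes_filter(page)]
    OUTPUT_STORE[file_ref + ".out"] = hits


def fill_queue(file_refs):
    """Step 1: put one message per input file on the work queue."""
    work = queue.Queue()
    for ref in file_refs:
        work.put(ref)
    return work


def run_worker(work):
    """Steps 3 to 5: take file references until the queue is empty."""
    processed = 0
    while True:
        try:
            ref = work.get_nowait()
        except queue.Empty:
            return processed
        process_file(ref)
        processed += 1


if __name__ == "__main__":
    done = run_worker(fill_queue(sorted(INPUT_STORE)))
    print(done, "files processed")
    print(OUTPUT_STORE)
```

Because each file is processed without reference to any other file, several such workers could drain the same queue in parallel, which is the property the slides rely on for scaling.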