Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 1
Graph Structure in the Web
Revisited
Robert Meusel, Sebastiano Vigna,
Oliver Lehmberg, Christian Bizer
Textbook Knowledge about the Web Graph
Broder et al.: Graph structure in the Web. WWW2000.
used two AltaVista crawls (200 million pages, 1.5 billion links)
Results
• Power Laws
• Bow-Tie
This talk will:
1. Show that the textbook knowledge might be wrong or dependent on the crawling process.
2. Provide you with a large recent Web graph to do further research.
Outline
1. Public Web Crawls
2. The Web Data Commons Hyperlink Graph
3. Analysis of the Graph
1. In-degree & Out-degree Distributions
2. Node Centrality
3. Strong Components
4. Bow Tie
5. Reachability and Average Shortest Path
4. Conclusion
Public Web Crawls
1. AltaVista Crawl distributed by Yahoo! WebScope 2002
• Size: 1.4 billion pages
• Problem: Largest strongly connected component 4%
2. ClueWeb 2009
• Size: 1 billion pages
• Problem: Largest strongly connected component 3%
3. ClueWeb 2012
• Size: 733 million pages
• Largest strongly connected component 76%
• Problem: Only English pages
The Common Crawl
The Common Crawl Foundation
Regularly publishes Web crawls on Amazon S3.
Five crawls available so far:
Date          # Pages
2010          2.5 billion
Spring 2012   3.5 billion
Spring 2013   2.0 billion
Winter 2013   2.0 billion
Spring 2014   2.5 billion
Crawling Strategy (Spring 2012)
• breadth-first visiting strategy
• at least 71 million seeds from previous crawls and from Wikipedia
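A breadth-first visiting strategy, as named on this slide, can be sketched on a toy in-memory link map. This is only an illustration and not the Common Crawl crawler: the function name `breadth_first_crawl`, the `fetch_links` callback and the `max_pages` budget are assumptions made for the example.

```python
from collections import deque


def breadth_first_crawl(seeds, fetch_links, max_pages):
    """Visit pages level by level, starting from the seed URLs.

    fetch_links is a hypothetical callback that returns the outgoing
    links of a page; max_pages is an assumed crawl budget.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)                   # URLs already queued or visited
    frontier = deque(seeds)             # FIFO queue gives breadth-first order
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Toy web: page -> outgoing links (made-up data).
toy_web = {"s": ["a", "b"], "a": ["c"], "b": ["c", "s"], "c": []}
print(breadth_first_crawl(["s"], lambda u: toy_web.get(u, []), max_pages=3))
```

With a page budget, the set of pages that ends up in the crawl depends on the seeds and on where the budget cuts the frontier, which is one way a crawling process can shape the measured graph.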
Web Data Commons – Hyperlink Graph
extracted from the Spring 2012 version of the Common Crawl
size
3.5 billion nodes
128 billion arcs
pages originate from 43 million pay-level domains (PLDs)
• about 18% of the 240 million PLDs registered in 2012 *
world-wide coverage
* http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
Downloading the WDC Hyperlink Graph
http://webdatacommons.org/hyperlinkgraph/
4 aggregation levels:
Graph                       #Nodes         #Arcs            Size (zipped)
Page graph                  3.56 billion   128.73 billion   376 GB
Subdomain graph             101 million    2,043 million    10 GB
1st level subdomain graph   95 million     1,937 million    9.5 GB
PLD graph                   43 million     623 million      3.1 GB
Extraction code is published under Apache License
• Extraction costs per run: ~ 200 US$ in Amazon EC2 fees
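The idea behind the coarser aggregation levels can be sketched as collapsing page-level arcs onto a grouping key. This is a minimal illustration, not the published extraction code: `aggregate_arcs` and `host_of` are hypothetical names, grouping by full host name stands in for the real subdomain and PLD rules (PLD detection would need a public suffix list), and dropping arcs inside one group is an assumption.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Grouping key used here: the full host name of a URL."""
    return urlsplit(url).hostname or ""


def aggregate_arcs(page_arcs, key=host_of):
    """Collapse page-level arcs into arcs between groups.

    Returns a Counter mapping (source_group, target_group) to the number
    of page-level arcs merged into it. Arcs that stay inside one group
    are dropped, which is an assumption of this sketch.
    """
    merged = Counter()
    for source, target in page_arcs:
        a, b = key(source), key(target)
        if a != b:
            merged[(a, b)] += 1
    return merged


# Made-up page-level arcs.
pages = [
    ("http://a.example.org/1", "http://b.example.org/x"),
    ("http://a.example.org/2", "http://b.example.org/y"),
    ("http://a.example.org/1", "http://a.example.org/2"),
]
print(aggregate_arcs(pages))
```

Passing a different `key` function (for example one that maps a URL to its pay-level domain) would give the coarser levels in the same way.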
Analysis of the Graph
In-Degree Distribution
Broder et al. (2000)
Power law with exponent 2.1
WDC Hyperlink Graph (2012)
Best power law exponent 2.24
In-Degree Distribution
Power law fitted using the plfit tool (maximum likelihood fitting).
Starting degree: 1129
Best power law exponent: 2.24
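Maximum likelihood fitting of a power-law exponent can be sketched with the closed-form approximation for discrete data given by Clauset et al. This is not the plfit tool: the starting degree `xmin` is taken as given here rather than searched for, and `fit_exponent` and `sample_degrees` are hypothetical helper names. The sample is synthetic and only checks that the estimator recovers a known exponent.

```python
import math
import random


def fit_exponent(degrees, xmin):
    """Approximate maximum-likelihood exponent for degrees >= xmin.

    Uses alpha ~ 1 + n / sum(ln(d / (xmin - 0.5))), the discrete-data
    approximation from Clauset et al.; it is accurate for larger xmin.
    """
    tail = [d for d in degrees if d >= xmin]
    if not tail:
        raise ValueError("no observations at or above xmin")
    return 1.0 + len(tail) / sum(math.log(d / (xmin - 0.5)) for d in tail)


def sample_degrees(alpha, xmin, n, seed=0):
    """Draw n synthetic integer degrees >= xmin from a power-law tail.

    Rounds a continuous Pareto variable that starts at xmin - 0.5, which
    is an approximation used only to generate test data.
    """
    rng = random.Random(seed)
    return [
        math.floor((xmin - 0.5) * rng.paretovariate(alpha - 1.0) + 0.5)
        for _ in range(n)
    ]


synthetic = sample_degrees(alpha=2.24, xmin=1129, n=20000)
print(round(fit_exponent(synthetic, xmin=1129), 2))
```

A fitted exponent on its own says nothing about whether the power law is a good description of the data; that needs a separate goodness-of-fit test.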
Goodness of Fit Test
Method
• Clauset et al.:
Power-Law Distributions in Empirical Data. SIAM Review 2009.
• p-value < 0.1 → power law not a plausible hypothesis
Goodness of fit result
• p-value = 0
Conclusions:
• in-degree does not follow power law
• in-degree has non-fat heavy-tailed distribution
• maybe log-normal?
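The goodness-of-fit idea of Clauset et al. can be sketched as follows: measure the Kolmogorov-Smirnov distance between the data and the fitted power law, then see how often synthetic power-law samples fit worse than that. This is a simplified sketch under stated assumptions: it uses the continuous model, keeps `xmin` fixed instead of re-estimating it, and models only the tail. All function names are hypothetical.

```python
import math
import random


def fit_alpha(tail, xmin):
    """Continuous maximum-likelihood exponent for values >= xmin."""
    return 1.0 + len(tail) / sum(math.log(x / xmin) for x in tail)


def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical CDF and the power-law CDF."""
    ordered = sorted(tail)
    n = len(ordered)
    worst = 0.0
    for i, x in enumerate(ordered):
        model = 1.0 - (x / xmin) ** (1.0 - alpha)
        # The empirical CDF jumps from i/n to (i+1)/n at x.
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst


def bootstrap_p_value(tail, xmin, rounds=200, seed=0):
    """Share of synthetic power-law samples that fit worse than the data."""
    rng = random.Random(seed)
    alpha = fit_alpha(tail, xmin)
    observed = ks_distance(tail, xmin, alpha)
    worse = 0
    for _ in range(rounds):
        synth = [xmin * rng.paretovariate(alpha - 1.0) for _ in tail]
        if ks_distance(synth, xmin, fit_alpha(synth, xmin)) >= observed:
            worse += 1
    return worse / rounds


# Made-up data that is clearly not a power law: uniform on [1, 2).
rng = random.Random(1)
not_power_law = [1.0 + rng.random() for _ in range(500)]
print(bootstrap_p_value(not_power_law, xmin=1.0, rounds=50))
```

A p-value below the 0.1 threshold means that almost no genuine power-law sample fits as badly as the data does, so the power-law hypothesis is rejected.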
Out-Degree Distribution
Broder et al.: power law exponent 2.78
WDC: best power law exponent 2.77, p-value = 0
Node Centrality
http://wwwranking.webdatacommons.org
Average Degree
Broder et al. 2000: 7.5
WDC 2012: 36.8
→ Factor 4.9 larger
Possible explanation: HTML templates of CMS
Strongly Connected Components
Calculated using WebGraph framework on a machine with 1 TB RAM.
Largest SCC
Broder: 27.7%
WDC: 51.3%
→ Factor 1.8 larger
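What "largest SCC" measures can be shown on a toy graph. The sketch below is not the WebGraph framework; it is a small in-memory version of Kosaraju's two-pass algorithm, written iteratively to avoid recursion limits, and the function names are hypothetical.

```python
from collections import defaultdict


def strongly_connected_components(nodes, arcs):
    """Return the SCCs of a directed graph as lists of nodes."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)
        pred[v].append(u)
    # Make sure every arc endpoint is known, keeping a stable order.
    all_nodes = list(dict.fromkeys(list(nodes) + [x for a in arcs for x in a]))

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for root in all_nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(succ[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(succ[nxt])))
                    break
            else:
                order.append(node)  # all successors done
                stack.pop()

    # Pass 2: sweep the reversed graph in reverse completion order.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, todo = [root], [root]
        while todo:
            node = todo.pop()
            for prv in pred[node]:
                if prv not in assigned:
                    assigned.add(prv)
                    component.append(prv)
                    todo.append(prv)
        components.append(component)
    return components


def largest_scc_share(nodes, arcs):
    """Fraction of all nodes that lie in the largest SCC."""
    components = strongly_connected_components(nodes, arcs)
    total = sum(len(c) for c in components)
    return max(len(c) for c in components) / total if total else 0.0


toy_arcs = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "a")]
print(largest_scc_share("abcde", toy_arcs))
```

In the toy graph the cycle a, b, c forms the largest SCC, while d and e are each a component of their own.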
The Bow-Tie Structure of Broder et al. 2000
Balanced size of IN and OUT: 21%
Size of LSCC: 27%
The Bow-Tie Structure of the WDC Hyperlink Graph 2012
IN much larger than OUT: 31% vs. 6%
LSCC much larger: 51%
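Given the largest SCC, the IN and OUT parts of the bow-tie follow from two reachability sweeps: IN is everything that can reach the core, OUT is everything the core can reach. The sketch below assumes the core has already been computed and lumps tendrils, tubes and disconnected nodes together as OTHER; the function names are hypothetical.

```python
from collections import defaultdict, deque


def reachable(starts, neighbours):
    """All nodes reachable from starts by following neighbours (BFS)."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def bow_tie(nodes, arcs, core):
    """Split nodes into LSCC, IN, OUT and OTHER relative to a given core."""
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in arcs:
        succ[u].append(v)
        pred[v].append(u)
    core = set(core)
    out_part = reachable(core, succ) - core  # core can reach these
    in_part = reachable(core, pred) - core   # these can reach the core
    other = set(nodes) - core - in_part - out_part
    return {"LSCC": core, "IN": in_part, "OUT": out_part, "OTHER": other}


toy_arcs = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "a")]
parts = bow_tie("abcdef", toy_arcs, core={"a", "b", "c"})
print({name: sorted(members) for name, members in parts.items()})
```

Because IN and OUT are defined by what the crawl happened to discover around the core, their relative sizes are sensitive to seeds and crawl depth.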
Zhu et al. WWW2008
The Chinese web looks like a tea-pot.
Reachability and Average Shortest Path
Broder et al. 2000
Pairs of pages connected by path: 25%
Average shortest path: 16.12
WDC Webgraph 2012
Pairs of pages connected by path: 48%
Average shortest path: 12.84
Conclusions
1. Web has become denser and more connected
• Average degree has grown significantly in the last 13 years (factor 5)
• Connectivity between pairs of pages has doubled
2. Macroscopic structure
• There is a large SCC of growing size.
• The shape of the bow-tie seems to depend on the crawl.
3. In- and out-degree distributions do not follow power laws.
Questions?
Advertisement
WebDataCommons.org also offers:
1. Corpus of 17 billion RDFa, Microdata, Microformats statements
2. Corpus of 147 million relational HTML tables
OECD bibliometric indicators: Selected highlights, April 2024OECD bibliometric indicators: Selected highlights, April 2024
OECD bibliometric indicators: Selected highlights, April 2024
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
well logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptxwell logging & petrophysical analysis.pptx
well logging & petrophysical analysis.pptx
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 
Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024Vision and reflection on Mining Software Repositories research in 2024
Vision and reflection on Mining Software Repositories research in 2024
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
 
bonjourmadame.tumblr.com bhaskar's girls
bonjourmadame.tumblr.com bhaskar's girlsbonjourmadame.tumblr.com bhaskar's girls
bonjourmadame.tumblr.com bhaskar's girls
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdfPests of soyabean_Binomics_IdentificationDr.UPR.pdf
Pests of soyabean_Binomics_IdentificationDr.UPR.pdf
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 

Graph Structure in the Web - Revisited. WWW2014 Web Science Track

  • 1. Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014) – Slide 1 Graph Structure in the Web Revisited Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, Christian Bizer
  • 2. Textbook Knowledge about the Web Graph
      Broder et al.: Graph structure in the Web. WWW2000.
      Used two AltaVista crawls (200 million pages, 1.5 billion links).
      Results: power laws, bow-tie.
  • 3. This talk will:
      1. Show that the textbook knowledge might be wrong or dependent on the crawling process.
      2. Provide you with a large recent Web graph to do further research.
  • 4. Outline
      1. Public Web Crawls
      2. The Web Data Commons Hyperlink Graph
      3. Analysis of the Graph
         1. In-degree & Out-degree Distributions
         2. Node Centrality
         3. Strong Components
         4. Bow Tie
         5. Reachability and Average Shortest Path
      4. Conclusion
  • 5. Public Web Crawls
      1. AltaVista Crawl distributed by Yahoo! WebScope 2002
         • Size: 1.4 billion pages
         • Problem: Largest strongly connected component 4%
      2. ClueWeb 2009
         • Size: 1 billion pages
         • Problem: Largest strongly connected component 3%
      3. ClueWeb 2012
         • Size: 733 million pages
         • Largest strongly connected component 76%
         • Problem: Only English pages
  • 6. The Common Crawl
  • 7. The Common Crawl Foundation
      Regularly publishes Web crawls on Amazon S3.
      Five crawls available so far:
         Date          # Pages
         2010          2.5 billion
         Spring 2012   3.5 billion
         Spring 2013   2.0 billion
         Winter 2013   2.0 billion
         Spring 2014   2.5 billion
      Crawling Strategy (Spring 2012)
         • breadth-first visiting strategy
         • at least 71 million seeds from previous crawls and from Wikipedia
  • 8. Web Data Commons – Hyperlink Graph
      Extracted from the Spring 2012 version of the Common Crawl.
      Size: 3.5 billion nodes, 128 billion arcs.
      Pages originate from 43 million pay-level domains (PLDs).
         • 240 million PLDs were registered in 2012 * (coverage: 18%)
      World-wide coverage.
      * http://www.verisigninc.com/assets/domain-name-brief-oct2012.pdf
  • 9. Downloading the WDC Hyperlink Graph
      http://webdatacommons.org/hyperlinkgraph/
      4 aggregation levels:
         Graph                       #Nodes         #Arcs            Size (zipped)
         Page graph                  3.56 billion   128.73 billion   376 GB
         Subdomain graph             101 million    2,043 million    10 GB
         1st level subdomain graph   95 million     1,937 million    9.5 GB
         PLD graph                   43 million     623 million      3.1 GB
      Extraction code is published under the Apache License.
         • Extraction costs per run: ~ 200 US$ in Amazon EC2 fees
  • 10. Analysis of the Graph
  • 11. In-Degree Distribution
      Broder et al. (2000): Power law with exponent 2.1
      WDC Hyperlink Graph (2012): Best power law exponent 2.24
  • 12. In-Degree Distribution
      Power law fitted using the plfit tool (maximum likelihood fitting).
      Starting degree: 1129
      Best power law exponent: 2.24
  • 13. Goodness of Fit Test
      Method
         • Clauset et al.: Power-Law Distributions in Empirical Data. SIAM Review 2009.
         • p-value < 0.1 → power law not a plausible hypothesis
      Goodness of fit result
         • p-value = 0
      Conclusions:
         • in-degree does not follow a power law
         • in-degree has a non-fat heavy-tailed distribution
         • maybe log-normal?
  • 14. Out-Degree Distribution
      Broder et al.: Power law exponent 2.78
      WDC: Best power law exponent 2.77, p-value = 0
  • 15. Node Centrality
      http://wwwranking.webdatacommons.org
  • 16. Average Degree
      Broder et al. 2000: 7.5
      WDC 2012: 36.8 → Factor 4.9 larger
      Possible explanation: HTML templates of CMS
  • 17. Strongly Connected Components
      Calculated using the WebGraph framework on a machine with 1 TB RAM.
      Largest SCC
         Broder: 27.7%
         WDC: 51.3% → Factor 1.8 larger
  • 18. The Bow-Tie Structure of Broder et al. 2000
      Balanced size of IN and OUT: 21%
      Size of LSCC: 27%
  • 19. The Bow-Tie Structure of the WDC Hyperlink Graph 2012
      IN much larger than OUT: 31% vs. 6%
      LSCC much larger: 51%
  • 20. Zhu et al. WWW2008
      The Chinese web looks like a teapot.
  • 21. Reachability and Average Shortest Path
      Broder et al. 2000
         Pairs of pages connected by path: 25%
         Average shortest path: 16.12
      WDC Webgraph 2012
         Pairs of pages connected by path: 48%
         Average shortest path: 12.84
  • 22. Conclusions
      1. The Web has become denser and more connected.
         • Average degree has grown significantly in the last 13 years (factor 5).
         • Connectivity between pairs of pages has doubled.
      2. Macroscopic structure
         • There is a large SCC of growing size.
         • The shape of the bow-tie seems to depend on the crawl.
      3. In- and out-degree distributions do not follow power laws.
  • 23. Questions?
      Advertisement: WebDataCommons.org also offers:
      1. Corpus of 17 billion RDFa, Microdata, Microformats statements
      2. Corpus of 147 million relational HTML tables
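
The bow-tie figures quoted for Broder et al. and for the WDC graph split a crawl into its largest strongly connected component (LSCC), the pages that can reach it (IN), and the pages reachable from it (OUT). The sketch below illustrates that decomposition on a toy graph only. It is not the authors' pipeline (the slides state they used the WebGraph framework); the function names `reachable` and `bow_tie` are hypothetical, and Kosaraju's algorithm is an assumed stand-in for whatever SCC routine was actually used.

```python
from collections import deque


def reachable(adj, sources):
    """Return every node reachable from `sources` by following `adj` (BFS)."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def bow_tie(nodes, arcs):
    """Split a directed graph into LSCC, IN, OUT and OTHER.

    Assumption: `nodes` lists every endpoint that appears in `arcs`.
    """
    nodes = list(nodes)
    fwd = {u: [] for u in nodes}   # successors
    bwd = {u: [] for u in nodes}   # predecessors (transposed graph)
    for u, v in arcs:
        fwd[u].append(v)
        bwd[v].append(u)

    # Kosaraju, pass 1: record nodes in order of DFS completion (iterative).
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            u, successors = stack[-1]
            for v in successors:
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(fwd[v])))
                    break
            else:  # all successors explored: u is finished
                order.append(u)
                stack.pop()

    # Kosaraju, pass 2: sweep the transposed graph in reverse finish order;
    # each sweep collects exactly one strongly connected component.
    assigned, largest = set(), set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = set(), [root]
        while stack:
            u = stack.pop()
            component.add(u)
            for v in bwd[u]:
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        if len(component) > len(largest):
            largest = component

    in_part = reachable(bwd, largest) - largest    # can reach the LSCC
    out_part = reachable(fwd, largest) - largest   # reachable from the LSCC
    other = set(nodes) - largest - in_part - out_part  # tendrils, tubes, disconnected
    return {"LSCC": largest, "IN": in_part, "OUT": out_part, "OTHER": other}


# Toy graph: a 3-cycle core, one page linking in, one linked out, two others.
toy_nodes = ["a", "b", "c", "i", "o", "x", "z"]
toy_arcs = [("a", "b"), ("b", "c"), ("c", "a"), ("i", "a"), ("c", "o"), ("i", "x")]
parts = bow_tie(toy_nodes, toy_arcs)
shares = {name: len(members) / len(toy_nodes) for name, members in parts.items()}
```

Dividing each part's size by the node count, as in `shares`, gives percentages of the kind shown on the bow-tie slides. Everything here is held in Python dictionaries, so the sketch only suits small graphs; a graph with billions of nodes needs a compressed representation such as the one the slides mention.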