Local Search
(Including Importance Metrics and Link Merging)
Everything you wanted to know
about Crawling*
*But Didn't Know Where to Ask
Agile SEO Meetup – South Jersey
Monday, September 10, 2012
7:00 PM to 9:00 PM
Bill Slawski
Webimax
@bill_slawski
In the Early Days of the Web,
there was an invasion
Robots
Spiders
Via Thomas Shahan - http://www.flickr.com/photos/opoterser/
Crawlers
Invaded pages across the World Wide Web
The Robots Mailing List was formed to solve the problem!
Led by a young Martijn Koster, they developed the Robots.txt protocol
Which Asked Robots to be Polite
And Not Melt Down Internet Servers
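As a rough illustration of what "polite" means in practice, here is a minimal sketch of a robot checking robots.txt rules before it fetches a page. It uses Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` name, and the paths are hypothetical, and `Crawl-delay` is a later, non-standard extension that only some crawlers honor.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; paths and delay are made up for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# A polite robot asks before it fetches each URL.
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.htm")
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/private/report.htm")

# Seconds to wait between requests, if the site asked for a delay.
delay = parser.crawl_delay("ExampleBot")
```

A robot following these rules would fetch the first URL, skip the second, and pause between requests.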
A student at Stanford named Lawrence Page went on
to co-author a paper on how robots might crawl web
pages so that important pages are indexed first.
http://ilpubs.stanford.edu:8090/347/1/1998-51.pdf
<<Insert Subliminal Advertisement Here>>
Important Web Pages
1. Contain words similar to a query that starts the crawl
2. Have a high backlink count
3. Have a high PageRank
4. Have a high forward link count
5. Are in or are close to the root directory for sites
Image via Fir0002/Flagstaffotos under http://commons.wikimedia.org/wiki/Commons:GNU_Free_Documentation_License_1.2
So most crawlers will not only be
Polite, but they will also hunt down
important pages first
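One way to picture "important pages first" is a crawl frontier kept as a priority queue. The sketch below orders URLs by backlink count only, which is just one of the importance metrics listed earlier. The URLs and counts are hypothetical, and a real crawler would likely combine several metrics.

```python
import heapq
from typing import Dict, List

def crawl_order(backlink_counts: Dict[str, int]) -> List[str]:
    """Return URLs in the order a frontier would pop them: most backlinks first."""
    # heapq is a min-heap, so the count is negated to pop the largest first.
    heap = [(-count, url) for url, count in backlink_counts.items()]
    heapq.heapify(heap)
    order = []
    while heap:
        _, url = heapq.heappop(heap)
        order.append(url)
    return order

# Hypothetical backlink counts for three known URLs.
known = {
    "http://www.example.com/news/2012/09/item.htm": 2,
    "http://www.example.com/": 120,
    "http://www.example.com/about.htm": 8,
}
order = crawl_order(known)
```

With these made-up counts, the home page is crawled first and the deep news item last.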
Search Engines filed patents on how they might crawl
and collect content found on Web pages, including collecting
URLs and Anchor Text associated with them.
<a href="http://www.hungryrobots.com">Feed Me</a>
http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7308643
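Collecting a URL together with its anchor text can be sketched with Python's standard `html.parser`. This is an illustrative toy, not the method described in the patent. The `AnchorCollector` class name is hypothetical.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, anchor text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

collector = AnchorCollector()
collector.feed('<p>Lunch: <a href="http://www.hungryrobots.com">Feed Me</a></p>')
links = collector.links
```

Here `links` holds the URL paired with the words "Feed Me", which is the kind of pair a search engine could store alongside the target page.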
Also, in one embodiment, the robots are configured to not follow "permanent
redirects". Thus, when a robot encounters a URL that is permanently redirected
to another URL, the robot does not automatically retrieve the document at the
target address of the permanent redirect.
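One possible reading of that embodiment, sketched below: when a fetch returns a permanent redirect, the robot notes the target instead of requesting it right away. What happens to the noted target afterwards is an assumption here, not something the quoted passage states. The `handle_fetch` function and the `deferred` list are hypothetical names.

```python
from typing import List, Optional, Tuple

PERMANENT_REDIRECTS = (301, 308)  # HTTP status codes for permanent redirects

def handle_fetch(url: str, status: int, location: Optional[str], body: str,
                 deferred: List[Tuple[str, str]]) -> Optional[str]:
    """Return the body to index, or None if nothing is indexed for this fetch."""
    if status in PERMANENT_REDIRECTS and location:
        # Do not follow automatically; just remember source and target.
        deferred.append((url, location))
        return None
    if status == 200:
        return body
    return None

deferred: List[Tuple[str, str]] = []
redirected = handle_fetch("http://example.com/old.htm", 301,
                          "http://example.com/new.htm", "", deferred)
fetched = handle_fetch("http://example.com/new.htm", 200, None,
                       "<html>new</html>", deferred)
```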
“Use a text browser such as Lynx to examine your site,
because most search engine spiders see your site much as
Lynx would. If fancy features such as JavaScript, cookies,
session IDs, frames, DHTML, or Flash keep you from
seeing all of your site in a text browser, then search engine
spiders may have trouble crawling your site.”*
*Google Webmaster Guidelines - http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769
Google’s Webmaster Guidelines make crawlers look pretty
unsophisticated, and incapable of much more than the simple
Lynx browser…
…But we have signs that crawlers can be smarter than that,
and Microsoft introduced a Vision-based Page Segmentation
Algorithm in 2003. Both Google and Yahoo have also published
patents and papers that describe smarter crawlers. IBM filed a patent
for a crawler in 2000 that is smarter than most browsers today.
VIPS: a Vision-based Page Segmentation Algorithm - http://research.microsoft.com/apps/pubs/default.aspx?id=70027
http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7519902
Link Merging
Web Site Structure Analysis - http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7861151
•S-nodes – Structural link blocks: organizational and navigational link blocks,
repeated across pages with the same layout and showing the organization of the site.
They are often lists of links that don’t usually contain other content elements such as text.
•C-nodes – Content link blocks, grouped together by some kind of content association,
such as relating to the same topic or sub-topic. These blocks usually point to information
resources and aren’t likely to be repeated across more than one page.
•I-nodes – Isolated links, which are links on a page that aren’t part of a link group
and may be only loosely related to each other, for example by
appearing together within the same paragraph of text. Each link on a page might be
considered an individual i-node, or they might be grouped together by page as an i-node.
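The three node types can be illustrated with a toy heuristic based only on the descriptions above: blocks repeated across pages are treated as S-nodes, multi-link blocks seen on one page as C-nodes, and single links as I-nodes. This is a simplification for illustration, not the analysis described in the patent. The data layout and the `classify_blocks` name are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple

Block = Tuple[str, ...]  # a link block, represented as a tuple of link targets

def classify_blocks(pages: Dict[str, List[Block]]) -> Dict[Block, str]:
    """Label each link block as S-node, C-node, or I-node (toy heuristic)."""
    # Count on how many distinct pages each block appears.
    seen_on = Counter(block for blocks in pages.values() for block in set(blocks))
    labels = {}
    for block, page_count in seen_on.items():
        if len(block) == 1:
            labels[block] = "I-node"   # isolated link, not part of a group
        elif page_count > 1:
            labels[block] = "S-node"   # repeated navigation-style block
        else:
            labels[block] = "C-node"   # content block unique to one page
    return labels

# Hypothetical two-page site.
nav = ("/", "/products", "/about")
related = ("/crawling-101", "/crawling-102")
lone = ("/press-release",)
pages = {"/a": [nav, related, lone], "/b": [nav]}
labels = classify_blocks(pages)
```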
Crawling and Self Help
Canonical = Best!
There can be only one:
http://example.com
http://www.example.com
http://example.com/
http://www.example.com/
https://example.com
https://www.example.com
https://example.com/
https://www.example.com/
http://example.com/index.htm
http://www.example.com/index.htm
https://example.com/index.htm
https://www.example.com/index.htm
http://example.com/INDEX.htm
http://www.example.com/INDEX.htm
https://example.com/INDEX.htm
https://www.example.com/INDEX.htm
http://example.com/Index.htm
http://www.example.com/Index.htm
https://example.com/Index.htm
https://www.example.com/Index.htm
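To show how those twenty variants can collapse into one, here is a small normalization sketch. The preferred form (`https`, `www`, trailing slash, no `index.htm`) is an assumption chosen for the example, and treating `INDEX.htm` and `Index.htm` as the same page assumes the server is case-insensitive for that file. A site owner would normally enforce the choice with redirects or a canonical link element rather than leave it to the crawler.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Map home page URL variants onto one assumed preferred form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if not host.startswith("www."):
        host = "www." + host          # assumed preference: www host
    path = parts.path or "/"           # empty path and "/" are the same resource
    if path.lower() == "/index.htm":
        path = "/"                     # assumed: index file is the home page
    return urlunsplit(("https", host, path, parts.query, ""))

# Rebuild the twenty variants from the list above.
variants = [
    f"{scheme}://{host}{path}"
    for scheme in ("http", "https")
    for host in ("example.com", "www.example.com")
    for path in ("", "/", "/index.htm", "/INDEX.htm", "/Index.htm")
]
canonical_forms = {canonicalize(u) for u in variants}
```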
Canonical Link Element
<link rel="canonical" href="http://example.com/page.html"/>
Rel=“prev” & rel=“next”
On the first page, http://www.example.com/article?story=abc&page=1:
<link rel="next" href="http://www.example.com/article?story=abc&page=2" />
On the second page, http://www.example.com/article?story=abc&page=2:
<link rel="prev" href="http://www.example.com/article?story=abc&page=1" />
<link rel="next" href="http://www.example.com/article?story=abc&page=3" />
On the third page, http://www.example.com/article?story=abc&page=3:
<link rel="prev" href="http://www.example.com/article?story=abc&page=2" />
<link rel="next" href="http://www.example.com/article?story=abc&page=4" />
And on the last page, http://www.example.com/article?story=abc&page=4:
<link rel="prev" href="http://www.example.com/article?story=abc&page=3" />
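The pattern above is mechanical enough to generate. A minimal sketch, assuming the page number is carried in a `page` query parameter as in the example URLs. The `pagination_links` function name is hypothetical.

```python
from typing import List

def pagination_links(base_url: str, page: int, last_page: int) -> List[str]:
    """Build the rel="prev" / rel="next" link elements for one page of a series."""
    links = []
    if page > 1:
        # Every page except the first points back to its predecessor.
        links.append(f'<link rel="prev" href="{base_url}&page={page - 1}" />')
    if page < last_page:
        # Every page except the last points forward to its successor.
        links.append(f'<link rel="next" href="{base_url}&page={page + 1}" />')
    return links

BASE = "http://www.example.com/article?story=abc"
first = pagination_links(BASE, 1, 4)
second = pagination_links(BASE, 2, 4)
last = pagination_links(BASE, 4, 4)
```

The first page gets only a `next` link, the last page only a `prev` link, and every page in between gets both.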
Paginated Product Pages
Paginated Article Pages
View All Pages
Option 1
• Normal Prev/Next sequence
• Self-Referential Canonicals (point to their own URL)
• Noindex meta element on View All page
Option 2
• Normal Prev/Next Sequence
• Canonicals (all pages use the view-all page URL)
http://googlewebmastercentral.blogspot.com/2011/09/view-all-in-search-results.html
Rel=“hreflang”
HTML link element.
In the HTML <head> section of http://www.example.com/, add
a link element pointing to the Spanish version of that webpage at
http://es.example.com/, like this:
<link rel="alternate" hreflang="es" href="http://es.example.com/" />
HTTP header.
If you publish non-HTML files (like PDFs), you can use an
HTTP header to indicate a different language version of a URL:
Link: <http://es.example.com/>; rel="alternate"; hreflang="es"
Sitemap.
Instead of using markup, you can submit language version
information in a Sitemap.
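The HTML link element and the HTTP header carry the same two pieces of information, a language code and a URL. A small sketch that formats both from one pair of values. The function names are hypothetical.

```python
def hreflang_link_element(lang: str, href: str) -> str:
    """Format the HTML <head> variant of an alternate-language annotation."""
    return f'<link rel="alternate" hreflang="{lang}" href="{href}" />'

def hreflang_http_header(lang: str, href: str) -> str:
    """Format the HTTP header variant, usable for non-HTML files such as PDFs."""
    return f'Link: <{href}>; rel="alternate"; hreflang="{lang}"'

element = hreflang_link_element("es", "http://es.example.com/")
header = hreflang_http_header("es", "http://es.example.com/")
```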
Rel=“hreflang” XML Sitemap
<?xml version="1.0" encoding="UTF-8"?>
<urlset
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>http://www.example.com/english/</loc>
    <xhtml:link
      rel="alternate"
      hreflang="de"
      href="http://www.example.com/deutsch/"
    />
    <xhtml:link
      rel="alternate"
      hreflang="de-ch"
      href="http://www.example.com/schweiz-deutsch/"
    />
    <xhtml:link
      rel="alternate"
      hreflang="en"
      href="http://www.example.com/english/"
    />
  </url>
</urlset>
XML Sitemap
•Use Canonical links
•Remove 404s
•Don’t set priority past 1 week
•If more than 50,000 URLs, use multiple Sitemaps
and a Sitemap index file
•Validate with an XML Sitemap Validator
•Include a Sitemap statement in robots.txt
http://www.sitemaps.org/
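The 50,000-URL rule can be sketched as a simple chunking step followed by a Sitemap index that lists the resulting files. The file names and the 120,000 generated URLs are hypothetical. The `sitemapindex` element and namespace follow the sitemaps.org protocol.

```python
from typing import List
from xml.sax.saxutils import escape

MAX_URLS_PER_SITEMAP = 50_000  # per-file limit mentioned in the list above

def split_into_sitemaps(urls: List[str], limit: int = MAX_URLS_PER_SITEMAP) -> List[List[str]]:
    """Split a URL list into chunks that each fit in one Sitemap file."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]

def sitemap_index(sitemap_urls: List[str]) -> str:
    """Build a Sitemap index document pointing at each Sitemap file."""
    entries = "\n".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>" for u in sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</sitemapindex>"
    )

# Hypothetical site with 120,000 URLs.
urls = [f"http://www.example.com/page{i}.htm" for i in range(120_000)]
chunks = split_into_sitemaps(urls)
index = sitemap_index(
    [f"http://www.example.com/sitemap{n}.xml" for n in range(1, len(chunks) + 1)]
)
```

The index file itself can then be announced in robots.txt with a line such as `Sitemap: http://www.example.com/sitemap_index.xml` (again a hypothetical file name).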
Next, we study which of the two crawl systems, Sitemaps and Discovery,
sees URLs first. We conduct this test over a dataset consisting of over five
billion URLs that were seen by both systems.
According to the most recent statistics at the time of the writing,
78% of these URLs were seen by Sitemaps first, compared to
22% that were seen through Discovery first.
Crawling vs. XML
Sitemaps: Above and Beyond the Crawl of Duty –
http://www.shuri.org/publications/www2009_sitemaps.pdf
Crawling Social Media
Ranking of Search Results based on Microblog data - http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220110246457%22.PGNR.&OS=DN/20110246457&RS=DN/20110246457
Questions?
Bill Slawski
Webimax
@bill_slawski

 

Dernier (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 

Everything you wanted to know about crawling, but didn't know where to ask

  • 1. Local Search (Including Importance Metrics and Link Merging) Everything you wanted to know about Crawling* *But Didn't Know Where to Ask. Agile SEO Meetup – South Jersey, Monday, September 10, 2012, 7:00 PM to 9:00 PM. Bill Slawski, Webimax, @bill_slawski
  • 2. In the Early Days of the Web, there was an invasion
  • 4. Spiders Via Thomas Shahan - http://www.flickr.com/photos/opoterser/
  • 6. Invaded pages across the World Wide Web
  • 7. The Robots Mailing List was formed to solve the problem!
  • 8. Led by a young Martijn Koster, they developed the Robots.txt protocol
  • 9. Which Asked Robots to be Polite
  • 10. And Not Melt Down Internet Servers
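
A minimal sketch of what "polite" can mean in practice, using Python's standard `urllib.robotparser`. The robots.txt rules, the `HungryRobot` user agent, and the example.com URLs are made up for illustration, and `Crawl-delay` is a non-standard extension that only some crawlers honor.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for www.example.com.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def load_rules(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

rules = load_rules(ROBOTS_TXT)
# A polite robot asks before every fetch.
allowed = rules.can_fetch("HungryRobot", "http://www.example.com/index.htm")
blocked = rules.can_fetch("HungryRobot", "http://www.example.com/private/notes.htm")
# Seconds to wait between requests, or None if the site sets no delay.
delay = rules.crawl_delay("HungryRobot")
```

A crawler built this way would skip the `/private/` URL and pause between requests to the same host, which is the behavior the protocol asks for.
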
  • 11. A student at Stanford named Lawrence Page went on to co-author a paper on how robots might Crawl web pages to index important pages first. http://ilpubs.stanford.edu:8090/347/1/1998-51.pdf
  • 13. Important Web Pages 1. Contain words similar to a query that starts the crawl 2. Have a high backlink count 3. Have a high PageRank 4. Have a high forward link count 5. Are in or are close to the root directory for sites Image via Fir0002/Flagstaffotos under http://commons.wikimedia.org/wiki/Commons:GNU_Free_Documentation_License_1.2
  • 14. So most crawlers will not only be Polite, but they will also hunt down important pages first
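
The idea of hunting down important pages first can be sketched as a priority queue over the crawl frontier. The scoring function below is an assumption for illustration only: it combines backlink count, PageRank, and distance from the root directory with made-up weights, and it is not the formula from the paper.

```python
import heapq
from urllib.parse import urlparse

def url_depth(url):
    """Number of path segments; 0 means the site root."""
    path = urlparse(url).path
    return len([seg for seg in path.split("/") if seg])

def importance(url, backlinks, pagerank):
    """Toy score with hypothetical weights (not taken from the paper)."""
    return backlinks.get(url, 0) + 10.0 * pagerank.get(url, 0.0) - url_depth(url)

def crawl_order(urls, backlinks, pagerank):
    """Return URLs from most to least important using a heap."""
    # heapq is a min-heap, so scores are negated to pop the largest first.
    heap = [(-importance(u, backlinks, pagerank), u) for u in urls]
    heapq.heapify(heap)
    ordered = []
    while heap:
        _, url = heapq.heappop(heap)
        ordered.append(url)
    return ordered

# Made-up frontier and metrics for one site.
frontier = [
    "http://example.com/a/b/c",
    "http://example.com/",
    "http://example.com/about",
]
backlinks = {"http://example.com/": 5, "http://example.com/about": 1}
pagerank = {"http://example.com/": 0.5}
order = crawl_order(frontier, backlinks, pagerank)
```

With these numbers the root page is fetched first and the deeply nested page last.
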
  • 15. Search Engines filed patents on how they might crawl and collect content found on Web pages, including collecting URLs and Anchor Text associated with them. <a href="http://www.hungryrobots.com">Feed Me</a> http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7308643
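
Collecting URLs together with their anchor text can be sketched with Python's standard `html.parser`. This is an illustration of the general idea, not the method claimed in the patent; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

def extract_links(html):
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links

links = extract_links('<p>Hi <a href="http://www.hungryrobots.com">Feed Me</a></p>')
```
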
  • 16. Also, in one embodiment, the robots are configured to not follow "permanent redirects". Thus, when a robot encounters a URL that is permanently redirected to another URL, the robot does not automatically retrieve the document at the target address of the permanent redirect.
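
One way to read the embodiment quoted above: when a fetch returns a permanent redirect, the robot notes the target instead of requesting it straight away. In this sketch, queuing the target as a newly discovered URL is an assumption, as are the function name and return values. Status 308 is included because it was later standardized as a second permanent redirect code.

```python
from urllib.parse import urljoin

PERMANENT = {301, 308}

def process_fetch(url, status, headers, frontier):
    """Return 'index', 'deferred', or 'skip' for one fetch result."""
    if status in PERMANENT:
        target = headers.get("Location")
        if target:
            # Do not fetch the target now; leave it for the scheduler.
            frontier.append(urljoin(url, target))
        return "deferred"
    if status == 200:
        return "index"
    return "skip"

frontier = []
outcome = process_fetch(
    "http://example.com/old", 301, {"Location": "/new"}, frontier
)
```
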
  • 17. “Use a text browser such as Lynx to examine your site, because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site.”* *Google Webmaster Guidelines - http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769
  • 18. Google’s Webmaster Guidelines make crawlers look pretty unsophisticated, and incapable of much more than the simple Lynx browser… …But we have signs that crawlers can be smarter than that, and Microsoft introduced a Vision-based Page Segmentation Algorithm in 2003. Both Google and Yahoo have also published patents and papers that describe smarter crawlers. IBM filed a patent for a crawler in 2000 that is smarter than most browsers today.
  • 19. VIPS: a Vision-based Page Segmentation Algorithm - http://research.microsoft.com/apps/pubs/default.aspx?id=70027
  • 21. Link Merging: Web Site Structure Analysis - http://patft1.uspto.gov/netacgi/nph-Parser?patentnumber=7861151 • S-nodes – Structural link blocks: organizational and navigational link blocks, repeated across pages with the same layout and showing the organization of the site. They are often lists of links that don’t usually contain other content elements such as text. • C-nodes – Content link blocks: links grouped together by some kind of content association, such as relating to the same topic or sub-topic. These blocks usually point to information resources and aren’t likely to be repeated across more than one page. • I-nodes – Isolated links: links on a page that aren’t part of a link group and may be only loosely related to each other, for example by appearing together within the same paragraph of text. Each link on a page might be considered an individual I-node, or they might be grouped together by page as an I-node.
  • 23. Canonical = Best! There can be only one: http://example.com http://www.example.com http://example.com/ http://www.example.com/ https://example.com https://www.example.com https://example.com/ https://www.example.com/ http://example.com/index.htm http://www.example.com/index.htm https://example.com/index.htm https://www.example.com/index.htm http://example.com/INDEX.htm http://www.example.com/INDEX.htm https://example.com/INDEX.htm https://www.example.com/INDEX.htm http://example.com/Index.htm http://www.example.com/Index.htm https://example.com/Index.htm https://www.example.com/Index.htm
  • 24. Canonical Link Element <link rel="canonical" href="http://example.com/page.html"/>
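
Collapsing the URL variants above onto one canonical form can be sketched with `urllib.parse`. The preferences here (http scheme, www host, `index.htm` treated as the default document) are assumptions chosen for illustration; a real site has to pick its own.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit

INDEX_NAMES = {"index.htm"}  # assumed default documents for this site

def canonicalize(url, scheme="http"):
    """Map duplicate home-page style URLs onto one preferred form."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if not host.startswith("www."):
        host = "www." + host          # assumed preference: the www host
    path = parts.path or "/"
    last = path.rsplit("/", 1)[-1]
    if last.lower() in INDEX_NAMES:   # /INDEX.htm, /Index.htm -> /
        path = path[: len(path) - len(last)]
    # Port, credentials and fragment are dropped to keep the sketch short.
    return urlunsplit((scheme, host, path, parts.query, ""))

def canonical_link(url):
    """Render the canonical link element for a page URL."""
    return f'<link rel="canonical" href="{escape(canonicalize(url))}"/>'

variants = [
    "http://example.com",
    "https://www.example.com/",
    "http://example.com/index.htm",
    "https://www.example.com/INDEX.htm",
    "http://www.example.com/Index.htm",
]
unique = {canonicalize(u) for u in variants}
```

All five sample variants reduce to a single URL, which is the point of the "there can be only one" slide.
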
  • 25. rel="prev" and rel="next": On the first page, http://www.example.com/article?story=abc&page=1: <link rel="next" href="http://www.example.com/article?story=abc&page=2" /> On the second page, http://www.example.com/article?story=abc&page=2: <link rel="prev" href="http://www.example.com/article?story=abc&page=1" /> <link rel="next" href="http://www.example.com/article?story=abc&page=3" /> On the third page, http://www.example.com/article?story=abc&page=3: <link rel="prev" href="http://www.example.com/article?story=abc&page=2" /> <link rel="next" href="http://www.example.com/article?story=abc&page=4" /> And on the last page, http://www.example.com/article?story=abc&page=4: <link rel="prev" href="http://www.example.com/article?story=abc&page=3" />
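
The pattern on this slide (no prev on the first page, no next on the last) can be generated rather than hand-written. A small sketch, with a hypothetical function name; note that it writes `&` as `&amp;` inside the attribute, which HTML parsers read as the same URL.

```python
from html import escape

def pagination_links(base, page, last_page):
    """Return the prev/next link elements for one page of a series."""
    tags = []
    if page > 1:
        tags.append(f'<link rel="prev" href="{escape(base + str(page - 1))}" />')
    if page < last_page:
        tags.append(f'<link rel="next" href="{escape(base + str(page + 1))}" />')
    return tags

BASE = "http://www.example.com/article?story=abc&page="
first = pagination_links(BASE, 1, 4)
middle = pagination_links(BASE, 2, 4)
last = pagination_links(BASE, 4, 4)
```
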
  • 28. View All Pages. Option 1: • Normal prev/next sequence • Self-referential canonicals (each page points to its own URL) • Noindex meta element on the view-all page. Option 2: • Normal prev/next sequence • Canonicals (all pages use the view-all page URL). http://googlewebmastercentral.blogspot.com/2011/09/view-all-in-search-results.html
  • 30. Rel=“hreflang” HTML link element. In the HTML <head> section of http://www.example.com/, add a link element pointing to the Spanish version of that webpage at http://es.example.com/, like this: <link rel="alternate" hreflang="es" href="http://es.example.com/" /> HTTP header. If you publish non-HTML files (like PDFs), you can use an HTTP header to indicate a different language version of a URL: Link: <http://es.example.com/>; rel="alternate"; hreflang="es" Sitemap. Instead of using markup, you can submit language version information in a Sitemap.
  • 31. Rel=“hreflang” XML Sitemap <?xml version="1.0" encoding="UTF-8"?> <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml"> <url> <loc>http://www.example.com/english/</loc> <xhtml:link rel="alternate" hreflang="de" href="http://www.example.com/deutsch/" /> <xhtml:link rel="alternate" hreflang="de-ch" href="http://www.example.com/schweiz-deutsch/" /> <xhtml:link rel="alternate" hreflang="en" href="http://www.example.com/english/" /> </url> </urlset>
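
A Sitemap like the one above can be produced with Python's standard `xml.etree.ElementTree`. This sketch assumes that each language version gets its own `<url>` entry and that every entry lists all versions, itself included; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML = "http://www.w3.org/1999/xhtml"

def build_hreflang_sitemap(alternates):
    """alternates maps an hreflang code to the URL of that version."""
    ET.register_namespace("", SM)          # default namespace, no prefix
    ET.register_namespace("xhtml", XHTML)  # serialized as xhtml:link
    urlset = ET.Element(f"{{{SM}}}urlset")
    for loc in alternates.values():
        url = ET.SubElement(urlset, f"{{{SM}}}url")
        ET.SubElement(url, f"{{{SM}}}loc").text = loc
        for lang, href in alternates.items():
            ET.SubElement(
                url,
                f"{{{XHTML}}}link",
                {"rel": "alternate", "hreflang": lang, "href": href},
            )
    return ET.tostring(urlset, encoding="unicode")

sitemap_xml = build_hreflang_sitemap({
    "en": "http://www.example.com/english/",
    "de": "http://www.example.com/deutsch/",
    "de-ch": "http://www.example.com/schweiz-deutsch/",
})
```
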
  • 33. XML Sitemap • Use canonical links • Remove 404s • Don’t set priority past 1 week • If more than 50,000 URLs, use multiple Sitemaps and a Sitemap index • Validate with an XML Sitemap validator • Include a Sitemap statement in robots.txt http://www.sitemaps.org/
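
The 50,000-URL rule can be sketched as a chunking step plus a Sitemap index that lists each file. The file names, the page URLs, and the `sitemap-index.xml` location are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50000  # per-file limit quoted on the slide

def split_urls(urls, limit=MAX_URLS):
    """Split a URL list into chunks that each fit in one Sitemap file."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]

def build_index(sitemap_urls):
    """Build a Sitemap index document that lists each Sitemap file."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for url in sitemap_urls:
        entry = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    return ET.tostring(index, encoding="unicode")

# Hypothetical site with 120,001 pages.
pages = [f"http://www.example.com/page{i}.htm" for i in range(120001)]
chunks = split_urls(pages)
files = [f"http://www.example.com/sitemap-{n}.xml" for n in range(1, len(chunks) + 1)]
index_xml = build_index(files)

# The robots.txt statement that points crawlers at the index.
robots_line = "Sitemap: http://www.example.com/sitemap-index.xml"
```
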
  • 34. Crawling vs. XML Sitemaps: “Next, we study which of the two crawl systems, Sitemaps and Discovery, sees URLs first. We conduct this test over a dataset consisting of over five billion URLs that were seen by both systems. According to the most recent statistics at the time of the writing, 78% of these URLs were seen by Sitemaps first, compared to 22% that were seen through Discovery first.” Source: Sitemaps: Above and Beyond the Crawl of Duty – http://www.shuri.org/publications/www2009_sitemaps.pdf
  • 35. Crawling Social Media: Ranking of Search Results based on Microblog data - http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PG01&p=1&u=%2Fnetahtml%2FPTO%2Fsrchnum.html&r=1&f=G&l=50&s1=%2220110246457%22.PGNR.&OS=DN/20110246457&RS=DN/20110246457