Gathering accurate index status data for websites with thousands or millions of URLs has historically been impossible. In this talk, I will show you a hands-on approach to getting this data using Node.js, and how you can use it to inform your SEO strategy.
Checking Google Index Status at Scale using Node.js - Jose Hernando - BrightonSEO Oct 2020
1. Checking Google Index status at scale with Node.js
Checking Google Index status at scale with Node.js
Jose Luis Hernando
@jlhernando #BrightonSEO
Senior Technical SEO Consultant
2. Checking Google Index status at scale with Node.js
Today’s agenda
1. Why it’s important to know your website’s indexing status
2. The challenge of extracting this data
3. Getting the data with Node.js – Live Demo!
4. Using this data for your SEO strategy
3. Checking Google Index status at scale with Node.js
Why is it important?
Reason #1
Not in the Index => Not in the SERPs
Icons from Google, Flaticon & Sitecheckerpro
4. Checking Google Index status at scale with Node.js
Why is it important?
Reason #2
Google evaluates site quality based on indexed pages
Sources:
Google Only Can Judge Site Quality Based On Pages They Index – Barry Schwartz (Search Engine Roundtable)
English Google Webmaster Central office-hours hangout – Google Webmasters YouTube Channel
Low Quality Pages:
• Uncontrolled Faceted Navigation URLs
• Unsupervised User Generated Content
• Indexable Non-Canonical URLs
High Quality Pages:
• Category Pages
• Editorial Pages
• Canonical Product Pages
5. Checking Google Index status at scale with Node.js
Why is it important?
Reason #3
Inefficient use of Google’s resources
https://website.com/category-one/ (HTML, CSS, JS)
/category-one/?color=red
/category-one/?color=blue
/category-one/?color=red&blue
… ∞
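The slide illustrates how uncontrolled faceted navigation multiplies URLs. A minimal sketch of why, with hypothetical facet values: every non-empty combination of values on a single parameter produces its own crawlable URL, so n values yield 2^n − 1 variants before you even add a second facet.

```javascript
// Sketch: enumerate every non-empty combination of facet values
// for a single parameter (hypothetical category and values).
function facetUrls(base, param, values) {
  const urls = [];
  const total = 2 ** values.length; // every subset of the value list
  for (let mask = 1; mask < total; mask++) {
    const picked = values.filter((_, i) => mask & (1 << i));
    urls.push(`${base}?${param}=${picked.join('&')}`);
  }
  return urls;
}

const urls = facetUrls('https://website.com/category-one/', 'color', ['red', 'blue', 'green']);
// 3 values on one facet already give 2^3 - 1 = 7 parameterised variants;
// each extra facet multiplies the count again, hence the "∞" on the slide.
```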
6. Checking Google Index status at scale with Node.js
Site size   Avg. Crawl Ratio (%)   Avg. Active Ratio (%)
1-10k       71.7                   45.3
10k-100k    54.3                   30.2
100k-1M     41.7                   15.1
1M+         34.4                   10.1
Source: How Does Google Crawl the Web? – (Annabelle Bouard & Dimitri Brunel – Botify)
Crawl Ratio: percentage of pages crawled by Google in 30 days.
Active Ratio: percentage of pages that have generated at least one organic visit in 30 days.
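Both ratios can be computed directly from your own 30-day counts. A minimal sketch, where `total`, `crawled` and `active` are hypothetical figures from your crawl data and analytics, not from the Botify study:

```javascript
// Crawl Ratio: % of known pages crawled by Google in 30 days.
// Active Ratio: % of known pages with at least one organic visit in 30 days.
const ratio = (part, total) => Math.round((part / total) * 1000) / 10; // 1 decimal

const site = { total: 12000, crawled: 8604, active: 5436 }; // hypothetical counts
const crawlRatio = ratio(site.crawled, site.total);
const activeRatio = ratio(site.active, site.total);
```

With these example counts the site lands close to the study's 1-10k bucket averages (71.7% crawled, 45.3% active).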
How much of your site is Googlebot crawling?
7. Checking Google Index status at scale with Node.js
The challenge of extracting this data
• Googlebot’s crawling behaviour doesn’t determine indexing status
8. Checking Google Index status at scale with Node.js
The challenge of extracting this data
• Googlebot’s crawling behaviour doesn’t determine indexing status
• You rely on partial and sometimes inaccurate data points:
  • site: & inurl: operators
  • GSC indexing reports:
    • URL Inspection Tool (< 200 URLs/day)
    • Coverage Reports (< 1,000 rows/report)
11. Checking Google Index status at scale with Node.js
{Live demo}
bit.ly/google-index-checker-script
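The demo script (linked above) works by sending a `site:` query for each URL through a scraping API and checking whether the URL comes back in the results. A minimal sketch of the idea only; the real implementation is at the bit.ly link, and the endpoint, `apiKey` and result-parsing logic here are placeholder assumptions, not the script's actual configuration.

```javascript
// Build a Google query that should return the URL only if it is indexed.
const buildQuery = (url) =>
  `https://www.google.com/search?q=site:${encodeURIComponent(url)}&num=1`;

// Naive status check: indexed if the exact URL appears in the result HTML.
const isIndexed = (html, url) => html.includes(url);

// Hypothetical request routed through a scraping API (placeholder endpoint).
async function checkUrl(url, apiKey) {
  const endpoint = `http://api.scraperapi.com?api_key=${apiKey}&url=${encodeURIComponent(buildQuery(url))}`;
  const res = await fetch(endpoint); // Node 18+ global fetch
  const html = await res.text();
  return { url, indexed: isIndexed(html, url) };
}
```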
12. Checking Google Index status at scale with Node.js
Quick FYI
Using the following method goes against Google’s Terms of Service, as it automatically sends search queries to Google Search.
13. Checking Google Index status at scale with Node.js
Our script outperforms every other method available
14. Checking Google Index status at scale with Node.js
How can you use Google index data?
• Identify holes in your architecture: check for pages from your site that should be indexed but are not.
• Identify inefficient use of crawl budget: find pages that should not be indexed but are indexed.
• Error prioritisation: detect pages that used to exist and now return an error (4xx) but are still indexed.
15. Checking Google Index status at scale with Node.js
Use case #1: Sitemap Health Check
How many URLs from your XML sitemap are indexed?
Sitemaps = 111,772 URLs
• 200 Status Code – 81,688 URLs: 74,223 indexed (80%), 7,465 not indexed
(Chart: Google Index Status of 2xx URLs from Sitemap)
Inspired by Data Secrets of the Index Coverage Report – AJ Kohn
16. Checking Google Index status at scale with Node.js
Use case #1: Sitemap Health Check
How many URLs from your XML sitemap are indexed?
Sitemaps = 111,772 URLs
• 200 Status Code – 81,688 URLs: 80% indexed
• 404 Status Code – 29,969 URLs: 6,268 indexed (21%), 23,701 not indexed
(Chart: Google Index Status of 4xx URLs from Sitemap)
Inspired by Data Secrets of the Index Coverage Report – AJ Kohn
17. Checking Google Index status at scale with Node.js
Use case #1: Sitemap Health Check
How many URLs from your XML sitemap are indexed?
Sitemaps = 111,772 URLs
• 200 Status Code – 81,688 URLs: 80% indexed
• 404 Status Code – 29,969 URLs: 21% indexed
• 301 Status Code – 365 URLs: 16 indexed (4%), 349 not indexed
(Chart: Google Index Status of 3xx URLs from Sitemap)
Inspired by Data Secrets of the Index Coverage Report – AJ Kohn
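The breakdowns above can be reproduced from the checker's output by grouping rows on status-code class and indexing flag. A sketch only; the `status`/`indexed` field names are assumptions about the CSV shape, not necessarily the script's actual column headers.

```javascript
// Group checker results by status-code class and count indexed vs not.
function indexBreakdown(rows) {
  const groups = {};
  for (const { status, indexed } of rows) {
    const klass = `${String(status)[0]}xx`; // 200 -> "2xx", 404 -> "4xx"
    if (!groups[klass]) groups[klass] = { indexed: 0, notIndexed: 0 };
    const g = groups[klass];
    indexed ? g.indexed++ : g.notIndexed++;
  }
  for (const g of Object.values(groups)) {
    g.pctIndexed = Math.round((g.indexed / (g.indexed + g.notIndexed)) * 100);
  }
  return groups;
}

const sample = [
  { status: 200, indexed: true },
  { status: 200, indexed: false },
  { status: 404, indexed: false },
  { status: 301, indexed: true },
];
const breakdown = indexBreakdown(sample);
```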
18. Checking Google Index status at scale with Node.js
Sitemap Health Check – Next Steps
1) Identify whether these URLs are important to your site’s bottom line
2) Check if a pool of these URLs has issues in GSC’s Index Coverage Report
3) Choose a tactic to improve the visibility of these URLs
4) Isolate the relevant URLs and modify the existing sitemap, or create a new-sitemap.xml to monitor progress
19. Checking Google Index status at scale with Node.js
Use case #2: Log File Analysis+
How many URLs with Googlebot hits are indexed?
• ~160k Googlebot hits to non-canonical URLs (/Uppercase/ vs /lowercase/)
• Identified whether the non-canonical URLs were indexed
• Identified whether the referenced canonical URLs were indexed
(Chart: Indexed Non-Canonical URLs Requested by Googlebot – 35.8% indexed, 64.2% not indexed)
Undisclosed Client
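In this case the non-canonical duplicates were uppercase variants of lowercase canonicals. Spotting them in a log extract is a simple grouping exercise. A sketch, assuming the canonical form is just the lowercased URL (the example paths are invented, not the client's):

```javascript
// Group logged URLs by their lowercased form and flag any group
// that contains more than one casing of the same path.
function findCaseVariants(urls) {
  const byCanonical = new Map();
  for (const url of urls) {
    const canonical = url.toLowerCase();
    if (!byCanonical.has(canonical)) byCanonical.set(canonical, new Set());
    byCanonical.get(canonical).add(url);
  }
  const variants = {};
  for (const [canonical, forms] of byCanonical) {
    if (forms.size > 1) variants[canonical] = [...forms];
  }
  return variants;
}

const hits = [
  '/category/Garden-Furniture/',
  '/category/garden-furniture/',
  '/category/sofas/',
];
const dupes = findCaseVariants(hits);
```

Each flagged group can then be fed to the index checker twice: once for the canonical form and once for the variant, matching the two checks described on the slide.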
20. Checking Google Index status at scale with Node.js
Log File Analysis+ – Next Steps
1) Identify whether the canonical tag is correctly placed
2) Identify whether the root cause is internal linking, external linking or something else
3) Consider redirecting non-canonical URLs to canonical URLs
4) Create a new-sitemap.xml with the problematic URLs to encourage Googlebot to revisit them and for monitoring purposes
21. Checking Google Index status at scale with Node.js
Other use cases – Inform your SEO strategy
• Check real-time indexing (news sites, offer sites, job boards)
• Check uncontrolled faceted navigation (crawl budget optimisation)
• Check inactive product/category URLs (site architecture improvements)
• Check old 4xx URLs that are live now & haven’t been deindexed yet (recover organic opportunities)
22. Checking Google Index status at scale with Node.js
Further reading
https://bit.ly/google-index-checks
23. Checking Google Index status at scale with Node.js
Further reading
https://bit.ly/gsc-index-coverage
24. Checking Google Index status at scale with Node.js
The Google Index Checker script opens the door to useful, actionable data at scale for your sites.
Use it, and act on it.
25. Checking Google Index status at scale with Node.js
Thank you.
builtvisible.com
Jose Luis Hernando
Senior Technical SEO Consultant
@jlhernando
26. Checking Google Index status at scale with Node.js
Sources & additional reading
How Does Google Crawl the Web? – Annabelle Bouard & Dimitri Brunel (Botify)
English Google Webmaster Central office-hours hangout – Google Webmasters YouTube Channel
Google Only Can Judge Site Quality Based On Pages They Index – Barry Schwartz (Search Engine Roundtable)
Data Secrets of the Index Coverage Report – Blind Five Year Old (AJ Kohn)
How Google Search Works – Google Documentation
How Search organises information – Google Documentation
Our new search index: Caffeine – Carrie Grimes
When indexing goes wrong: how Google Search recovered from indexing issues & lessons learned since – Vincent Courson, Google Search Outreach
How Search Engines Work: Crawling, Indexing & Ranking – Moz
(Please) Stop Using Unsafe Characters in URLs – Jeff Starr
Editor’s notes
Technical SEO Consultant at Builtvisible.
Builtvisible is a digital marketing agency focusing exclusively on organic performance. We are specialists in Technical SEO, Content Strategy, Digital PR and Analytics, and we work primarily with medium and large-scale sites targeting both national and global audiences online.
If you’re not in Google’s index, you will not appear in Google SERPs.
To appear in search results, Google has to discover, crawl, render and index your website’s pages.
Only once you’re in the index are you eligible to appear in SERPs and acquire users through organic search.
If you don’t know which pages are indexed, you don’t know which pages can acquire users organically.
Pages that you’ve probably spent lots of time customising to serve users will be evaluated in the same way as low-quality pages that are indexable:
Uncontrolled faceted navigation
User generated content
Non-canonicals
If, for example, you have an e-commerce site with uncontrolled faceted navigation, Googlebot has to crawl and render each of those URLs (and their resources) to evaluate whether they contain valuable information for a future user query.
Since this is not controlled, it can go on ad infinitum, wasting Google’s resources on URLs that are very likely less valuable than others in your site architecture.
Crawling is a key step in the indexing pipeline. For Google to index your site, it needs to crawl it. But how much of your site is Googlebot actually crawling?
According to a Botify study of 270 sites with different architecture sizes, certainly not all of it.
The graph introduces two important concepts: Crawl Ratio and Active Ratio (explain).
If your site has fewer than 10k URLs, Google crawls on average 71% of it, and only 45% of pages get organic clicks.
As website size increases, the rate at which Googlebot crawls your site declines further and further.
To the point where, if your site has more than 1M URLs, Googlebot crawls on average only 34% of it, and only 10% of those URLs get clicks from organic search.
Challenges:
1) Even if you are lucky enough to have regular access to your logs, Googlebot’s crawling behaviour doesn’t determine indexing status – you cannot guarantee that URLs that have not received clicks from Google Search are actually part of Google’s index.
2) If you don’t have access to server logs you have even less data, and rely on the information Google provides through:
  a) site: & inurl: operators – a rough estimate for site-wide numbers, and often inaccurate for individual URLs.
  b) Google Search Console reports – the URL Inspection Tool (great, but you hit the quota limit after 200 URLs, making it fairly pointless to automate) and the Coverage/Sitemap Coverage reports (great, but GSC only allows 1,000 rows of data per report).
Download our Google Index Checker script from GitHub – developed by our Senior Developer Alvaro Fernandez.
Download/update Node.js.
The script relies on ScraperAPI to get information from Google Search. It is super easy to use, and you can sign up for free to get an API key.
Concurrent requests are limited to 5 (the ScraperAPI Free Plan maximum), but Al has built a function that automatically adapts concurrency to your plan tier’s limit.
Unlimited number of URLs.
Perfect for clean URLs, but it can also process parameterised URLs, case-sensitive URLs, internationally encoded characters, and reserved/unreserved symbols.
Recycling feature.
Nice overview of the index status check when it finishes.
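The notes mention a function that caps concurrent requests at the ScraperAPI plan limit. A generic worker pool like the following achieves this (a sketch of the pattern; the real script's implementation may differ):

```javascript
// Run async tasks with at most `limit` in flight at once.
async function runPool(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++;           // claim the next task index
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, tasks.length) }, worker));
  return results;
}

// Example: 10 dummy "requests", never more than 5 in flight.
let inFlight = 0, maxInFlight = 0;
const tasks = Array.from({ length: 10 }, (_, i) => async () => {
  inFlight++;
  maxInFlight = Math.max(maxInFlight, inFlight);
  await new Promise((r) => setTimeout(r, 5)); // stand-in for an API call
  inFlight--;
  return i;
});
const results = await runPool(tasks, 5);
```

Raising `limit` to match a paid ScraperAPI tier is then a one-line change.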
Download your XML sitemap(s) using your preferred crawler (SF, DC, OC, SB).
Get your list of URLs, create a urls.csv file, and add it to the Google Index Checker.
Once it’s finished, you will get a CSV file with your results and can find out how much of your sitemap is indexed.
In this example I’ve taken argos.co.uk because it is a large e-commerce site with a mix of normal URLs and URLs with unsafe characters.
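The notes suggest exporting sitemap URLs with a crawler, but a standard sitemap can also be reduced to a urls.csv with a few lines of Node. A sketch, assuming a plain `<urlset>` sitemap already fetched as a string (the example XML is invented):

```javascript
// Extract <loc> entries from sitemap XML – one URL per line for urls.csv.
function sitemapUrls(xml) {
  return [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);
}

const xml = `<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/page-1/</loc></url>
</urlset>`;

const urls = sitemapUrls(xml);
const csv = urls.join('\n'); // write this out as urls.csv with fs.writeFileSync
```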
We found ~160k non-canonical category pages with a significant number of Googlebot requests.
The problem was that the non-canonical URLs contained an uppercase character that wasn’t supposed to be there.
Firstly, we wanted to identify whether these pages were indexed.
Secondly, we wanted to know whether the non-canonical URLs were being indexed instead of the canonicals.
In the end we found that approximately 36% of the non-canonical URLs were indexed instead of their canonicals.