10. How do we try to increase crawl frequency?
● Increase external link count (including links from social sites)
● List valuable pages in sitemaps and ping Google (see the sketch below)
● Increase internal link count (crawl paths)
● Create new pages and update older pages (avoid stagnation)
● Ensure pages are unique; reduce internal duplication
● Avoid internally linking to redirects or broken pages
● Testing. Lots of testing.
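For the sitemap ping, a minimal Python sketch; the sitemap URL is a hypothetical placeholder:

import urllib.request
from urllib.parse import quote

sitemap = "https://example.com/sitemap.xml"  # hypothetical placeholder
ping_url = "https://www.google.com/ping?sitemap=" + quote(sitemap, safe="")
urllib.request.urlopen(ping_url)  # a 200 response means Google received the ping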
11. What actions do SEOs take from log analysis?
● Optimize Googlebot crawl
○ restructure link architecture, apply directives, block via robots.txt
● Find server errors or Googlebot induced errors
○ Try to fix any 4xx and 5xx error codes (see the sketch after this list)
○ Use browser user-agent and referer fields to uncover the source of errors
● Understand Googlebot crawl rate & behaviour for SEO testing
○ Helpful for running tests, gathering insights, and questioning best practices
● Block badly behaving bots, prevent bandwidth drain
○ Look for hotlinking bandwidth drain, e.g. images hotlinked from porn sites
● Find unreported links through referer fields
○ Link crawlers don’t find every link; server logs are necessary for comprehensive audits
● Double check Analytics data
○ Helpful for correcting analytics setup or understanding why referers aren’t passed correctly
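As a sketch of the error hunt above: count 4xx/5xx responses per referer straight from the raw logs. This assumes the combined-style line format shown under Step 1 below; the quote-splitting is naive, and a proper parser follows there. The file name is a placeholder.

from collections import Counter

errors = Counter()
with open("access.log") as f:               # hypothetical file name
    for line in f:
        fields = line.split('"')            # naive split on quotes
        status = int(fields[2].split()[0])  # status code follows the request string
        referer = fields[3]
        if status >= 400:                   # 4xx and 5xx only
            errors[(status, referer)] += 1

for (status, referer), hits in errors.most_common(20):
    print(hits, status, referer)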
13. Step 1: Get the right fields logged
206.248.146.167 - - [25/Aug/2015:06:50:01 +0000] "GET /shoes HTTP/1.0" 200 251 "https://www.google.ca/" "example.com" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
● IP address: 206.248.146.167
● Date/time: 25/Aug/2015:06:50:01 +0000
● Method: GET
● Page: /shoes
● Response code: 200
● Response time: 251
● Referer: https://www.google.ca/
● Hostname: example.com
● User agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36
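A minimal Python parser for the line format shown above; a sketch only, and the pattern will need adjusting for other custom log formats:

import re

LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '  # IP and timestamp
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" '         # request line
    r'(?P<status>\d{3}) (?P<response_time>\S+) '      # response code and time
    r'"(?P<referer>[^"]*)" "(?P<hostname>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    match = LOG_RE.match(line)
    return match.groupdict() if match else None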
14. Step 2: Ensure the correct originating IP is logged
Load balancers, proxies or CDNs may overwrite the original IP of the request. Use the X-Forwarded-For header to ensure you have the original IP (a small sketch follows the links below).
IIS: http://www.loadbalancer.org/blog/iis-and-x-forwarded-for-header
Apache: http://www.loadbalancer.org/blog/apache-and-x-forwarded-for-headers
Nginx: https://easyengine.io/tutorials/nginx/forwarding-visitors-real-ip/
CloudFlare:
https://support.cloudflare.com/hc/en-us/sections/200805497-Restoring-Visitor-IPs
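If the X-Forwarded-For chain ends up in your logs rather than being restored by the server, a small sketch of picking the client IP (assumes your own proxies append entries honestly):

def client_ip(xff_header, remote_addr):
    # X-Forwarded-For looks like "client, proxy1, proxy2";
    # the left-most entry is the original client.
    if xff_header:
        return xff_header.split(",")[0].strip()
    return remote_addr  # no proxy involved; the socket IP is the client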
15. Step 3: Ensure we have all of the logs
● Triple check the hostname! If you’re analyzing the desktop site example.com, for instance, ensure you’re not counting the mobile version (m.example.com) or other subdomains (forum.example.com). Be very careful to get the right data or you will pull your hair out. Ask system administrators!
● If the server stores cached copies and serves them from another server, get
those logs too and combine them for the target domain analysis.
● Too much data? Ask for selective logging for Googlebot user agent only
16. Step 4: Parse the logs, grab Googlebot entries
https://www.splunk.com/en_us/download/splunk-light-2.html
17. Step 5: Verify Googlebot entries by DNS
1. Segment out logs with user-agent: Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html)
2. Take the original IP in the logs, example: 66.249.65.63
3. Reverse DNS lookup returns: crawl-66-249-65-63.googlebot.com
4. Forward DNS lookup of that hostname returns: 66.249.65.63 (confirmed!)
https://support.google.com/webmasters/answer/80553?hl=en
Software I use: http://www.nirsoft.net/utils/ipnetinfo.html
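Steps 2-4 can also be scripted; a minimal sketch using only Python's standard library:

import socket

def is_real_googlebot(ip):
    try:
        # Reverse DNS: 66.249.65.63 -> crawl-66-249-65-63.googlebot.com
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward DNS on that hostname must resolve back to the original IP
        return socket.gethostbyname(host) == ip
    except (socket.herror, socket.gaierror):  # missing or failed DNS records
        return False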
18. Note to self: Look out for Google mobile user agents
Mozilla/5.0+(iPhone;+CPU+iPhone+OS+6_0+like+Mac+OS+X)+AppleWebKit/536.26
+(KHTML,+like+Gecko)+Version/6.0+Mobile/10A5376e+Safari/8536.25+(compatible;
+Googlebot/2.1;++http://www.google.com/bot.html)
This is a verified Googlebot from 66.249.65.63, but it’s not listed on the official
crawlers page.
Official Google: Mobile-first Indexing
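Because of variants like this, segment logs by matching the Googlebot token rather than comparing against the full UA string; a small sketch (note that IIS logs may encode spaces as '+'):

import re

GOOGLEBOT_RE = re.compile(r"Googlebot/\d+\.\d+")

def looks_like_googlebot(user_agent):
    # A token match anywhere in the string catches the mobile variant above,
    # but real verification still requires the DNS check from Step 5.
    return bool(GOOGLEBOT_RE.search(user_agent))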
19. Step 6: Merge crawl data with clean logs
● Crawl as: Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html) and a popular browser user agent
● Crawler config: disobey robots.txt, crawl all non-HTML, crawl internal
nofollow links, crawl canonicals & sitemaps, ideally with JS enabled
● Fields required: URL, Response code, Title, Robots directives (blocked,
noindex, nofollow etc.), Canonical, Page size, response time, crawl level,
number of internal links to page
Try DeepCrawl for free bit.ly/freecrawl - 25,000 credits for Untagged.io
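A sketch of the merge itself with pandas; the file names and column names here are assumptions, not any tool's actual export format:

import pandas as pd

logs = pd.read_csv("googlebot_hits.csv")  # verified Googlebot entries: url, status, ...
crawl = pd.read_csv("crawl_export.csv")   # crawler export: url, response_code, title, canonical, ...

# Aggregate hits per URL, then left-join onto the crawl so pages Google
# never requested survive the merge with a hit count of 0.
hits = logs.groupby("url").size().reset_index(name="googlebot_hits")
merged = crawl.merge(hits, on="url", how="left")
merged["googlebot_hits"] = merged["googlebot_hits"].fillna(0).astype(int)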
20. Step 7: Add Web Analytics data
● Ensure the URLs correspond correctly (special characters, full URL)
● Ensure the date period is exactly the same period as server logs
● Use data from source/medium = Google/Organic only
DeepCrawl can merge both crawl data and analytics data from Google Analytics.
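URL mismatches are the usual failure point; a hedged normalization sketch, assuming Google Analytics reports path-only page dimensions while logs and crawls may carry full URLs:

from urllib.parse import urlsplit, unquote

def normalize(url):
    # Reduce everything to a decoded path (+ query), the form the three
    # datasets are most likely to share; adjust for your own setup.
    parts = urlsplit(url)
    path = unquote(parts.path) or "/"
    return path + ("?" + parts.query if parts.query else "")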
21. So far...
● Have all logs from the right host with the right fields
● Have the original IP addresses
● Confirmed real Googlebot visits
● Merged crawl data and analytics data perfectly
22. Just when you think all the data is correct, something will go wrong, guaranteed ;)
23. Real example, small site:
http://www.campgroundsigns.com/
7 million events from the load balancer (IIS custom-format access logs) = 1.6 GB of data
13,000 Googlebot events over 28 days
1,129 pages are indexable on campgroundsigns.com
24. Caveat!
The following observations are based on one small website. They apply only to this site and are not representative.
Each website and its Googlebot crawl activity are different.
Special thanks to campgroundsigns.com for volunteering for the analysis.
35. Did Google crawl the right pages?
Indexable defined as: response code 200; no robots.txt block; self-referencing canonical, or no canonical in the head or HTTP header; no noindex directives in the head or HTTP header; no directives applied in GSC parameter config; no removal request; not a JS/CSS or other resource file. Not indexable: a non-200 response, or failing any of the above.
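As a predicate, that definition might look like the sketch below; the field names are assumptions based on a typical crawl export, and GSC parameter settings and removal requests still have to be checked separately:

def is_indexable(page):
    return (
        page["response_code"] == 200
        and not page["blocked_by_robots"]
        and page["canonical"] in (None, page["url"])    # self-referencing or absent
        and "noindex" not in page["robots_directives"]  # head or HTTP header
        and not page["is_resource"]                     # JS/CSS/other resource files
    )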
36. Generally, we see reduced crawl activity on pages with NOINDEX.
There’s something wrong.
PLA = Product Listing Ad.
37. We tried to block the PLA pages to divert attention to important pages:
38. Based on a 4-day, Monday-Thursday period before and after the block
Errr, go back, quick.
                           All requests      Unique pages crawled
                           Before   After    Before   After
PLA (blocked by robots)    1334     0        703      0
Department or other page   404      212      270      124
Product page               605      247      452      177
Resource                   332      406      50       61
Homepage                   15       15       1        1
Totals                     2690     880      1476     363
Difference                          -67%              -75%
39. Turns out, Google uses its regular Googlebot crawler to crawl PLA pages, not AdsBot.
It was a mistake blocking these. We’ll try canonicals next.
https://support.google.com/merchants/answer/160156?hl=en
51. More realistic, still estimated, but slightly less bullshit:
● 766 unique, indexable pages were crawled over 28 days
● That gives an average of 27 unique pages crawled per day (766 / 28 ≈ 27)
● 1,129 total indexable pages / 27 per day = a minimum of 42 days for a full recrawl
Remember, this is purely an estimate.
52. That doesn’t even account for how many times Google has to figure out a 301 redirect.
53. Same calculation, different site (with approx. 86,000 indexable pages)
This is not representative of any other site.
57. If it seems Google isn’t respecting robots.txt, check: there can be a 10-day lag!
58. Server log analysis is hard. Here’s why:
● Data size challenges, for example: 7 million events = 1.6 GB (and that’s tiny)
● Lots of different servers logging with custom formats
● Often, obtaining them means overcoming both people problems and technical challenges
● Any small mistake in combining crawl, analytics and Search Console data can make the entire analysis useless
● Combining large datasets requires either some form of programming or
technical knowledge; it’s not for everyone.
● Many available tools aren’t comprehensive enough for SEO purposes yet.
That being said, they are the best thing since patatas bravas
con alioli.
59. Things that can corrupt your results
● Thinking you’re seeing Googlebot but it’s not really Googlebot
● Not accounting for robots.txt changes or other directive changes in crawl data during the logging period
● Incorrect field mapping, e.g. mistaking the referer for the requested page
● Incorrect merging of crawl and analytics data
60. Helpful links for log analysis
Guides:
● A Complete Guide to Log Analysis with Big Query - Dominic Woodman
● The Ultimate Guide to Log File Analysis - Daniel Butler
● SEO Finds in Your Server Log - Tim Resnik
● How to Use Server Log Analysis for Technical SEO - Samuel Scott
Software:
● Splunk
● SEO Log File Analyser
● Logz.io
● Botify