London Web Performance Meetup - 10th November 2020
There is a lot of talk about web performance as a ranking signal in search engines and how important it is or isn't, but people often overlook how performance affects multiple phases of a search engine, such as crawling, rendering, and indexing.
In this talk, we'll try to understand how a search engine works and how some aspects of web performance affect the online presence of a website.
1. Web Performance & Search Engines
A look beyond rankings
2020/11/10
@giacomozecchini
2. Hi, I’m Giacomo Zecchini
Technical SEO @ Verve Search
Technical background and previous experiences in development
Love: understanding how things work and Web Performance
@giacomozecchini
4. We are going to talk about...
● How Web Performance Affects Rankings
● How Search Engines Crawl and Render pages
● How It Affects Your Website
@giacomozecchini
8. Photo by Sam Balye on Unsplash
Let’s talk about the
elephant in the room
9. Search engines have been using, and talking about,
speed as a ranking factor for a while
● Using site speed in web search ranking
https://webmasters.googleblog.com/2010/04/using-site-speed-in-web-search-ranking.html
● Is your site ranking rank? Do a site review
https://blogs.bing.com/webmaster/2010/06/24/is-your-site-ranking-rank-do-a-site-review-part-5-sem-101
● Using page speed in mobile search ranking
https://webmasters.googleblog.com/2018/01/using-page-speed-in-mobile-search.html
@giacomozecchini
10. Bing - “How Bing ranks your content”
Page load time: Slow page load times can lead a visitor to leave your
website, potentially before the content has even loaded, to seek
information elsewhere. Bing may view this as a poor user experience and
an unsatisfactory search result. Faster page loads are always better, but
webmasters should balance absolute page load speed with a positive,
useful user experience.
https://www.bing.com/webmaster/help/webmaster-guidelines-30fba23a
@giacomozecchini
11. Yandex - “Site Quality”
“How do I speed up my site? The speed of page loading is an
important indicator of a site's quality. If your site is slow, the user may
not wait for a page to open and switch to a different site. This
undermines their trust in your site, affects traffic and other statistical
indicators.”
https://yandex.com/support/webmaster/yandex-indexing/page-speed.html
@giacomozecchini
12. Google - “Evaluating page experience for a better
web”
“Earlier this month, the Chrome team announced Core Web Vitals, a
set of metrics related to speed, responsiveness and visual stability, to
help site owners measure user experience on the web.
Today, we’re building on this work and providing an early look at an
upcoming Search ranking change that incorporates these page
experience metrics.”
https://webmasters.googleblog.com/2020/05/evaluating-page-experience.html
@giacomozecchini
14. Is speed important for ranking?
Google’s Webmaster Trends Analyst
https://twitter.com/methode/status/1255224116648476675
@giacomozecchini
15. Is speed important for ranking?
There are hundreds of ranking signals; speed is one of them, but not the
most important one.
An empty page would be damn fast but not that useful.
@giacomozecchini
16. Where does Google get data from for Core Web
Vitals?
@giacomozecchini
17. Where does Google get data from for Core Web Vitals?
● Real field data, something similar to the Chrome User Experience
Report (CrUX)
Likely a raw version of CrUX that may contain all the
“URL-Keyed Metrics” that Chrome records.
https://youtu.be/7HKYsJJrySY?t=45
https://source.chromium.org/chromium/chromium/src/+/master:tools/metrics/ukm/ukm.xml
@giacomozecchini
19. CrUX - Chrome User Experience Report
The Chrome User Experience Report provides user experience metrics
for how real-world Chrome users experience popular destinations on
the web. It’s powered by real user measurement of key user experience
metrics across the public web.
https://developers.google.com/web/tools/chrome-user-experience-report
@giacomozecchini
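To make this concrete, here is a minimal sketch of querying the CrUX API for one URL's field data. The endpoint and request shape follow the public Chrome UX Report API documentation; the API key and example URL are assumptions for the sketch.

```typescript
// Minimal sketch: fetch field data for one URL from the CrUX API.
const CRUX_ENDPOINT =
  "https://chromeuserexperiencereport.googleapis.com/v1/records:queryRecord";

async function queryCrux(url: string, apiKey: string) {
  const res = await fetch(`${CRUX_ENDPOINT}?key=${apiKey}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, formFactor: "PHONE" }),
  });
  if (!res.ok) {
    // A 404 here typically means the URL is below the popularity threshold.
    throw new Error(`CrUX API error: ${res.status}`);
  }
  const { record } = await res.json();
  // p75 is the value CrUX reports against the Core Web Vitals thresholds.
  console.log(record.metrics.largest_contentful_paint.percentiles.p75);
}
```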
20. What if I’m not in CrUX?
CrUX uses a threshold related to the usage of specific websites: if there
is less data than that threshold, websites or pages are not included in the
BigQuery / API dataset.
We can end up with:
● No data for a single page
● No data for the whole origin / website
@giacomozecchini
22. What if CrUX has no data for my pages?
@giacomozecchini
23. What if CrUX has no data for my pages?
If the URL structure is easy to understand and the website can be split
into multiple parts by looking at the URL, Google might group pages by
subfolder or URL pattern, grouping URLs that have similar content and
resources.
If that is not possible, Google may use the aggregate data across the
whole website.
https://youtu.be/JV7egfF29pI?t=848
@giacomozecchini
24. What if CrUX has no data for my pages?
https://www.example.com/forum/thread-1231
This URL may use the aggregate data of URLs with a similar /forum/
structure.
https://www.example.com/fantastic-product-98
This URL may use the subdomain’s aggregate data.
You should remember this when planning a new website.
@giacomozecchini
25. What if CrUX has no data for my pages?
Looking at the Core Web Vitals Report in Search Console, you can
check how Google is already grouping “similar URLs” of your website.
@giacomozecchini
26. What if CrUX has no data for my website?
@giacomozecchini
27. What if CrUX has no data for my website?
This is not really clear at the moment.
Possible solutions:
● Not using any positive or negative value for the Core Web Vitals
● Using data over a longer period of time to have enough data
(BigQuery CrUX data is aggregated on a monthly basis; the API uses
the last 28 days of aggregated data)
● Lab data, calculating a theoretical speed
@giacomozecchini
32. What if CrUX has no data for my website?
We might have more information on this when Google starts using
Core Web Vitals in Search (May 2021).
https://webmasters.googleblog.com/2020/11/timing-for-page-experience.html
@giacomozecchini
37. What about AMP?
● AMP is not a ranking factor, never has been
● Google will remove the AMP requirement from Top Stories
eligibility in May 2021
https://webmasters.googleblog.com/2020/05/evaluating-page-experience.html
@giacomozecchini
40. We can split what a Search Engine does into two
main parts:
● What happens when a user searches for something
● What happens in the background, ahead of time
@giacomozecchini
41. What happens when a user searches for something
When a Search Engine gets a query from a user, it starts processing it,
trying to understand the meaning behind the search, retrieving and
scoring documents in the index, and eventually serving a list of results
to the user.
@giacomozecchini
42. What happens in the background ahead of time
To be able to serve users pages that match their queries, a search
engine has to:
● Crawl the web
● Analyse crawled pages
● Build an index
@giacomozecchini
49. Even if your pages are being crawled, it
doesn't mean they will be indexed.
Having your pages indexed doesn't mean
they will rank.
@giacomozecchini
50. Crawler
“A Web crawler, sometimes called a spider or spiderbot and often
shortened to crawler, is an Internet bot that systematically browses the
World Wide Web, typically for the purpose of Web indexing (web
spidering).”
https://en.wikipedia.org/wiki/Web_crawler
@giacomozecchini
51. Crawler
Features it must have:
● Robustness
● Politeness
@giacomozecchini
Features it should have:
● Distributed
● Scalable
● Performance and efficiency
● Quality
● Freshness
● Extensible
53. Crawler - Politeness
Politeness can be:
● Explicit - Webmasters can define what portions of a site can be
crawled using the robots.txt file (see the sketch below)
https://tools.ietf.org/html/draft-koster-rep-00
● Implicit - Search Engines should avoid requesting any site too often;
they have algorithms to determine the optimal crawl speed for a
site.
@giacomozecchini
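As a sketch of explicit politeness, a minimal robots.txt might look like this (the path is hypothetical; note that Bing and Yandex honour Crawl-delay, while Googlebot ignores it and takes its rate limit from Search Console):

```
# Keep all crawlers out of internal search results
User-agent: *
Disallow: /search/
# Honoured by Bing / Yandex, ignored by Googlebot
Crawl-delay: 5
```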
55. Crawler - Politeness - Crawl Rate
Crawl Rate defines the max number of parallel connections and the min
time between fetches.
Together with the Crawl Demand (Popularity + Staleness), it forms the
Crawl Budget.
https://webmasters.googleblog.com/2017/01/what-crawl-budget-means-for-googlebot.html
@giacomozecchini
56. Crawler - Politeness - Crawl Rate
Crawl Rate is based on the Crawl Health and the limit you can manually
set in Search Console.
Crawl Health depends on server response time.
If the server is fast to answer, the crawl rate goes up. If the server slows
down, or starts emitting a significant number of 5xx errors or connection
timeouts, crawling slows down.
@giacomozecchini
57. Crawler - Performance and Efficiency
A crawler should make efficient use of resources such as processor,
storage, and network bandwidth.
@giacomozecchini
65. A crawler should make efficient use of resources: using HTTP
persistent connections, also called HTTP Keep-Alive connections,
helps keep robots (or threads) busy and saves time.
Reusing the same TCP connection gives crawlers advantages such
as lower latency on subsequent requests, less CPU usage (no repeated
TLS handshakes), and reduced network congestion.
@giacomozecchini
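A minimal sketch of what that looks like in practice, using Node's built-in https module (the host and paths are hypothetical): the keep-alive agent reuses one socket, so only the first request pays the TCP + TLS handshake cost.

```typescript
import https from "node:https";

// keepAlive reuses the TCP (and TLS) connection across requests.
const agent = new https.Agent({ keepAlive: true, maxSockets: 1 });

function fetchPage(url: string): Promise<string> {
  return new Promise((resolve, reject) => {
    https
      .get(url, { agent }, (res) => {
        let body = "";
        res.on("data", (chunk) => (body += chunk));
        res.on("end", () => resolve(body));
      })
      .on("error", reject);
  });
}

(async () => {
  // All three requests reuse the same persistent connection:
  // only the first one pays for the TCP + TLS handshake.
  for (const path of ["/", "/about", "/contact"]) {
    await fetchPage(`https://www.example.com${path}`);
  }
  agent.destroy(); // close the idle socket when done
})();
```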
72. Crawler - HTTP/1.1 and HTTP/2
When I first started writing this presentation, none of the most popular
search engine crawlers were using HTTP/2 to make requests.
@giacomozecchini
73. Crawler - HTTP/1.1 and HTTP/2
I also remembered a tweet from Google’s John Mueller:
@giacomozecchini
74. Crawler - HTTP/1.1 and HTTP/2
Instead of thinking “How can crawlers benefit from using HTTP/2?”, I
started my research from the (wrong) conclusion: crawlers have no
advantages in using HTTP/2.
But then Google published this article:
Googlebot will soon speak HTTP/2.
https://webmasters.googleblog.com/2020/09/googlebot-will-soon-speak-http2.html
@giacomozecchini
77. Crawler - HTTP/1.1 and HTTP/2
How can crawlers benefit from using HTTP/2?
From the article, the most prominent of the many benefits of using H2
include:
● Multiplexing and concurrency
● Header compression
● Server push
@giacomozecchini
78. Crawler - HTTP/1.1 and HTTP/2
Multiplexing and concurrency
What they were achieving with multiple robots (or threads), each one
with a single HTTP/1.1 connection, becomes possible with a single (or
fewer) HTTP/2 connection carrying multiple parallel requests.
Crawl Rate HTTP/1.1: max number of parallel connections
Crawl Rate HTTP/2: max number of parallel requests
@giacomozecchini
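A minimal sketch with Node's built-in http2 module (host and paths are hypothetical): one connection, several concurrent streams.

```typescript
import http2 from "node:http2";

// One TCP + TLS connection to the origin...
const session = http2.connect("https://www.example.com");

function fetchPath(path: string): Promise<string> {
  return new Promise((resolve, reject) => {
    const stream = session.request({ ":path": path });
    let body = "";
    stream.setEncoding("utf8");
    stream.on("data", (chunk) => (body += chunk));
    stream.on("end", () => resolve(body));
    stream.on("error", reject);
  });
}

(async () => {
  // ...carrying many requests in parallel as multiplexed streams,
  // where HTTP/1.1 needed one connection (or robot) per request.
  await Promise.all(["/", "/products", "/blog"].map(fetchPath));
  session.close();
})();
```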
79. Crawler - HTTP/1.1 and HTTP/2
Header Compression
HTTP/2’s HPACK compression algorithm reduces HTTP header sizes,
saving bandwidth.
HPACK is even more effective for crawlers than for browsers: crawlers
are stateless, send mostly the same HTTP headers with every request,
and might also request multiple pages (and assets) over one H2
connection.
@giacomozecchini
80. Crawler - HTTP/1.1 and HTTP/2
Server push
“This feature is not yet enabled; it's still in the evaluation phase. It may
be beneficial for rendering, but we don't have anything specific to say
about it at this point.”
Google makes massive use of caching, and this seems like a really good
reason not to use server push. I guess they will probably never enable it.
@giacomozecchini
82. Crawler - HTTP/1.1 and HTTP/2
Server push
We too often look at protocols in a browser-centric way, forgetting
that other people might use a specific feature in a beneficial way.
E.g. REST APIs and server push.
@giacomozecchini
83. Crawler - HTTP/1.1 and HTTP/2
Why did it take Google so long to adopt HTTP/2?
● Wide support and maturation of the protocol
● Code complexity
● Regression testing
@giacomozecchini
84. WRS (Web Rendering Service)
Google is using a Web Rendering Service in order to render pages for
Search. It’s based on the Chromium rendering engine and is regularly
updated to ensure support for the latest web platform features.
https://webmasters.googleblog.com/2019/05/the-new-evergreen-googlebot.html
@giacomozecchini
86. WRS
● Doesn’t obey HTTP caching rules
WRS caches every GET request for an undefined period of time (it
uses an internal heuristic)
● Limits the number of fetches
WRS might stop fetching resources after a number of requests or a
period of time. It may not fetch known Analytics software.
● Built to be resilient
WRS will process and render a page even if some fetches fail
● Might interrupt scripts (excessive CPU usage, error loops, etc.)
@giacomozecchini
100. Cache and Rendering
WRS caches everything without respecting HTTP caching rules.
Using fingerprinting for file names and defining a cache-busting strategy
is the way to go: bundle.ap443f.js
E.g. bundle.js will be cached for an undefined period of time (days,
weeks, months) and will be used for rendering even if you change the
code.
@giacomozecchini
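A minimal sketch of a fingerprinting step (bundlers like webpack do this for you with [contenthash]; the file names here are hypothetical): the name is derived from the content, so any code change yields a new URL that a stale cache entry can never answer for.

```typescript
import { createHash } from "node:crypto";
import { copyFileSync, readFileSync } from "node:fs";

// Derive the file name from a hash of its content: change the code,
// change the URL - cached copies of the old file can't mask new code.
function fingerprint(file: string): string {
  const hash = createHash("sha256")
    .update(readFileSync(file))
    .digest("hex")
    .slice(0, 8);
  const out = file.replace(/(\.\w+)$/, `.${hash}$1`);
  copyFileSync(file, out);
  return out;
}

console.log(fingerprint("bundle.js")); // e.g. bundle.3fa1c2d4.js
```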
101. Crawl Rate and Rendering
Crawl Rate is shared between crawlers, and the requests the crawler
makes on behalf of WRS are no exception. If the server slows down
during rendering, the Crawl Rate will decrease and rendering may fail.
That said, rendering is quite resilient and may retry later.
Tip: monitor server response time.
@giacomozecchini
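A minimal sketch of that monitoring tip, using plain node:http (no framework assumed): log the time to each response so sustained slowdowns and 5xx spikes, the signals that throttle the Crawl Rate, are visible.

```typescript
import http from "node:http";

http
  .createServer((req, res) => {
    const start = process.hrtime.bigint();
    res.on("finish", () => {
      const ms = Number(process.hrtime.bigint() - start) / 1e6;
      // Ship these to your monitoring system and alert on sustained
      // spikes or rising 5xx rates - both slow crawling down.
      console.log(
        `${req.method} ${req.url} ${res.statusCode} ${ms.toFixed(1)}ms`
      );
    });
    res.end("ok");
  })
  .listen(8080);
```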
102. Politeness and Rendering
Robots.txt can block a crawler from requesting a specific part of a
website. What can go wrong?
● If you are blocking a specific file, it won’t be fetched and used
● If you have a JS script with a fetch/retry loop for a resource that is
blocked by a rule in your robots.txt, that script will be interrupted
(see the sketch below)
@giacomozecchini
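A defensive sketch of the second point (the resource URL is hypothetical): cap the retries so a fetch that robots.txt makes permanently unreachable degrades gracefully instead of looping until WRS interrupts the whole script.

```typescript
// Bounded retry: a robots.txt-blocked resource will *always* fail for
// WRS, so an unbounded retry loop would spin until it is interrupted.
async function fetchWithRetry(
  url: string,
  maxAttempts = 3
): Promise<Response | null> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const res = await fetch(url);
      if (res.ok) return res;
    } catch {
      // network error - fall through and try again
    }
  }
  return null; // give up gracefully after maxAttempts
}

async function enhancePage() {
  const res = await fetchWithRetry("/api/recommendations.json");
  if (res) {
    // enhance the page; core content should already be in the HTML
  }
}
enhancePage();
```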
103. CPU usage and Rendering
WRS limits CPU consumption and can stop scripts that run excessively.
Performance matters: you should analyse runtime performance, debug
issues, and remove bottlenecks.
@giacomozecchini
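One way to see where that CPU goes, sketched below: a PerformanceObserver for long tasks (main-thread blocks over 50 ms), which are exactly the kind of excessive script run that can get interrupted.

```typescript
// Surface long tasks (>50 ms of blocked main thread) so runtime
// bottlenecks can be found and removed.
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.warn(`Long task: ${entry.duration.toFixed(0)} ms`);
  }
});
// buffered: true also reports long tasks from before observing started
observer.observe({ type: "longtask", buffered: true });
```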
104. Third-party stuff
Third parties can cause a few problems:
● Resources can be blocked through the robots.txt on their domains
● Request timeouts and connection errors
@giacomozecchini
105. Cookies
Cookies, local storage, and session storage are enabled but cleared
across page loads.
If you check for a specific cookie to decide whether to redirect a user
to a welcome page, WRS won’t be able to render those pages.
@giacomozecchini
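A defensive sketch (the cookie name and markup are hypothetical): since WRS starts every load without cookies, keep the real content on the page and treat the welcome as an in-page overlay rather than a redirect.

```typescript
function hasCookie(name: string): boolean {
  return document.cookie.split("; ").some((c) => c.startsWith(`${name}=`));
}

if (!hasCookie("seen_welcome")) {
  // Don't redirect: WRS never has the cookie, so a redirect here would
  // send it to the welcome page on *every* load and hide the content.
  const banner = document.createElement("div");
  banner.className = "welcome-banner";
  banner.textContent = "Welcome! Here's a quick tour...";
  document.body.append(banner);
  document.cookie = "seen_welcome=1; max-age=31536000; path=/";
}
```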
106. Service Workers and Rendering
Service Worker registration promises are refused.
Web Workers are supported.
@giacomozecchini
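Since the registration promise is refused in WRS, a page should never depend on it succeeding; a minimal defensive sketch:

```typescript
if ("serviceWorker" in navigator) {
  navigator.serviceWorker
    .register("/sw.js")
    .then((reg) => console.log("SW registered, scope:", reg.scope))
    .catch(() => {
      // Registration refused (as in WRS): the page must still work
      // without offline support or other service worker features.
    });
}
```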
109. Render Queue and Rendering
Google states that the Render Queue median time is ~5 seconds.
In the past this wasn't true, and pages waited hours or days to be
rendered. This might still be the case for other search engines.
@giacomozecchini
110. Render Queue and Rendering
I believe Google reduced the Render Queue time for two main reasons:
● Freshness
● Errors with assets / dependencies
@giacomozecchini
111. Render Queue and Rendering
When the crawler first requests a page, it also tries to fetch and cache
the assets visible on that page.
During the rendering phase, the bundle.js dependencies are discovered,
requested, and cached.
@giacomozecchini
112. Render Queue and Rendering
But if you delete the dependencies of bundle.js before the rendering
phase, they can’t be fetched even if bundle.js is cached.
I guess this was happening a lot in the past, but it shouldn’t happen
anymore, at least in Google’s WRS, as the time span between the two
phases is very short. Not sure about other search engines yet.
TIP: keep old assets around for a while, even if you're no longer using them.
@giacomozecchini
113. Browser Events and Rendering
WRS Chrome instances don’t scroll or click; if you want to use
JavaScript lazy-loading, use the Intersection Observer (sketch below).
WRS Chrome instances start rendering pages with two fixed viewports
for mobile (412 x 732) and desktop (1024 x 1024).
They then increase the viewport height to a very large number of
pixels (tens of thousands), calculated dynamically on a per-page basis.
@giacomozecchini
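A minimal lazy-loading sketch using the Intersection Observer (the data-src convention is an assumption): because WRS renders with a very tall viewport, images that would need a scroll for users are already "intersecting" for it, so their real sources load without any scroll event.

```typescript
const io = new IntersectionObserver((entries, observer) => {
  for (const entry of entries) {
    if (!entry.isIntersecting) continue;
    const img = entry.target as HTMLImageElement;
    img.src = img.dataset.src!; // real URL kept in data-src until visible
    observer.unobserve(img);
  }
});

document
  .querySelectorAll<HTMLImageElement>("img[data-src]")
  .forEach((img) => io.observe(img));
```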
117. Debugging Rendering problems
In the “page resources” tab, you shouldn't worry if there are errors for
font, image, and analytics JS files. Those files are not requested in
the rendering phase.
@giacomozecchini
118. Debugging Rendering problems
If you don’t have Search Console access, you can use the Mobile-Friendly Test.
WARNING
The Mobile-Friendly Test, Search Console Live Test, AMP Test, and Rich
Results Test use the same infrastructure as WRS, but they bypass the cache
and use stricter timeouts than Googlebot / WRS, so final results can be
very different.
https://youtu.be/24TZiDVBwSY?t=816
@giacomozecchini