Rick Viscomi and Paul Calvano discuss tracking the performance of the web with HTTP Archive, including how the tool works and some deeply technical case studies that show off the powerful insights to be learned.
5. 5
How it Works
● Alexa’s top 500,000 websites
○ Home pages
○ Desktop and emulated mobile
● Powered by WebPageTest
○ Records HAR trace
○ Executes custom metrics
○ Records Lighthouse audits
● httparchive.org
○ Trends and stats
○ Discussion forum
● BigQuery and Cloud Storage
○ Queryable database
○ Raw HARs
9. 9
goo.gl/kxgzM1HTTP Archive referenced in research papers
. . .
In this article we utilize
the httparchive.org
[9] publicly available
dataset of captured
web performance
metrics
. . .
Desktop and mobile web page
comparison: characteristics,
trends, and implications
IEEE Communications Magazine (
Volume: 52, Issue: 9, September 2014 )
. . .
Recent stats from
httparchive.org show
that the top 300K URLs
in the world need on
average 38(!) TCP
connections to display
the site
. . .
HTTP2 explained
Computer Communication Review 44.3
(2014): 120-128.
. . .
We make extensive use
of the [...] data
available at HTTP
Archive to expose the
characteristics of 3rd
Party assets embedded
into the top 16,000
Alexa webpages
. . .
Are 3rd Parties Slowing Down the
Mobile Web?
Proceedings of the Eighth Wireless of
the Students, by the Students, and for
the Students Workshop. ACM, 2016.
14. 14
Last Mile Acceleration (LMA)
● Gzip Compression at the Edge.
● Compression is based on HTTP Content Type
● Default Content Types were not sufficient and usually required
updating…
● When Property Manager was released, we decided to update this...
15. 15bit.ly/2hXNjbEQuerying the HTTP Archive for Compression Stats
SELECT mimeType, COUNT(*) total,
SUM(IF(resp_content_encoding = "gzip",1,0)) gzip,
SUM(IF(resp_content_encoding = "deflate",1,0)) deflate,
SUM(IF(resp_content_encoding IN("gzip", "deflate"),0,1)) NoCompression,
ROUND(
SUM(
IF(resp_content_encoding IN("gzip", "deflate"),1,0)
) / COUNT(*),2) CompressedPercentage
FROM httparchive.runs.2017_09_01_requests
GROUP BY mimeType
HAVING total > 10000
ORDER BY CompressedPercentage DESC
1
2
3
4
5
6
7
8
9
10
11
12
16. 16
Some New LMA Defaults
● text/javascript
● font/ttf
● application/javascript
● text/xml
● application/json
● application/xml
● ...
Many Content-Types that
did not match the original
defaults!
18. 18
What About Brotli?
● New compression algorithm developed by Google researchers
● 5% - 25% Reduction over Gzip Compression
● Supported by most browsers
● Let’s extend the previous query to include Brotli compression
19. 19
SELECT mimeType, count(*) total,
SUM(IF(resp_content_encoding = "gzip",1,0)) gzip,
SUM(IF(resp_content_encoding = "br",1,0)) brotli,
SUM(IF(resp_content_encoding = "deflate",1,0)) deflate,
SUM(IF(resp_content_encoding IN("gzip",”deflate",“br”),0,1)) NoCompression,
ROUND(
SUM(
IF(resp_content_encoding IN("gzip", "deflate", "br"),1,0)
) / COUNT(*),2) CompressedPercentage,
ROUND(
SUM(
IF(resp_content_encoding = "br",1,0)
) / COUNT(*),2) BrotliCompressedPercentage
FROM httparchive.runs.2017_09_01_requests
GROUP BY mimeType
HAVING total > 1000
ORDER BY brotli DESC
bit.ly/2fUK6W9Compression Stats - gzip and brotli
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
20. 20
Examining Compression By Content Type - With Brotli
Brotli use has been growing, but where is it most prevalent?
21. 21
Brotli Usage: Mostly JavaScript, CSS and HTML Resources
When we exclude Google and Facebook content, the bulk of Brotli encoded content is JS and CSS
22. 22
Compression Level = Overhead
Data on compression speeds from quixdb.github.io/squash-benchmark/#results
Most byte savings are
obtained by using the
highest compression level.
23. 23
Resource Optimizer: Automated Brotli Compression at the Edge
● Automatically compresses CSS and JS
● Resources are Compressed Offline and Cached
● Brotli Compression Level 11, without the overhead!
+ =
24. 24
Case Study #2
Server Technologies
1. Akamai Varnish Connector
2. Security Vulnerability Research
26. 26
Explore the Data…
● Schema and Preview tabs will show you what data is captured.
● For this example, we are interested in anything that will indicate whether a webpage is using
Varnish.
27. 27
These Columns Contain Some Useful Information
We are interested in firstHTML where cdn=Akamai and some of the other headers indicate Varnish is in use…
28. 28
How Many Akamai Customers Use Varnish at the Origin?
● HTTP Archive data helped determine what proportion of Akamai’s customers were using Varnish at
the origin.
● Provided a list of sites for Akamai Product Management to discuss desirable functionality with
customers.
29. 29
Investigating Security Threats
● HTTP Archive
○ Investigate other sites that contain similar characteristics
■ Server, Via, Url Regex Patterns, Other Headers
○ Export a list of sites that appear vulnerable
○ Cross Reference with Akamai Account Data
■ Notify 24x7 Security Contacts,
■ Help customers proactively protect themselves
● No CVE or Historical Attack Patterns
○ Target attack observed and mitigated (Kona Managed Security)
○ Akamai WAF rules prepared
○ Vendor notified
● Example: 0 Day Vulnerability on an Ecommerce App Server
30. 30
Taking HTTP Archive + Security to the Next Level
● HTTP Archive tracks Lighthouse reports for mobile sites
● As of Lighthouse 2.5.0 (Oct 2017), Lighthouse includes Synk security audits.
31. 31
Case Study #3
Third Party Research
1. How 3rd Parties Influence Render Time?
2. Researching a specific 3rd party
32. 32bit.ly/2xqlorDWhich 3rd Party Content Types Load Before Render Start?
SELECT mimeType,
COUNT(*) num_requests,
SUM(
IF(req.startedDateTime < (pages.startedDateTime + (renderStart/1000) ),1,0)
) BeforeRenderStart,
SUM(
IF(req.startedDateTime < (pages.startedDateTime + (renderStart/1000) ),0,1)
) AfterRenderStart
FROM httparchive.runs.2017_09_01_requests req
JOIN (
SELECT rank, NET.HOST(url) hostname, url, pageid, startedDateTime, renderStart
FROM httparchive.runs.2017_09_01_pages
) pages ON pages.pageid = req.pageid
WHERE NET.HOST(req.url) != pages.hostname AND rank > 0 AND rank < 100000
GROUP BY mimeType
HAVING num_requests > 1000
ORDER BY BeforeRenderStart DESC
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
33. 33
What 3rd Party Content Loads Before RenderStart?
discuss.httparchive.org/t/which-3rd-party-content-loads-before-render-start/1084
34. 34bit.ly/2glVquXWhich 3rd Parties Load Before Render Start?
SELECT NET.HOST(req.url) thirdparty,
mimeType,
COUNT(*) num_requests,
SUM(
IF(req.startedDateTime < (pages.startedDateTime + (renderStart/1000) ),1,0)
) BeforeRenderStart,
SUM(
IF(req.startedDateTime < (pages.startedDateTime + (renderStart/1000) ),0,1)
) AfterRenderStart
FROM httparchive.runs.2017_09_01_requests req
JOIN (
SELECT rank, NET.HOST(url) hostname, url, pageid, startedDateTime, renderStart
FROM httparchive.runs.2017_09_01_pages
) pages ON pages.pageid = req.pageid
WHERE NET.HOST(req.url) != pages.hostname AND rank > 0 AND rank < 100000
GROUP BY thirdparty, mimeType
HAVING num_requests > 100
ORDER BY BeforeRenderStart DESC
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
36. 36bit.ly/2xsz00PWhich Websites Load <3rd Party> Before vs After Render Time?
SELECT rank, site,
SUM(
IF(req.startedDateTime < (pages.startedDateTime + (renderStart/1000) ),1,0)
) BeforeRenderStart,
SUM(
IF(req.startedDateTime < (pages.startedDateTime + (renderStart/1000) ),0,1)
) AfterRenderStart
FROM httparchive.runs.2017_09_01_requests req
INNER JOIN (
SELECT rank, NET.HOST(url) site, pageid, startedDateTime, renderStart
FROM httparchive.runs.2017_09_01_pages
) pages ON pages.pageid = req.pageid
WHERE NET.HOST(req.url) LIKE "%ensighten.com%" AND rank > 0
GROUP BY rank, site
HAVING AfterRenderStart>0
ORDER BY rank ASC
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
37. 37
What 3rd Party Content Loads Before RenderStart?
● This query outputs a summary
containing the following
information:
○ Which sites use the 3rd
party?
○ How many resources are
served by it?
○ How many of them are
loaded before/after the page
renders?
● Results can help sites learn from
each other’s best practices
○ Even across industries!!!