There is a lot to cover about SEO for large/enterprise websites.
In this talk we'll focus primarily on data analysis and the technical SEO side of things; future presentations will look at more.
69. Part 1: Data Studio
Part 2: Day by day data
Part 3: Python
Part 4: Data warehousing
Get → Get, Analyse → Get, Store, Analyse, Report
70. Part 1: Data Studio
71. Data Studio for extracting data
● Add a data source.
● Create a table for it.
● Download the table.
With both GA & GSC, you'll get everything in the table, no paginating.
72. Part 2: Day by day data
73. Day by day data
To get even more data we have to get it day by day.
● bit.ly/search-console-data-downloader
This bit is Search Console only.
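For readers who want to roll their own, here's a minimal sketch of a day-by-day pull via the Search Console API (not the downloader linked above; the credentials setup, site URL, and dimensions are assumptions):

```python
# A minimal sketch of pulling Search Console data one day at a time.
# Assumes you've already set up OAuth credentials; the site URL is a placeholder.
from datetime import date, timedelta

from googleapiclient.discovery import build  # pip install google-api-python-client

def daily_search_analytics(service, site_url, start, end):
    """Yield (day, rows) for each day in [start, end]."""
    day = start
    while day <= end:
        body = {
            "startDate": day.isoformat(),
            "endDate": day.isoformat(),  # a one-day window gives per-day data
            "dimensions": ["query", "page"],
            "rowLimit": 25000,
        }
        response = service.searchanalytics().query(
            siteUrl=site_url, body=body
        ).execute()
        yield day, response.get("rows", [])
        day += timedelta(days=1)

# service = build("searchconsole", "v1", credentials=creds)  # creds from your OAuth flow
# for day, rows in daily_search_analytics(service, "https://example.com/",
#                                         date(2021, 1, 1), date(2021, 1, 31)):
#     print(day, len(rows))
```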
74. Part 1: Data Studio
Part 2: Day by day data
Part 3: Python
Part 4: Data warehousing
77. Getting data from APIs
Pull down your analytics data.
● Daily_google_analytics_v3
● Getting search console data from the API
Getting started with pandas:
● Pandas tutorial with ranking data
As a workflow I'd highly recommend Jupyter notebooks for getting started.
● Why use Jupyter notebooks?
● SearchLove Video (paid)
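To make the pandas step concrete, here's a small sketch turning Search Console API rows into a DataFrame; the record fields follow the API's response shape, and the groupby at the end is just an example:

```python
# A small sketch: Search Console API rows -> pandas DataFrame.
# Assumes rows shaped like the API response: {"keys": [query, page], "clicks": ..., ...}
import pandas as pd

def rows_to_dataframe(rows):
    records = [
        {
            "query": row["keys"][0],
            "page": row["keys"][1],
            "clicks": row.get("clicks", 0),
            "impressions": row.get("impressions", 0),
            "ctr": row.get("ctr", 0.0),
            "position": row.get("position", 0.0),
        }
        for row in rows
    ]
    return pd.DataFrame.from_records(records)

# df = rows_to_dataframe(rows)
# df.groupby("page")[["clicks", "impressions"]].sum().sort_values("clicks", ascending=False)
```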
78. SEO Pythonistas
A memorial, and soon-to-be collection, of Hamlet's excellent work.
SEO Pythonistas - In loving memory of Hamlet Batista
@DataChaz
79. Part 4: Data warehousing
104. Hi x
I'm {x} from {y}. We've been asked to do some log analysis to better understand how Google is behaving on the website, and I was hoping you could help with some questions about the log set-up (as well as with getting the logs!).
What time period do we want?
What we'd ideally like is 3-6 months of historical logs for the website. Our goal is to look at all the different pages search engines are crawling on the website, discover where they're spending their time, the status code errors they're finding, etc.
We can absolutely do the analysis with a month or so (we've even done it with just a week or two), but it means we lose historical context, and obviously we're more likely to miss things on a larger site.
There are also some things that are really helpful for us to know when getting logs.
Do the logs have any personal information in them?
We're only concerned with search crawler bots like Google and Bing; we don't need any logs from users, so any logs with emails, telephone numbers, etc. can be removed.
Can we get logs from as close to the edge as possible?
It's pretty likely you've got a couple of different layers of your network that might log. Ideally we want logs from as close to the edge as possible. This prevents a couple of issues:
● If you've got caching going on, like a CDN or Varnish, and we get logs from behind it, we won't see any of the requests it answers.
● If you've got a load balancer distributing to several servers, sometimes the external IP gets lost (perhaps X-Forwarded-For isn't working), which we need to verify Googlebot, or we accidentally get logs from only a couple of the servers.
Are there any sub-parts of your site which log to a different place?
Have you got anything like an embedded WordPress blog which logs to a different location? If so, we'll need those logs as well. (Although of course if you're sending us CDN logs this won't matter.)
How do you log hostname and protocol?
It's very helpful for us to be able to see hostname & protocol. How do you distinguish those in the log files?
Do you log HTTP & HTTPS to separate files? Do you log hostname at all?
This is one of the problems that's often solved by getting logs closer to the edge: while many servers won't give you those by default, load balancers and CDNs often will.
Where would we like the logs?
In an ideal world, they would be files in an S3 bucket and we can pull them down from there. If possible, we'd also ask that multiple files aren't zipped together for upload, because that makes processing harder. (Compressed logs are no problem; it's just zipping multiple log files into a single archive that causes issues.)
Is there anything else we should know?
Best,
{x}
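On the point above about needing the external IP to verify Googlebot: the standard check is reverse DNS followed by a forward-confirming lookup. A minimal sketch, standard library only:

```python
# A minimal sketch of verifying Googlebot by reverse + forward DNS,
# the check Google documents for confirming crawler IPs.
import socket

def is_googlebot(ip: str) -> bool:
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
    except socket.herror:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        # Forward-confirm: the claimed hostname must resolve back to the same IP.
        return socket.gethostbyname(hostname) == ip
    except socket.gaierror:
        return False

# print(is_googlebot("66.249.66.1"))  # a known Googlebot range; result depends on live DNS
```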
107. Sampling your crawl
● Limit your crawl percentage per template (see the sketch below), e.g.:
● 20% to product pages
● 30% to category pages
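As an illustration of per-template sampling (the regex patterns, percentages, and file names below are placeholders, not the deck's tooling):

```python
# A minimal sketch of sampling a URL list per template before crawling.
# The patterns and rates are placeholders for your own site's templates.
import random
import re

TEMPLATES = [
    (re.compile(r"/product/"), 0.20),   # crawl 20% of product pages
    (re.compile(r"/category/"), 0.30),  # crawl 30% of category pages
]

def sample_urls(urls, default_rate=1.0, seed=42):
    """Keep each URL with its template's probability; unmatched URLs use default_rate."""
    rng = random.Random(seed)  # seeded so the sample is reproducible
    for url in urls:
        rate = next((r for pattern, r in TEMPLATES if pattern.search(url)), default_rate)
        if rng.random() < rate:
            yield url

# with open("all_urls.txt") as f, open("crawl_list.txt", "w") as out:
#     out.writelines(sample_urls(f))
```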
108. Low memory crawler
Runs locally on your machine and allows you to crawl with a very low memory footprint. Doesn't render JS or process data, however.
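The pattern behind any low memory crawler is streaming each result straight to disk rather than holding the crawl in RAM. A rough sketch of that idea, assuming requests and no JS rendering:

```python
# A rough sketch of a low-memory crawl loop: write each result to disk
# immediately instead of accumulating results in RAM. No JS rendering.
import csv
import re
from collections import deque
from urllib.parse import urljoin, urlparse

import requests  # pip install requests

HREF_RE = re.compile(r'href="([^"#]+)"')  # crude link extraction for the sketch

def crawl(start_url, out_path="crawl.csv", max_pages=1000):
    domain = urlparse(start_url).netloc
    frontier, seen = deque([start_url]), {start_url}
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["url", "status"])
        while frontier and len(seen) <= max_pages:
            url = frontier.popleft()
            try:
                resp = requests.get(url, timeout=10)
            except requests.RequestException:
                continue
            writer.writerow([url, resp.status_code])  # streamed to disk, not stored
            for href in HREF_RE.findall(resp.text):
                link = urljoin(url, href)
                if urlparse(link).netloc == domain and link not in seen:
                    seen.add(link)
                    frontier.append(link)

# crawl("https://example.com/")
```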
109. Run SF in the cloud
You can purchase a super-high-memory computer in the cloud, install SF on it, and run it at maximum speed.
126. Element Equals
Title: Big Brown Shoe - £12.99 - Example.com
Status Code: 200
H1: Big Brown Shoe
Canonical: <link rel="canonical" href="https://example.com/product/big-brown-shoe" />
CSS Selector #review-counter: Any number
CSS Selector #product-data:
{
  "@context": "https://schema.org/",
  "@type": "Product",
  "name": "Big Brown Shoe",
  "description": "The biggest brownest shoe you can find.",
  "sku": "0446310786",
  "mpn": "925872"
}
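To show how checks like these might run, here's a small sketch with requests and BeautifulSoup; the selectors and expected values mirror the table above, but the library choice and URL are assumptions, not the tool behind the slide:

```python
# A small sketch of "element equals" checks against a live page.
# The expected values mirror the table above.
import json
import re

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def check_page(url):
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    h1 = soup.find("h1")
    canonical = soup.find("link", rel="canonical")
    counter = soup.select_one("#review-counter")
    product = soup.select_one("#product-data")
    return {
        "status_code": resp.status_code == 200,
        "title": title == "Big Brown Shoe - £12.99 - Example.com",
        "h1": h1 is not None and h1.get_text(strip=True) == "Big Brown Shoe",
        "canonical": canonical is not None
            and canonical.get("href") == "https://example.com/product/big-brown-shoe",
        # "#review-counter" should contain any number
        "review_counter": counter is not None
            and re.fullmatch(r"\d+", counter.get_text(strip=True)) is not None,
        # "#product-data" should hold the Product JSON-LD
        "product_data": product is not None
            and json.loads(product.get_text()).get("@type") == "Product",
    }

# print(check_page("https://example.com/product/big-brown-shoe"))
```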