
Technical SEO: What I learned from crawling 10 billion pages and analysing 5 trillion log lines

Slides from Francois Goube's talk at BrightonSEO, September 2018. Francois is the founder of the technical SEO data platform Oncrawl.com.
You will find actionable insights about Googlebot behavior, from crawl budget to how the influence of ranking factors varies with the type of website you have, plus fun facts and weird facts about technical SEO myths.

Published in: Internet

  1. By @FrancoisGoube, CEO @Oncrawl: What I learned crawling 10 billion URLs and analyzing 5 trillion log lines
  2. Most Advanced Technical SEO Data Platform
  3. I ❤️ to Read Google Patents
  4. I run tests with data
  5. “Au Menu”: insights, fun facts and weird facts; best practices; demystifying some SEO myths
  6. Data I gathered: crawl and logs. (1) I crawled 250k random websites from the Majestic Million, up to page depth level 5. (2) I used the data of 97 Oncrawl customers, with their agreement (sites from 10k to 100M+ URLs). ~8B URLs + ~2B URLs = ~10B URLs. (3) I looked deeply into the log data from these 97 customers: ~230 websites over 365 days = ~5 trillion log lines.
  7. I needed to map the websites, so I asked our engineers to run a machine learning model based on gradient boosting (regression trees) to classify them. [Charts: distribution by number of websites; distribution by number of URLs]
  8. What we learned from log file analysis
  9. Internet traffic is made of bots. Dataset: top 100 websites when the 230 websites are ordered by TF.
  10. Internet traffic is made of bots. Dataset: top 100 websites when the 230 websites are ordered by TF.
  11. Internet traffic is made of bots. Dataset: top 100 websites when the 230 websites are ordered by TF.
  12. Things to know about crawl budget: crawl budget looks like a zero-sum game. Your paid campaigns might hurt your crawl budget; the same behavior was observed on 92% of the websites in the test. [Chart: standard day vs. paid campaign day] Dataset: 230 Oncrawl-monitored websites.
  13. Freshrank is the average time between two events: (1) Google crawls the page for the first time; (2) Google sends the first organic visit. How long does it take to get a visit on a new page? [Chart: average Freshrank (days) by website category: Ecommerce, Classifieds, Media, Ecommerce Niche Player, Media Niche Player] [Chart: average Freshrank vs. page depth (1 to 3, 3 to 5, over 5) by website category] Dataset: 230 Oncrawl-monitored websites. If you want to drive organic traffic quickly, pay close attention to page depth analysis.
  14. State of the Mobile-First Index: how do you know if you are being migrated to the mobile-first index? Simply compare web (desktop) vs. mobile hits from Googlebots.
  15. State of the Mobile-First Index: are my competitors already in? What's the state of my market?
  16. State of the Mobile-First Index: yes, UK won…
  17. How often does Googlebot render JS? We checked only websites without any pre-rendering solution (yes, there are some…). On average, Googlebots rendering JS crawl these websites every 24 days.
  18. Insights from combined analysis: analytics vs. crawl vs. logs vs. rankings
  19. The best-ranked pages are not always the most crawled by Google.
  20. Do people really care about 3XX, 4XX & 5XX? [Chart: share of URLs (0% to 100%) by website category: Ecommerce, Classifieds, Media, Ecommerce Niche Player, Media Niche Player; series: 3XX, 4XX, 5XX, indexable URLs, not compliant URLs]
  21. Impact of 3XX or 4XX on Googlebots: the errors and redirects encountered by Googlebots directly impact your crawl budget and how Google sees your website.
  22. Impact of 5XX errors. [Chart: 5XX errors on pages ranking on the first page over the last 30 days; categories: lost more than 5 positions, lost less than 5 positions, no loss, gained positions] We never found any direct correlation between 5XX errors and bot behavior… but you might lose some rankings. Dataset: 230 Oncrawl-monitored websites, GSC vs. logs.
  23. Best SEO trick of the year: cloaking 503. When migrating your website, simply cloak your pages for Googlebots with a 503 (Service Unavailable). Googlebot will come back later and won't index your pages. Cloaking is not a crime.
  24. The state of AMP: % of pages implementing AMP. Ecommerce: 0.0004%; News: 0.007%; Classifieds: 0.0002%. Dataset: 1.4 billion compliant pages.
  25. Interesting facts: crawl frequency on AMP pages is always higher than on any other page type; AMP pages take a huge part of your crawl budget; the most advanced players (media) maintain a flat number of AMP pages (~5% of their pages, with a rule depending on publication date). [Chart: AMP vs. not AMP]
  26. The use of structured data: use of Schema.org on articles, product pages and ads. [Chart: share of pages (0% to 100%) by website category: Ecommerce, Classifieds, Media, Ecommerce Niche Player, Media Niche Player; series: 2+ schema types, only 1 schema type, 0 schema types] Dataset: only product pages, article pages and ads, ~900M pages.
  27. The real impact of structured data: pages with structured data get rich snippets, and way better CTR! This is a good way to start predicting your SEO ROI.
  28. Guessing and weird facts
  29. Correlation between payload and crawl ratio. Ecommerce: 0.6; News: 0.2; Classifieds: 0.8.
  30. Classifieds websites: it looks like Google knows each category very well.
  31. Niche ecommerce players: it looks like Google knows each category very well.
  32. Let's look at content size: Google also behaves differently depending on website category for each ranking factor. [Charts: Ecommerce; Classifieds]
  33. Google does not treat your near duplicates the same way depending on how you handled your canonicals. The weird behaviour on near duplicates with bad canonicals.
  34. Distribution of pages: in structure > crawled > ranked > active. [Chart on a 0% to 100% axis; values shown: 8%, 19%, 26%, 32%; series: active pages in structure, pages in structure that are ranked in Google, pages in structure crawled by Google, pages in structure] Use the lookalike method to spot common characteristics of page groups: which pages have metrics similar to those of active pages?
  35. Thank you! @OnCrawl francois@oncrawl.com
  36. 1-month free trial
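
The Freshrank definition on the Freshrank slide (average time between Google's first crawl of a page and the first organic visit it receives) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Oncrawl's implementation: the event tuples, the `googlebot_crawl` / `organic_visit` labels and the function name `average_freshrank_days` are hypothetical, and the log lines are assumed to be already parsed and attributed.

```python
from datetime import datetime
from statistics import mean


def average_freshrank_days(events):
    """Average delay, in days, between first Googlebot crawl and first organic visit.

    `events` is an iterable of (url, timestamp, kind) tuples, where `kind` is
    "googlebot_crawl" or "organic_visit" (hypothetical labels for this sketch).
    """
    first_crawl = {}
    first_visit = {}
    for url, timestamp, kind in events:
        if kind == "googlebot_crawl":
            seen = first_crawl
        elif kind == "organic_visit":
            seen = first_visit
        else:
            continue  # ignore any other kind of hit
        # Keep only the earliest timestamp per URL and per kind.
        if url not in seen or timestamp < seen[url]:
            seen[url] = timestamp

    # Only pages that were crawled and later visited contribute to the average.
    delays = [
        (first_visit[url] - first_crawl[url]).total_seconds() / 86400
        for url in first_crawl
        if url in first_visit and first_visit[url] >= first_crawl[url]
    ]
    return mean(delays) if delays else None


# Made-up events: /a waits 10 days, /b waits 4 days, /c has no visit yet.
events = [
    ("/a", datetime(2018, 9, 1), "googlebot_crawl"),
    ("/a", datetime(2018, 9, 11), "organic_visit"),
    ("/b", datetime(2018, 9, 1), "googlebot_crawl"),
    ("/b", datetime(2018, 9, 5), "organic_visit"),
    ("/c", datetime(2018, 9, 2), "googlebot_crawl"),
]
print(average_freshrank_days(events))  # 7.0
```

Grouping the same computation by website category or by page depth would give the breakdowns shown on that slide; pages that have not yet received an organic visit are left out of the average here, which is a simplifying choice.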

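The mobile-first check from the Mobile-First Index slides (compare web vs. mobile Googlebot hits) can be approximated as below. This is a rough sketch, assuming the user-agent strings have already been extracted from the logs; the helper names `googlebot_kind` and `mobile_share` are hypothetical, and the slides give no threshold at which a site should be considered migrated. User agents can be spoofed, so a real analysis would also verify that the hits come from Google.

```python
from collections import Counter


def googlebot_kind(user_agent):
    """Return "mobile", "desktop", or None if the hit is not from the main Googlebot."""
    # "Googlebot/" matches the main crawler token and skips e.g. Googlebot-Image.
    if "Googlebot/" not in user_agent:
        return None
    # The smartphone crawler announces itself with a mobile browser string.
    return "mobile" if "Mobile" in user_agent else "desktop"


def mobile_share(user_agents):
    """Share of Googlebot hits made by the smartphone crawler (0.0 to 1.0)."""
    counts = Counter(googlebot_kind(ua) for ua in user_agents)
    total = counts["mobile"] + counts["desktop"]
    return counts["mobile"] / total if total else None


DESKTOP = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
MOBILE = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)

hits = [MOBILE, MOBILE, MOBILE, DESKTOP, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"]
print(mobile_share(hits))  # 0.75
```

Tracking this share per day makes a shift from mostly desktop to mostly mobile Googlebot hits easy to spot, both for your own site and, where log data is available, across a market.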