Problems accumulate over time when you migrate sites or carry many legacy domains, subdomains and old code in a website. Signs of poor quality build up because incremental crawling never stops. This is SEO technical debt, which you need to repay to regain good site health and positive quality signals. You can't repay the debt all at once, only in iterative, incremental steps over time.
9. @dawnieando from @MoveItMarketing #StateOfSearch
MARTIN FOWLER TECHNICAL DEBT QUADRANT
Reckless / Deliberate: “We don’t have time for design”
Prudent / Deliberate: “We must launch now and deal with consequences”
Reckless / Inadvertent: “What’s layering?”
Prudent / Inadvertent: “Now we know how we should have done it”
Credit: Martin Fowler Technical Debt Quadrant
10. SEO TECHNICAL DEBT QUADRANT
Axes: RECKLESS DEBT / PRUDENT DEBT and DELIBERATE DEBT / INADVERTENT DEBT
“SEO is dead / doesn’t matter”
“What’s a URL parameter?” “What’s a canonical?” “What’s internationalization?”
“Now we know how we should have done it” (Further learnings were discovered as knowledge grew)
“We must launch now and deal with SEO issues after”
“We’ll SEO ‘it’ later”
11. Project Success === Produce planned deliverables, within budget, on time (including approved changes)
Source: http://4pm.com/2015/09/27/project-failure/
14. SEOs WAIT A LONG TIME FOR DEV CHANGES
Over 40% wait 12 months+ to get their most crucial SEO changes implemented
https://moz.com/blog/how-long-are-seos-waiting-for-their-most-important-changes
15. MAJOR CAUSES OF THIS
“Legacy technology or outdated processes hampering progress”
“The change they want is ‘not possible’ with current platform (37%)”
Source: Will Critchlow, Distilled Research, on Moz Blog, May 2016
https://moz.com/blog/how-long-are-seos-waiting-for-their-most-important-changes
28. EVEN AS FAR BACK AS 2003
“If “change” means “any change”, then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].” (Broder, A.Z., Najork, M. and Wiener, J.L., 2003)
40% of ALL web pages changed weekly
7% of web pages changed 1/3 of their page content or more weekly
29. HOW MUCH BIGGER & MORE DYNAMIC IS THE WEB NOW IN 2017?
http://www.internetlivestats.com/total-number-of-websites/
30. INCREMENTAL CRAWLING NEVER ENDS
“Crawling method based on crawl frequency based on URL historical change & importance rate”
Crawling which never ends (ongoing)
31. CRAWLING FRONTIER
Shestakov, D., 2013, July. Current challenges in web crawling. In International Conference on Web Engineering (pp. 518-521). Springer, Berlin, Heidelberg.
35. PAST DATA IS A GREAT PREDICTOR OF FUTURE DATA
PREDICTION BASED PRIORITY SCHEDULING… WHEN THERE IS CONSISTENCY
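Prediction-based scheduling can be sketched in a few lines of Python. This is an illustration only: the function names and the 1-day and 30-day bounds are hypothetical, and a real crawl scheduler weighs many more signals (importance among them).

```python
def estimate_change_rate(changed_flags):
    """Fraction of past visits on which the page was found to have changed."""
    if not changed_flags:
        return 0.0
    return sum(1 for changed in changed_flags if changed) / len(changed_flags)


def next_visit_interval_days(changed_flags, min_days=1.0, max_days=30.0):
    """Revisit sooner when history says the page changes often (hypothetical bounds)."""
    rate = estimate_change_rate(changed_flags)
    if rate == 0.0:
        # No history, or no change ever observed: fall back to the longest gap.
        return max_days
    # A page that changed on every visit gets min_days; rarer change stretches the gap.
    return min(max_days, max(min_days, min_days / rate))
```

The estimate is only as good as the history is consistent: a URL that has bounced between redirects, errors and content gives the scheduler little to predict from.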
36. ‘Sampling’ in Crawling for Efficiency
‘SMALL TEST VISITS TO A SITE TO UNDERSTAND WHETHER IT IS WORTH CRAWLING & UNDERSTAND URL PATTERNS & RESOURCES THERE’
38. DUSTBUSTER & DUST CRAWLING RULES
DO NOT CRAWL IN THE DUST
BUILDS ‘HINTS’ ON WHAT NOT TO CRAWL
EVERY SITE WILL HAVE ITS OWN CRAWLING RULES
39. Popular CMS ‘Rule Patterns’ (URL Parameters)
ALL WILL HAVE COMMON CANONICALIZATION PATTERNS WHICH CAN BE LEARNED
40. Every Version of Your Past Ecommerce Sites
Had potential to spew… at some point… “Exponentially multiplicative URLs”
DIFFERENT PARAMETERS & URL PATTERNS WHICH ARE LEARNED BY CRAWLERS… AND REMEMBERED… FOREVER
45. LEGACY CODE BASES & DEPRECATED VERSIONS
The hottest job on the block at one point
Once described by W3C Schools as ‘The Developers Dream’
49. GUTENBERG
“WordPress Core is a minefield of design decisions that were made for what WordPress was at the time, and didn’t age well” (Greg Schoppe, 2017)
SOURCE: https://speckyboy.com/meet-greg-schoppe-developer-gutenberg/
50. WordPress now powers 26% of the web
HUGE EXAMPLE OF GENERATIONAL SOFTWARE CRUFT
https://managewp.com/statistics-about-wordpress-usage
52. WELL… WE DID MAKE QUITE A BIT OF CONTENT
http://www.internetlivestats.com/total-number-of-websites/
54. Poor quality content signals build up over time… incremental crawling just keeps on rolling and crawling… and gathering signals
55. SIGNALS OF LOW QUALITY JUST KEEP COMPOUNDING OVER TIME
Source: https://plus.google.com/u/0/+GlennGabe/posts/fXZw2BuSa5B
61. SEMANTIC LOSS
WONKY TOPICAL STRENGTH
HOW MUCH STUFF DID YOU MOVE AROUND OVER THE YEARS?
62. YOU BROKE YOUR SILO STRUCTURE
SEMANTIC LOSS
Image credit: https://www.slideshare.net/patrickstox/nlp-sitemap-smx-2016-patrick-stox-latest-in-advanced-technical-seo
63. YOU BROKE YOUR CORPUS ‘RELATEDNESS’
1st level relatedness
2nd level relatedness
MANY SIGNALS GONE
64. “You shall know a word by the company it keeps” (Firth, 1957)
(ITS CO-OCCURRENCE VECTOR)
71. FUZZY LOGIC – DEGREES OF TRUTH
0.8: Doc ID likely to be a correct URI to choose from term / query cluster
72. Semantics & concept relatedness may be ‘secret sauce’ when it comes to ‘precision’ over ‘recall’
73. TWO-PHASE RANKING IN A SEARCH NODE
Presented by B. Cambazoglu at European Summer School in Information Retrieval 2017 (Cambazoglu, B.B. and Baeza-Yates, R., 2011. Scalability challenges in web search engines. In Advanced topics in information retrieval (pp. 27-50). Springer Berlin Heidelberg.)
75. ‘URL CRUFT’ IS A THING
“characters relevant or meaningful only to the people who created the site, such as implementation details of the computer system which serves the page. Examples of URL cruft include filename extensions such as .php or .html, and internal organizational details such as /public/ or /Users/john/work/drafts/.[9]” (Wikipedia Definition)
76. ALL THE RANDOM URLS YOU CREATED OVER THE YEARS & SITES (EVEN BY ACCIDENT)
77. 410 Gone
§ “Some, we’ll just kill off with a 410…”
§ “Then the URLs will be gone”
78. THE DIFFERENCE BETWEEN HOW GOOGLE TREATS 404 VERSUS 410s
https://www.youtube.com/watch?v=xp5Nf8ANfOw
79. 302 == Default, 301 == Intentional
404 == Default, 410 == Intentional
ARE YOU SURE? MAYBE / YES
“The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed.” (RFC 7231)
https://tools.ietf.org/html/rfc7231#section-6.5.9
81. DO NOT THINK 410s WON’T BE RECRAWLED AGAIN
Source: https://www.docsplace.org/4578/09/410-gone-stops-crawling-dead-urls/
82. In Reality… Gone Is Never Gone
“We knew there was content there at some point so we just swing by every now and then to see if anything came back” (John Mueller, 2016)
83. A URL IS ‘NOT’ CONTENT
IT IS A LOCATION WHERE A RESOURCE LIVES / LIVED
IT MERELY BOILS DOWN TO A DOC ID MAPPED TO TERM IDS IN A MATRIX
84. ZOMBIES ARE NEVER GONE
NO URLS ARE EVER GONE, ONLY THE RESOURCE THERE IS GONE
5 YEARS LATER
https://www.seroundtable.com/google-410-indexing-22584.html
85. HOW ABOUT 14 YEARS LATER?
2 HOURS ALIVE… 14 YEARS LATER
https://www.webmasterworld.com/google/4864613.htm
88. “COOL URIs DON’T CHANGE”
Sir Tim Berners-Lee (Inventor of the World Wide Web)
https://www.w3.org/Provider/Style/URI
Attribution: By Uldis Bojārs (Flickr) [CC BY-SA 2.0 (http://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons
89. YOU END UP WITH A CONGA LINE OF LEGACY URLS, SUBDOMAINS & VARIOUS SITE PROTOCOLS
…In the URL queue
90. URL_SEEN TEST
YOU CAN’T JUST KEEP TRYING TO JUMP THE INDEXING QUEUE EITHER
PUSH INDEXING: E.G. FETCH AS GOOGLEBOT & SUBMIT TO INDEX, XML SITEMAP SUBMISSIONS
PULL INDEXING: VISITS BY NATURAL CRAWLING & DISCOVERY OF URLS / URL VISIT SCHEDULING / REVISITS
93. TWO STEPS FORWARD & ONE STEP BACK
STRONG CANONICAL CONTEXT URL: YES YES YES, NO NO, YES YES YES YES YES, NO NO NO NO
AN OTHER OR MULTIPLE WEAK ALTERNATIVES
94. PAST DATA ON CHANGE IS A GREAT PREDICTOR OF FUTURE DATA
PREDICTION BASED PRIORITY SCHEDULING… WHEN THERE IS CONSISTENCY
“past changes to a page are a good predictor of future changes. This result has practical implications for incremental web crawlers that seek to maximize the freshness of a web page collection or index.” (
97. ‘Transitive’?? - ‘THE WHOLE TREE IS ROTTEN’
Transitive - A == B + B == C then A == C
SAMPLING
For some types of content more than others – e.g. ecommerce/directories but not news
98. CRAWL SAMPLES ALSO HELP WITH MODELLING TO MAP DOCS TO TOPIC RELEVANCE & RELATEDNESS
102. SOME SOLUTIONS
§ How can you change the hints associated with your site for better rankings and SEO?
105. THE BOSTON MATRIX & SEO
PRIORITIZE THE DEBT
Axes: MARKET GROWTH / MARKET SHARE
Cash Cows: high converting queries & URLs
Stars: high potential. Worth more effort
Question Marks: jury’s out
Dogs: low return, low conversion queries / URLs
106. ADD EVERYTHING TO GSC FROM THE PAST & PRESENT
THERE MAY STILL BE UNDETECTED ACTIVITY GOING ON THERE
107. IDENTIFY PAGES IN QUERY CLUSTERS (QUERY CLASSES & INTENT MEETING SAME INFORMATION NEED CATEGORY)
108. REVIEW RELATIVE IMPORTANCE SIGNALS OF INTERNAL LINKS
ARE THESE REALLY AMONGST YOUR MOST IMPORTANT URLS?
109. BUT… REVIEW THE ‘RELATEDNESS’ OF INTERNAL LINKS TO PAGES
Domain URL: INTERNALLY LINKING PAGES TO THE TARGET URL
IS THE ‘RELATEDNESS’ HIGHLY RELEVANT TO ASSIST WITH CONTEXTUAL & SEMANTIC SIGNALS?
IS RELATEDNESS HIGH?
112. SOME QUESTIONS TO ASK
HOW MANY MICRO-SITES HAVE YOU HAD?
HOW MANY SUBDOMAINS?
HOW MANY OTHER DOMAINS?
WHO IS RESPONSIBLE FOR DOMAIN REGISTRATION?
WHO KNOWS WITHIN THE ORGANISATION?
WHO REGISTERED THE DOMAINS?
WHO CAN UPDATE DNS RECORDS?
ARE THESE SITES STILL ON SERVERS?
HAVE ANY OF THESE SITES HAD MANUAL ACTIONS?
HOW ARE THESE SITES REDIRECTED?
ARE THEY PARKED DOMAINS?
113. NEARLY 1/3 OF SEOs SAY THEY DON’T NEED LOG FILES
Source: https://www.seroundtable.com/poll-log-files-seo-24523.html
114. DIAGNOSE: SERVER LOG FILE ANALYSIS
ANALYSE THE LOGS FOR ‘ALL’ YOUR SITES AND ‘ALL’ PROTOCOLS TO SEE THE CRAWL PATTERNS EMERGE
BUT WATCH OUT FOR OTHER TOOLS EMULATING GOOGLEBOT AND FILTER THEM OUT
NB: YOU MAY BE LOOKING AT URLS QUEUED LONG AGO
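A minimal Python sketch of that log check, assuming logs in the common "combined" format. `verified_hosts` is a hypothetical set of client addresses you have already confirmed as genuine Googlebot by some separate step (for example a reverse DNS lookup, not shown here); anything else that merely claims the Googlebot user agent is filtered out.

```python
import re
from collections import Counter

# Combined Log Format: host ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)


def googlebot_hits(lines, verified_hosts):
    """Count (path, status) pairs for hits that claim Googlebot AND come from a verified host."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # not in the expected log format
        if "Googlebot" not in match["agent"]:
            continue
        if match["host"] not in verified_hosts:
            continue  # user agent says Googlebot, but the host was never verified
        hits[(match["path"], int(match["status"]))] += 1
    return hits
```

Run the same count over the logs of every legacy domain, subdomain and protocol: old URLs still being requested, and the status codes they return, show where the debt sits.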
120. Refactoring Definition
“Refactoring …is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior. Its heart is a series of small behavior preserving transformations.”
https://en.wikipedia.org/wiki/Code_refactoring
121. SEO REFACTORING
HOUSE KEEPING
WORKING ON THE PAST, PRESENT & FUTURE SIMULTANEOUSLY
USING APPROACHES LIKE MoSCoW
ONGOING ITERATIVE IMPROVEMENTS
ONGOING ROLLING AUDITS
PAYING OFF DEBT ‘A BIT AT A TIME’
MARGINAL GAINS
123. HAVE YOUR SAY IN CRAWLING ‘RULES’
Help Google Build ‘Crawling Rules’ for your site rather than wasting time on ‘sampling’ and giving a bad impression
GIVE HELP AND GUIDANCE WITH THE CRAWL RULE AND HINT BUILDING
124. Help Google Build ‘Crawling Rules’ for your site rather than wasting time on ‘sampling’ and giving a bad impression
BE VERY CAREFUL
125. REVISIT ALL PAST .HTACCESS FILES
.HTACCESS SITE 1, .HTACCESS SITE 2, .HTACCESS SITE 3
Can you rewrite the rules to be more efficient with regex or cut out some old rules still firing unnecessarily? (CREATE SHORTCUTS)
REMEMBER .HTACCESS RULES RUN IN ORDER OF THEIR APPEARANCE IN THE FILE. CAN YOU USE WILDCARDS TO OPTIMIZE OR SKIP STEPS?
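One way to find the steps that can be skipped is to flatten redirect chains before rewriting the rules. The Python sketch below assumes you have already exported the redirects from each generation's .htaccess into a plain `old URL -> new URL` mapping (that export step, and the function name, are hypothetical); it then points every old URL straight at its final destination.

```python
def flatten_redirects(redirects):
    """Map every old URL straight to its final target, skipping intermediate hops."""
    flattened = {}
    for start in redirects:
        seen = {start}
        target = redirects[start]
        # Follow the chain until we reach a URL that is not redirected any further.
        while target in redirects:
            if target in seen:
                raise ValueError(f"redirect loop involving {target}")
            seen.add(target)
            target = redirects[target]
        flattened[start] = target
    return flattened
```

For example, `{"/a": "/b", "/b": "/c"}` becomes `{"/a": "/c", "/b": "/c"}`, so a crawler following `/a` needs one hop instead of two, and a loop left behind by an old migration is reported rather than silently kept.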
126. Learn By Heart Regular Expressions & How URLs with Multiple Parameters Are Handled
The most restrictive parameter blocked overrules lesser restrictions
131. REVIEW & UNDERSTAND - THE CANONICAL LINK RELATION In ‘ALL’ its forms
§ 30X redirects
§ Canonical tag
§ Hreflang
§ HTTPS protocol
§ Global canonicalization rules
§ URL normalization
RFC6596
140. SOLUTION: Wiki Page Redirects on Topics
https://dbpedia.org/sparql
Wikipedia Redirects
thesaurus.com
OR A GOOD OLD FASHIONED THESAURUS
142. USE ‘STRONGLY CONNECTED COMPONENTS’ (TOPICAL HUBS FOR FOCUSED CRAWLING) TO REAFFIRM THE SEMANTIC STRENGTH YOU ONCE HAD
143. BUILD RICH CONTENT HUBS FOR PRIMARY TARGET TOPICS
STRONGLY CONNECTED HUB
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. and Wiener, J., 2000. Graph structure in the web. Computer networks, 33(1), pp.309-320.
144. BUILD WELL CATEGORIZED AND CONCEPTUALLY STRUCTURED SITEMAPS
https://www.slideshare.net/patrickstox/nlp-sitemap-smx-2016-patrick-stox-latest-in-advanced-technical-seo
145. XML Sitemaps Are Your Friend… (Strong Foundations)
They help to pass ‘importance’ signals to URLs
But… never leave them to just autogenerate without periodically checking
‘The foundations’ underneath a site
147. EXTERNALLY HOSTED XML SITEMAPS
• Take back control
• Jump the dev queue
• Allows for custom configuration of optimal canonical click paths
• Allows for consistent signals of importance to included URLs
• Forget about setting priority
• Forget about last modified
• Even a simple list of URLs FTW will do
• Keep them organised for granular analysis of problem site sections
154. “It’s simple really, the businesses seeing growth in natural search are those implementing technical changes successfully, the most common cause of decline is either ignoring technical or getting it wrong.” Tim Grice, Branded3, 2017
https://www.branded3.com/blog/link-spam-migration-disasters-penguin-organic-growth-2017/
157. BUT WHEN DATA IS INCONSISTENT FUZZY LOGIC MAY FAIL
‘DEGREES OF TRUTH’ MORE BLURRED / VAGUE
158. 1st level relatedness
A measure of words that directly occur together in a text or ‘corpus’ (collection of documents together)
‘TWO WORDS WHICH TEND TO CO-OCCUR MUST BE RELATED’
CO-OCCURRENCE VECTORS
EXAMPLES: car/automobile, coast/shore, furnace/stove (Miller & Charles, 1991)
159. 2nd level relatedness
Share common words they co-occur with aside from directly co-occurring together (both appear in same types of text as each other === related)
EXAMPLE: FURNACE & OVEN BOTH SHARE HEAT, MOTOR & ROAD, CAR & AUTOMOBILE BOTH SHARE PASSENGERS
CO-OCCURRENCE VECTORS
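The two levels can be illustrated with a small Python sketch. It is a toy under stated assumptions: co-occurrence is counted per document, and cosine similarity is used here as one possible way to compare co-occurrence vectors; the slides do not say which measure a search engine uses, and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cooccurrence_vectors(documents):
    """For each word, count the other words that share a document with it."""
    vectors = {}
    for text in documents:
        words = set(text.lower().split())
        for left, right in combinations(sorted(words), 2):
            vectors.setdefault(left, Counter())[right] += 1
            vectors.setdefault(right, Counter())[left] += 1
    return vectors


def first_level(vectors, a, b):
    """1st level: number of documents in which a and b appear together."""
    return vectors.get(a, Counter())[b]


def second_level(vectors, a, b):
    """2nd level: cosine similarity of the two co-occurrence vectors (shared company)."""
    va, vb = vectors.get(a, Counter()), vectors.get(b, Counter())
    dot = sum(va[w] * vb[w] for w in va if w in vb)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```

With the corpus `["furnace heat", "oven heat", "car road"]`, "furnace" and "oven" never appear together (1st level is 0) yet both keep company with "heat", so their 2nd level relatedness is high, while "furnace" and "car" share nothing. Moving or deleting the pages that supplied that shared company is how a migration can weaken these signals.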
160. MORE Solutions
• Do a bit of ‘up front’ thinking (AVOID TECHNICAL DEBT IN FIRST PLACE)
• Measure SEO technical debt
• Refactor SEO technical debt away
• Reducing SEO technical debt should be inbuilt
• Accept some SEO (least impactful) technical debt is necessary for agility
• ‘Chip away’ at SEO technical debt
161. ‘Fuzzy’ URL Targets with Each Site Generation
EVERYTHING GETS A BIT BLURRED
‘Which is the target URL again?’
162. “The URL page importance score can be retrieved from the … URL history log …or it can be obtained by obtaining the historical page importance score for the URL for a predefined number of prior crawls and then performing a predefined filtering function on those values to obtain the URL page importance score.”
Scheduler for Search Engine Crawler
https://www.google.com/patents/US8042112
IMPORTANCE RECORD | CRAWL 1 | CRAWL 2 | CRAWL 3 | CRAWL 4 | CRAWL 5 | CRAWL 6
DOC ID 1 | 1 | 0.8 | 0.6 | 0.4 | 0.2 | 0
DOC ID 2 | 0 | 0.2 | 0.4 | 0.6 | 0.8 | 1
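The patent text only speaks of "a predefined filtering function" over "a predefined number of prior crawls". As a sketch, assume the simplest possible filter, a plain average over the last three crawls; both the averaging and the window of three are assumptions made for illustration, not what the patent or Google specifies.

```python
def importance_score(history, window=3):
    """Average the most recent `window` importance records (a hypothetical filter)."""
    recent = history[-window:]
    if not recent:
        return 0.0
    return sum(recent) / len(recent)


# Importance records for crawls 1 to 6, taken from the slide above.
doc_1 = [1, 0.8, 0.6, 0.4, 0.2, 0]
doc_2 = [0, 0.2, 0.4, 0.6, 0.8, 1]
```

Under this assumed filter DOC ID 1 ends up at roughly 0.2 and DOC ID 2 at roughly 0.8: an old URL whose importance has been falling still carries some score, which is one way to read the 'fuzzy' target URLs after a migration.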
163. Example MoSCoW Prioritisation
MUST HAVE | SHOULD HAVE | COULD HAVE | WON’T HAVE THIS TIME
Remove infinite loops | Identify ideal click paths | Canonicalize to superset | e.g. Page title rewriting
Redirect true dupes | Check server log files | Upcycle conflict content | e.g. All meta-description
Review parameter handling | Analyse queries on near-dupes | Strengthen categories & subcategories (relevance) | Review ‘added-value’ difference in ‘similars’
Add seasonal & content pieces (topical / evergreen) | Review internal link popularity of important pages | Build topic hub static pages (Strongly connected component) | Add ‘flow’ content to amplify via social
Check soft 404s | Review crawling on near dupes & similars | Sectional content audit | Add site section properties in GSC
Check server errors | Review queries on similars | Add categorized XML sitemaps | Add content from sub to superset & canonicalize
TIME IS OF THE ESSENCE
PAST, PRESENT, FUTURE
164. URL NORMALIZATION
Can be problematic and ‘crufty’ too
https://en.wikipedia.org/wiki/URL_normalization
165. SOLUTION – Think Carefully About Creating New Dynamic Parameters
QUEUEING… AGAIN
Waiting for good URLs to be visited… AGAIN
169. TERM-FREQUENCY INVERSE DOCUMENT FREQUENCY
Architectural, URL, software & content cruft can also skew term-frequency inverse document frequency
AND THE QUERY CLUSTERS DOCUMENTS BELONG TO
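A small Python sketch shows how the skew can arise. It uses the plain textbook form of tf-idf (term share within the document, times the log of collection size over the number of documents containing the term); search engines use many variants, so treat the formula and the function names as assumptions for illustration.

```python
from math import log


def idf(term, documents):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc.lower().split())
    if containing == 0:
        return 0.0
    return log(len(documents) / containing)


def tf_idf(term, doc, documents):
    """Term frequency in one document multiplied by the term's idf over the collection."""
    words = doc.lower().split()
    if not words:
        return 0.0
    return (words.count(term) / len(words)) * idf(term, documents)
```

Take a clean collection `["red shoes", "blue hats", "green coats"]` and then add three crufty URL variants that all serve "red shoes" again. The term "shoes" now appears in four of six documents instead of one of three, so its idf, and with it the tf-idf weight of the page you actually want to rank, drops.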
170. YOU INHERITED SEO TECHNICAL DEBT
• Previous content / link manual actions
• Previous algorithmic suppressions
• Past infinite loops
• “We’ll SEO it after launch”
• “SEO is dead… so we won’t optimise”
• Dodgy URL parameters
• SEO is a ‘one time audit’
• Misconfigured URL parameters
• Old URL crawling ‘rules / hints’
171. CRAWLING PATTERNS ARE DEVELOPED FOR EFFICIENCY - CRAWLERS TAKE ‘HINTS’ AND ‘HINT RANGES’ (rules / patterns)
Help Google Build ‘Crawling Rules’ for your site rather than wasting time on ‘sampling’ and giving a bad impression
GIVE HELP AND GUIDANCE WITH THE CRAWL RULE AND HINT BUILDING
175. History Log Records Include:
• URL fingerprint
• Timestamp (last crawl or download attempt)
• Crawl status (success or error) (Response code)
• Content checksum (binary code)
• Source ID (accessed from cache or downloaded)
• Segment identifier (Crawl segment assigned to??)
• Page importance (a measure of importance assigned to the URL)
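As a rough picture of such a record, here is a Python sketch with one field per bullet. The field names, types and the use of SHA-256 for the fingerprint are all assumptions made for illustration; the slide does not describe how the real records are encoded.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(text):
    """Short, stable hash of a string (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class HistoryLogRecord:
    url_fingerprint: str
    timestamp: float        # last crawl or download attempt
    crawl_status: int       # response code
    content_checksum: str
    source_id: str          # "cache" or "download"
    segment_id: int
    page_importance: float


def content_changed(previous, current):
    """Two records of the same URL with different checksums signal a content change."""
    return (previous.url_fingerprint == current.url_fingerprint
            and previous.content_checksum != current.content_checksum)
```

Note that the record is keyed on the URL, not the content: when the content goes away the record, and its history, can remain.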
182. THINK CAREFULLY ABOUT URL CREATION
Not EVERYTHING is worthy of its own URL
VARIANTS, STEMMINGS, PLURALS, RANDOM TAGS, LONG, LONG, LONG TAIL PARAMETERS
187. MIXED CONTENT & MULTIPLE SITE VERSIONS
BOTH HTTP & HTTPS FIGHTING EACH OTHER
http://www.itv.com/news/
189. 410s DO USE CRAWL BUDGET (MAYBE NOT TOO MUCH ON REVISITS, BUT THESE THINGS ADD UP). THEY ALSO STILL NEED TO BE DISCOVERED WHICH USES BUDGET
https://twitter.com/dawnieando/status/906465965029969920
190. GENERATIONAL CRUFT CAN SNOWBALL
• Past infinite loops
• Dodgy URL parameters
• Misconfigured URL parameters
• Old URL crawling ‘rules / hints’
• Old ‘importance / quality’ scores
• Filtered dupes & near-dupes
• Mixed messaging canonicals
• 410s still being revisited
• Internal links to old sites / protocols
192. The Generational ‘Snail Trail’
Leaving its slithery footprint
• Old XML sitemaps
• Redirects drop away on old site .htaccess
• DNS issues
• People link to old site but wrong protocol
• Old sites no longer verified in GSC
• Not all protocols redirecting
193. History Log Records Include:
• URL fingerprint
• Timestamp (last crawl or download attempt)
• Crawl status (success or error) (Response code)
• Content checksum (binary code)
• Source ID (accessed from cache or downloaded)
• Segment identifier (Crawl segment assigned to??)
• Page importance (a measure of importance assigned to the URL). May be calculated by identifying historical importance scores based on past X number of crawls
194. EVERY SINGLE TIME YOU MIGRATE, CHANGE DESIGN, REDIRECT, REINVENT A SITE / URL
FIRST SITE STRUCTURE: A CLEAN START; CRAWLING ‘RULES’ BUILT; EVERYTHING IS ‘200 OK’
ANOTHER STRUCTURE: REDIRECTIONS; NEW CRAWLING ‘RULES’ BUILT; MORE URLs; MIXED RESPONSE CODES; ‘FUZZINESS’ IS EMERGING
REDIRECTIONS; NEW CRAWLING ‘RULES’ BUILT; MORE URLs; REDIRECT CHAINS & MIXED RESPONSE CODES; NEW SEOs DON’T KNOW THE ‘HISTORY’; TARGET URLs NOW ‘VERY FUZZY’
196. Time Seems To Fly… The Older You Get
Your new site URL is just one of very many historical URLs on your IP to be visited periodically
A tiny fish in a very big URL pond queue
198. A New Beginning
§ “A new website will solve ALL our problems”
§ “Let’s start again”
§ “We’ll just migrate… and redirect everything”
199. A LONG, LONG TIME AGO
• You need to go right back to the beginning
• What domains did the organisation EVER register?
• Where do they redirect to?
• Is it via 301, 302 or are they merely parked domains?
• Who would know? Who is responsible?
• Verify them all in Google Search Console
• Some of these may EVEN HAVE PENALTIES HISTORICALLY
• If there are links to any there is likely still crawling activity there
• Analyse logs across multiple subdomains & protocols
200. SOME TYPES OF URL CRUFT
• INCORRECTLY APPLIED CANONICAL TAGS
• CONFLICTING HREFLANG & CANONICAL TAGS
• MIXED CONTENT
• URL SHORTENERS
• SESSION IDS
• UTM TAGGING
• OLD AJAX FRAGMENTS
• PARAMETERS FROM MULTI FACET DROP DOWN CHOICES
• .html, .php, .index.html, .aspx
• LEGACY URL REWRITING & PARAMETERS IN .HTACCESS FILES
• LEGACY FOLDERS WHICH CONTRIBUTE NO MEANING TO SITE ONTOLOGY
UNCRUFTY: www.myeasyurlwillmakeyouwonder.com/resume
CRUFTY: www.myeasyurlwillmakeyouwonder.com/resume.html
CRUFTY: http://nymag.com/scienceofus/2015/07/how-to-recover-from-an-all-nighter.html?om_rid=AAENcg&om_mid=_BTtFa0B869PyJp&utm_content=buffer8fdd1&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
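Some of this cruft can be removed mechanically. The Python sketch below strips tracking parameters and the fragment from a URL using only the standard library; which parameter prefixes count as cruft (`utm_` and `om_` here) is an assumption based on the example above, and session IDs or facet parameters would need site-specific rules.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter prefixes treated as cruft here; this list is an assumption, not a standard.
TRACKING_PREFIXES = ("utm_", "om_")


def strip_tracking(url):
    """Return the URL without tracking parameters and without a fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    # Rebuild the URL from the untouched parts plus the surviving parameters.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

Applied to the crufty nymag.com example, this leaves just `http://nymag.com/scienceofus/2015/07/how-to-recover-from-an-all-nighter.html`. A parameter that changes the content, such as a hypothetical `colour=red`, is kept.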
202. IT’S VERY IMPORTANT… YOU STAY OUT OF SERVER ERROR STATUS 500
‘Try again’ intervals likely extended between each failed connection attempt
204. LEGACY ISSUES VIA CANONICALS OR REDIRECTION (COMMON MISTAKES)
• PAGE CANONICALIZED TO IS NOT A SUPERSET OR DUPLICATIVE (IT IS NOT RELEVANT ENOUGH)
• 301s TO IRRELEVANT PAGES BECOME SOFT 404s
• FOLDING UP PRODUCT PAGES TO CATEGORIES (PEOPLE WERE LOOKING FOR A SPECIFIC PRODUCT)
• CANONICALIZATION TO PAGES WHICH LATER 301 REDIRECT TO ANOTHER URL, THEREFORE NEGATING THE PAGES CANONICALIZING TO THEM
• CONFLICTS BETWEEN HREFLANG AND CANONICALIZATION
205. SOLUTION: Increase ‘Importance’ quickly of target URLs
• Internal link optimization
• Canonicalise to (if relevant)
• Strengthen up importance signals
• Inclusion in front facing HTML and XML sitemaps
• Improve the content & keep it updated
• 301 redirect to (if relevant redundant content)
• Topical hubs and strong information views to navigate users & add relevance
206. SOLUTION: Reduce ‘Importance’ quickly of old URLs
• Internal link UNOPTIMIZATION
• 410
• Dig out URLs with links to them
• Orphan URLs
• Canonicals to HTTPS
• EXCLUSION from XML sitemaps (even old ones on the server)
• Archiving of content
207. 404 NOT FOUND & 410 GONE
§ “Of course, we won’t redirect everything…”
§ “Not everything will be worth redirecting”
208. “Usually seeing it (410) 1-2 times is enough for us to drop those URLs from the index” John M on Google+ (https://plus.google.com/u/0/+JohnMueller/posts/NEsqE7Sr4Z4)
209. 410s Likely Get Deindexed Quicker
https://plus.google.com/+JohnMueller/posts/NEsqE7Sr4Z4
210. 410 – DOES THAT PAGE NEED TO BE REINDEXED?
“404 vs 410 doesn't affect the recrawl rate: we'll still occasionally check to see if these pages are still gone, especially when we spot a new link to them” John Mueller, Google+ 2015
https://plus.google.com/u/0/+JohnMueller/posts/NEsqE7Sr4Z4
211. The URL Generational ‘Snail Trail’
Leaving its slithery footprint
• Old XML sitemaps
• Badly coded subcategory & attribute parameters
• Redirects drop away on old site .htaccess
• Canonicalizing and then later ‘301ing’ ‘context’ URL (invalid canonical)
• DNS issues
• People link to old site but wrong protocol
• Old sites not verified in GSC
• Not all protocols redirecting
• Relative WordPress URLs appending /wwws on current viewed pages
• JS fired URLs on Language drop down Internationalization crawled
• Legacy Ajax issues with parts of page content pulled
• Canonical URLs NOT a superset or duplicate of canonicals pointing at them
212. INSTEAD OF REMOVE… CONSIDER… DISTRACT & ITERATIVELY IMPROVE
STRATEGIC USE OF INTERNAL LINK POPULARITY
REDUCE IMPORTANCE SIGNALS TO DIFFERENT PAGES
INCLUDE IMPORTANT PAGES IN XML SITEMAPS
EXCLUDE LOW IMPORTANCE PAGES FROM XML SITEMAPS
INCLUDE IMPORTANT PAGES IN HTML SITEMAPS
213. ESPECIALLY IF THERE ARE LINKS TO IT
“404 vs 410 doesn't affect the recrawl rate: we'll still occasionally check to see if these pages are still gone, especially when we spot a new link to them” John Mueller, Google+ 2015
https://plus.google.com/u/0/+JohnMueller/posts/NEsqE7Sr4Z4
215. Aged ‘Patchwork Quilt’ Sites
A LITTLE BIT OF THIS CMS AND A LITTLE BIT OF THAT CMS
MANY HISTORICAL PARAMETERS CREATED & CRAWLING SAMPLE PATTERNS
216. LACK OF PROCESS OR UNDERSTANDING
• Lack of process or understanding
• No or poor documentation to work to
• Insufficient testing facilities & staging / optimizing environments
• Lack of collaboration between depts
• Parallel development & version control issues (too much happening)
• Small improvements left till last
• Business pressures / business case demands
• Insufficient ‘up front’ definition (scope creep)
LOTS OF REASONS FOR TECHNICAL DEBT
218. TECHNICAL DEBT IS NOT ALWAYS ABOUT BAD CODE
IT OFTEN COMES AS A RESULT OF MINIMUM VIABLE PRODUCT
219. THESE THINGS ADD UP
THEY ALSO STILL NEED TO BE DISCOVERED WHICH REQUIRES INITIAL CRAWLING
https://twitter.com/dawnieando/status/906465965029969920
222. Sources & References
Bar-Yossef, Z., Keidar, I. and Schonfeld, U., 2009. Do not crawl in the dust: different urls with similar text. ACM Transactions on the Web (TWEB), 3(1), p.3
Broder, A.Z., Najork, M. and Wiener, J.L., 2003, May. Efficient URL caching for world wide web crawling. In Proceedings of the 12th international conference on World Wide Web (pp. 679-689). ACM
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. and Wiener, J., 2000. Graph structure in the web. Computer networks, 33(1), pp.309-320.
Cambazoglu, B.B. and Baeza-Yates, R., 2011. Scalability challenges in web search engines. In Advanced topics in information retrieval (pp. 27-50). Springer Berlin Heidelberg.
Cho, J., Garcia-Molina, H. and Page, L., 1998. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1), pp.161-172
Fetterly, D., Manasse, M., Najork, M. and Wiener, J., 2003, May. A large-scale study of the evolution of web pages. In Proceedings of the 12th international
223. Sources & References
Grice, T, 2017. Link spam, migration disasters and Penguin is nowhere to be seen - Organic growth in 2017 [ONLINE] Available at: https://www.branded3.com/blog/link-spam-migration-disasters-penguin-organic-growth-2017/. [Accessed 08 October 2017].
Olston, C. and Najork, M., 2010. Web crawling. Foundations and Trends® in Information Retrieval, 4(3), pp.175-246.
Pandey, S. and Olston, C., 2008, February. Crawl ordering by search impact. In Proceedings of the 2008 International Conference on Web Search and Data Mining (pp. 3-14). ACM.
Olston, C. and Pandey, S., 2008, April. Recrawl scheduling based on information longevity. In Proceedings of the 17th international conference on World Wide Web (pp. 437-446). ACM
Pandey, S. and Olston, C., 2005, May. User-centric web crawling. In Proceedings of the 14th international conference on World Wide Web (pp. 401-411). ACM.
224. Sources & References
martinfowler.com. 2009. TechnicalDebtQuadrant. [ONLINE] Available at: https://martinfowler.com/bliki/TechnicalDebtQuadrant.html. [Accessed 03 October 2017].
Malte Ubl on Twitter - https://twitter.com/cramforce/status/897502737268592640
Is Technology Debt Bankrupting Your Competitiveness – Accenture 2017 - https://www.accenture.com/t20170504T221347__w__/ie-en/_acnmedia/PDF-43/Accenture-Strategy-Technology-Debt-PoV.pdf
Project Management Certification. 2015. Project Failure - Why Projects Fail So Often. [ONLINE] Available at: http://4pm.com/2015/09/27/project-failure/. [Accessed 30 September 2017].
Randall, K.H., Google Inc., 2010. Scheduler for search engine crawler. U.S. Patent 7,725,452. https://patentimages.storage.googleapis.com/US8042112B1/US08042112-20111018-D00000.png
225. Sources & References
The Drum. 2017. On trend? The Wow Company reports on what the average UK agency looks like | The Drum. [ONLINE] Available at: http://www.thedrum.com/opinion/2017/04/12/trend-the-wow-company-reports-what-the-average-uk-agency-looks. [Accessed 28 September 2017].