Cruft busting technical debt code smell and refactoring for seo - state of search

@dawnieando from
@MoveItMarketing #StateOfSearch
REDUCING THE BURDEN OF TECHNICAL DEBT IN SEO

@dawnieando from
The Great 302s Pass PageRank Debate

@dawnieando from
MULTIPLE GENERATIONS OF A WEBSITE

@dawnieando from
JUST WHAT IS GENERATIONAL CRUFT?

@dawnieando from
GENERATIONAL CRUFT MAKES CRAWLING, INDEXING, QUERY CLUSTERING & SEMANTIC UNDERSTANDING MORE COMPLEX

@dawnieando from
TECHNICAL DEBT

@dawnieando from
TECHNICAL DEBT

@dawnieando from
MARTIN FOWLER TECHNICAL DEBT QUADRANT
Reckless and deliberate: “We don’t have time for design”
Reckless and inadvertent: “What’s layering?”
Prudent and deliberate: “We must launch now and deal with consequences”
Prudent and inadvertent: “Now we know how we should have done it”
Credit: Martin Fowler, Technical Debt Quadrant

@dawnieando from
SEO TECHNICAL DEBT QUADRANT
Axes: reckless debt vs. prudent debt; deliberate debt vs. inadvertent debt
“SEO is dead / doesn’t matter”
“What’s a URL parameter?” “What’s a canonical?” “What’s internationalization?”
“Now we know how we should have done it” (Further learnings were discovered as knowledge grew)
“We must launch now and deal with SEO issues after”
“We’ll SEO ‘it’ later”

Project Success === Produce planned deliverables, within budget, on time (including approved changes)
Source: http://4pm.com/2015/09/27/project-failure/

70% Of All Projects Fail
Source: http://4pm.com/2015/09/27/project-failure/

@dawnieando from
PAST LEGACY, PRESENT APPLICATION, FUTURE PLANS

@dawnieando from
SEOs WAIT A LONG TIME FOR DEV CHANGES
https://moz.com/blog/how-long-are-seos-waiting-for-their-most-important-changes
Over 40% wait 12 months+ to get their most crucial SEO changes implemented

@dawnieando from
MAJOR CAUSES OF THIS
https://moz.com/blog/how-long-are-seos-waiting-for-their-most-important-changes
“Legacy technology or outdated processes hampering progress”
“The change they want is ‘not possible’ with current platform (37%)”
Source: Will Critchlow, Distilled research, on Moz Blog, May 2016

@dawnieando from
Credit: Dilbert by Scott Adams
Blame it on the marketers
We’ll SEO ‘it’ later
FRAGILE … NOT AGILE

@dawnieando from
THE ACCUMULATION OF TECHNICAL DEBT

@dawnieando from
FIGHTING A LOSING BATTLE
UNPAID SEO TECHNICAL DEBT === SEO BANKRUPTCY

@dawnieando from
A Clean Slate
LET’S START WITH A CLEAN SLATE

@dawnieando from
Websites are not disposable
BUT…

@dawnieando from
SEARCH ENGINES NEVER FORGET
Search engines have a long memory and a lot of storage

@dawnieando from
A NEW URL HAS NO
BUT YOUR OLD ONES HAVE LOTS

@dawnieando from
Web Crawler System History Logs

GOOGLE WON’T WAIT (LONG) FOR YOU TO DEAL WITH YOUR TECHNICAL DEBT

@dawnieando from
THE CHALLENGE IS NOT IN INDEXING… BUT IN KEEPING EVERYTHING INDEXED UP TO DATE

@dawnieando from
“If ‘change’ means ‘any change’, then about 40% of all web pages change weekly [12]. Even if we consider only pages that change by a third or more, about 7% of all web pages change weekly [17].” (Broder, A.Z., Najork, M. and Wiener, J.L., 2003)
EVEN AS FAR BACK AS 2003
40% of ALL web pages changed weekly
7% of web pages changed 1/3 or more of their page content weekly

@dawnieando from
HOW MUCH BIGGER & MORE DYNAMIC IS THE WEB NOW IN 2017?
http://www.internetlivestats.com/total-number-of-websites/

@dawnieando from
INCREMENTAL CRAWLING NEVER ENDS
“Crawling method based on crawl frequency based on URL historical change & importance rate”
Crawling Which Never Ends: Ongoing

@dawnieando from
CRAWLING FRONTIER
Shestakov, D., 2013, July. Current challenges in web crawling. In International Conference on Web Engineering (pp. 518-521). Springer, Berlin, Heidelberg.

@dawnieando from
The Crawling ‘Frontier’ (THE URL QUEUE)
‘TO BE EXPLORED’ (OR REVISITED)

@dawnieando from
URLs Are Prioritized By Importance &
Take Their Place in The Frontier Queue
(New & Revisit)
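
As a rough illustration of a frontier that orders URLs by importance, here is a minimal sketch. The class name `Frontier`, the importance scores and the example URLs are all hypothetical; real crawler scheduling is far more elaborate than this.

```python
import heapq
import itertools


class Frontier:
    """Toy URL frontier: the highest-importance URL is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()               # stand-in for a URL-seen test
        self._order = itertools.count()  # tie-breaker keeps insertion order

    def add(self, url, importance):
        if url in self._seen:
            return False                 # already queued or visited
        self._seen.add(url)
        # heapq is a min-heap, so negate importance to pop the largest first
        heapq.heappush(self._heap, (-importance, next(self._order), url))
        return True

    def pop(self):
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)


# Hypothetical importance scores for three URLs on one site
frontier = Frontier()
frontier.add("https://example.com/", 0.9)
frontier.add("https://example.com/old-page?sessionid=1", 0.1)
frontier.add("https://example.com/category/widgets", 0.6)
crawl_order = [frontier.pop() for _ in range(len(frontier))]
```

In this toy model the low-importance legacy URL still gets its turn; it simply waits behind everything scored above it.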

@dawnieando from
DATA FROM HISTORY LOGS CONTRIBUTE TO WHEN TO REVISIT URIs ON THE WEB

@dawnieando from
PAST DATA IS A GREAT PREDICTOR OF FUTURE DATA
PREDICTION BASED PRIORITY SCHEDULING … WHEN THERE IS CONSISTENCY
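
One way to picture prediction-based scheduling is a sketch like the one below. It assumes, purely for illustration, that a page's history is a list of flags (changed or not) sampled once per day; the function names and the 1 to 90 day bounds are hypothetical and not taken from any search engine.

```python
def estimate_change_rate(changed_flags):
    """Fraction of past visits on which the page had changed."""
    if not changed_flags:
        return 0.5  # no history yet: assume a middling change rate
    return sum(changed_flags) / len(changed_flags)


def revisit_interval_days(changed_flags, min_days=1, max_days=90):
    """Pages that changed often in the past are revisited sooner."""
    rate = estimate_change_rate(changed_flags)
    if rate == 0:
        return max_days
    # Roughly the expected gap between changes, clamped to the allowed range
    return max(min_days, min(max_days, round(1 / rate)))


news_page = [True, True, True, True]        # changed on every past visit
stale_page = [False, False, False, False]   # never changed
product_page = [True, False, False, False]  # changed on 1 visit in 4
```

The prediction is only as good as the history is consistent: a page that flips between patterns gives the scheduler little to work with.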

@dawnieando from
‘Sampling’ in Crawling for Efficiency
‘SMALL TEST VISITS TO A SITE TO UNDERSTAND WHETHER IT IS WORTH CRAWLING & UNDERSTAND URL PATTERNS & RESOURCES THERE’
CRAWLING ‘HINTS’ & ‘HINT RANGES’

@dawnieando from
DUSTBUSTER & DUST CRAWLING RULES
DO NOT CRAWL IN THE DUST
BUILDS ‘HINTS’ ON WHAT NOT TO CRAWL
EVERY SITE WILL HAVE ITS OWN CRAWLING RULES

@dawnieando from
Popular CMS ’Rule Patterns’ (URL Parameters)
ALL WILL HAVE COMMON CANONICALIZATION PATTERNS WHICH CAN BE LEARNED
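
To make the idea of a learnable canonicalization pattern concrete, here is a minimal sketch of one such rule applied to parameterised URLs. The `IGNORED_PARAMS` set is a hypothetical per-site rule (parameters assumed not to change page content), not something any crawler is known to use as written.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical per-site rule: parameters assumed not to change page content
IGNORED_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "utm_content", "sessionid"}


def canonicalize(url, ignored=IGNORED_PARAMS):
    """Drop ignorable parameters and sort the rest so variants collapse to one URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    ]
    kept.sort()  # parameter order should not create a new URL
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", urlencode(kept), "")
    )


a = canonicalize("http://Example.com/shoes?utm_source=twitter.com&colour=red&size=9")
b = canonicalize("http://example.com/shoes?size=9&colour=red&sessionid=abc123")
```

Both variants collapse to the same address, which is the kind of duplicate a crawler would rather not fetch twice.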

@dawnieando from
Every Version of Your Past Ecommerce Sites
Had potential to spew “exponentially multiplicative URLs”… at some point…
DIFFERENT PARAMETERS & URL PATTERNS WHICH ARE LEARNED BY CRAWLERS… AND REMEMBERED… FOREVER

@dawnieando from
SEVERAL TYPES OF CRUFT MAY CONTRIBUTE

@dawnieando from
SOFTWARE ROT
CODE SMELL
SOFTWARE CRUFT

@dawnieando from
SPAGHETTI CODE

@dawnieando from
DEPRECATION

@dawnieando from
The hottest job on the block at one point
Once described by W3Schools as ‘The Developers Dream’
LEGACY CODE BASES & DEPRECATED VERSIONS
How Did That Work Out For Your SEO?

@dawnieando from
WHAT ABOUT ALL THAT CSS & JS
YOU COLLECTED?

@dawnieando from
PEOPLE APPEND (ADD TO FILES) -
SOMETIMES IT’S FEAR OF DEPENDENCIES

@dawnieando from
GUTENBERG
SOURCE: https://speckyboy.com/meet-greg-schoppe-developer-gutenberg/
“WordPress Core is a minefield of design decisions that were made for what WordPress was at the time, and didn’t age well” (Greg Schoppe, 2017)

@dawnieando from
https://managewp.com/statistics-about-wordpress-usage
WordPress now powers 26% of the web
HUGE EXAMPLE OF GENERATIONAL SOFTWARE CRUFT

@dawnieando from
CONTENT CRUFT

@dawnieando from
WELL… WE DID MAKE QUITE A BIT OF CONTENT
http://www.internetlivestats.com/total-number-of-websites/

@dawnieando from
CONTENT CRUFT
https://moz.com/blog/clean-site-cruft-before-it-causes-ranking-problems-whiteboard-friday

@dawnieando from
Poor quality content signals build up over time… incremental crawling just keeps on rolling and crawling… and gathering signals

@dawnieando from
Source: https://plus.google.com/u/0/+GlennGabe/posts/fXZw2BuSa5B
SIGNALS OF LOW QUALITY JUST KEEP COMPOUNDING OVER TIME

@dawnieando from
PEOPLE CANONICALIZE WRONG
ON MULTIPLE GENERATIONS OF SITES

@dawnieando from
GOOGLEBOT GETS WHERE WATER COULDN’T
https://petermeadit.com/blog/block-web-crawlers/

@dawnieando from
EVEN YOUR STAGING & DEV SITES
Found with a very simple wildcard * site: query

@dawnieando from
ARCHITECTURAL & SEMANTIC CRUFT

@dawnieando from
SEMANTIC LOSS
WONKY TOPICAL STRENGTH
HOW MUCH STUFF DID YOU MOVE AROUND OVER THE YEARS?

@dawnieando from
YOU BROKE YOUR SILO STRUCTURE
Image credit: https://www.slideshare.net/patrickstox/nlp-sitemap-smx-2016-patrick-stox-latest-in-advanced-technical-seo
SEMANTIC LOSS
YOU BROKE YOUR CORPUS ‘RELATEDNESS’
1st level relatedness
2nd level relatedness
MANY SIGNALS GONE

“You shall know a word by the company it keeps” (Firth, 1957)
(ITS CO-OCCURRENCE VECTOR)

CO-OCCURRENCE of words together & high commonality of other shared co-occurring words

@dawnieando from
RELATEDNESS EXAMPLES
APPLE: Eat, Bake, Cake, Peel
AUTOMOBILE: Accident, Traffic, Driver, Car, Motor
FURNACE: Hearth, Blast, Fiery, Gas, Electric
Miller, G.A. and Charles, W.G., 1991. Contextual correlates of semantic similarity. Language and cognitive processes, 6(1), pp.1-28.

@dawnieando from
‘CONCEPT DRIFT’ IS A THING
fuzzy: difficult to perceive; indistinct or vague.
synonyms: blurry, blurred, indistinct, unclear, bleary, misty, distorted, out of focus, unfocused, lacking definition, low resolution, nebulous, ill-defined, indefinite, vague, hazy, imprecise, inexact, loose, woolly
"a fuzzy picture"
https://en.wikipedia.org/wiki/Concept_drift
AI ALERT

@dawnieando from
BOOLEAN LOGIC – EXTREME CASES
OF TRUTH - (TRUE (1) OR FALSE (0))

@dawnieando from
FUZZY LOGIC
• Rule based logic
• Been around for 20+ years
• Is within a subset of AI

@dawnieando from
‘FUZZY LOGIC’ – DEGREES OF TRUTH
SEMANTIC LOSS

@dawnieando from
FUZZY LOGIC – DEGREES OF TRUTH
0.8: Doc ID likely to be a correct URI to choose from term / query cluster
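
The contrast between Boolean truth and degrees of truth can be sketched as follows. The min, max and complement operators are the standard fuzzy AND, OR and NOT; the candidate URLs, signal names and degrees are hypothetical values chosen only to mirror the 0.8 on the slide.

```python
def fuzzy_and(*degrees):
    """Fuzzy AND: only as true as the weakest input."""
    return min(degrees)


def fuzzy_or(*degrees):
    """Fuzzy OR: as true as the strongest input."""
    return max(degrees)


def fuzzy_not(degree):
    """Fuzzy NOT: the complement of a degree of truth."""
    return 1.0 - degree


# Hypothetical degrees of truth (0.0 to 1.0) for two candidate URLs
candidates = {
    "/widgets/": {"topically_relevant": 0.9, "is_canonical": 0.8},
    "/widgets/?sort=price": {"topically_relevant": 0.9, "is_canonical": 0.3},
}

scores = {
    url: fuzzy_and(signals["topically_relevant"], signals["is_canonical"])
    for url, signals in candidates.items()
}
best = max(scores, key=scores.get)
```

When several generations of a site leave many candidates with middling degrees, the gap between the best and the rest narrows, which is the blurring the deck describes.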

@dawnieando from
Semantics & concept relatedness may be the ‘secret sauce’ when it comes to ‘precision’ over ‘recall’

@dawnieando from
TWO-PHASE RANKING IN A SEARCH NODE
Presented by B. Cambazoglu at European Summer School in Information Retrieval 2017 (Cambazoglu, B.B. and Baeza-Yates, R., 2011. Scalability challenges in web search engines. In Advanced topics in information retrieval (pp. 27-50). Springer Berlin Heidelberg.)

@dawnieando from
URL CRUFT

@dawnieando from
‘URL CRUFT’ IS A THING
“characters relevant or meaningful only to the people who created the site, such as implementation details of the computer system which serves the page. Examples of URL cruft include filename extensions such as .php or .html, and internal organizational details such as /public/ or /Users/john/work/drafts/.[9]” (Wikipedia Definition)

ALL THE RANDOM URLS YOU CREATED OVER THE YEARS & SITES (EVEN BY ACCIDENT)

@dawnieando from
410 Gone
• “Some, we’ll just kill off with a 410…”
• “Then the URLs will be gone”

@dawnieando from
https://www.youtube.com/watch?v=xp5Nf8ANfOw
THE DIFFERENCE BETWEEN HOW GOOGLE TREATS 404 VERSUS 410s

@dawnieando from
302 == Default
301 == Intentional
404 == Default
410 == Intentional
“The 410 response is primarily intended to assist the task of web maintenance by notifying the recipient that the resource is intentionally unavailable and that the server owners desire that remote links to that resource be removed.” (RFC 7231)
https://tools.ietf.org/html/rfc7231#section-6.5.9
ARE YOU SURE?
MAYBE YES
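
The default-versus-intentional distinction can be sketched as a tiny routing function. The path sets below are hypothetical site rules; the status codes come from Python's standard `http.HTTPStatus` enum.

```python
from http import HTTPStatus

# Hypothetical site rules
LIVE = {"/widgets"}
MOVED = {"/old-widgets": "/widgets"}  # content lives on at a new URL
RETIRED = {"/2014-summer-sale"}       # deliberately removed, no replacement


def respond(path):
    """Return (status, redirect_target) for a requested path."""
    if path in LIVE:
        return HTTPStatus.OK, None
    if path in MOVED:
        return HTTPStatus.MOVED_PERMANENTLY, MOVED[path]  # 301: intentional
    if path in RETIRED:
        return HTTPStatus.GONE, None                      # 410: intentional
    return HTTPStatus.NOT_FOUND, None                     # 404: default
```

Note that this only describes what the server says. It makes no claim about how often a crawler will come back to check.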

@dawnieando from
https://twitter.com/JohnMu/status/903904602617204738

@dawnieando from
DO NOT THINK 410s WON’T BE
RECRAWLED AGAIN
Source:
https://www.docsplace.org/4578/09/410-gone-stops-crawling-dead-urls/

@dawnieando from
“We knew there was content there at some point so we just swing by every now and then to see if anything came back” (John Mueller, 2016)
In Reality… Gone Is Never Gone

@dawnieando from
A URL IS ‘NOT’ CONTENT
IT IS A LOCATION WHERE A RESOURCE LIVES / LIVED
IT MERELY JUST BOILS DOWN TO A DOC ID MAPPED TO TERM IDS IN A MATRIX

@dawnieando from
ZOMBIES ARE NEVER GONE
NO URLS ARE EVER GONE
ONLY THE RESOURCE THERE IS GONE
https://www.seroundtable.com/google-410-indexing-22584.html
5 YEARS LATER

@dawnieando from
HOW ABOUT 14 YEARS LATER?
https://www.webmasterworld.com/google/4864613.htm
2 HOURS ALIVE… 14 YEARS LATER

@dawnieando from
SMALL TOPICAL URL FISH IN A BIG TOPICAL POND
SEMANTIC LOSS

@dawnieando from
COME TO OUR ANNUAL EVENT WITH
THE SAME NAME BUT A NEW URL EVERY YEAR

@dawnieando from
“COOL URIs DON’T CHANGE”
Sir Tim Berners-Lee (Inventor of the World Wide Web)
https://www.w3.org/Provider/Style/URI
Attribution: By Uldis Bojārs (Flickr) [CC BY-SA 2.0 (http://creativecommons.org/licenses/by-sa/2.0)], via Wikimedia Commons

@dawnieando from
YOU END UP WITH A CONGA LINE OF LEGACY URLS, SUBDOMAINS & VARIOUS SITE PROTOCOLS
…In the URL queue

@dawnieando from
URL_SEEN TEST
YOU CAN’T JUST KEEP TRYING TO JUMP THE INDEXING QUEUE EITHER
PUSH INDEXING: E.G. FETCH AS GOOGLEBOT & SUBMIT TO INDEX, XML SITEMAP SUBMISSIONS
PULL INDEXING: VISITS BY NATURAL CRAWLING & DISCOVERY OF URLS / URL VISIT SCHEDULING / REVISITS

@dawnieando from
CRAWL CRUFT IS A SYMPTOM

@dawnieando from
IMPORTANCE TIERING FOR SCALE (EFFICIENCY)

@dawnieando from
TWO STEPS FORWARD & ONE STEP BACK
STRONG CANONICAL CONTEXT URL
YES YES YES
NO NO
YES YES YES YES YES
NO NO NO NO
ANOTHER OR MULTIPLE WEAK ALTERNATIVES

@dawnieando from
PAST DATA ON CHANGE IS A GREAT PREDICTOR OF FUTURE DATA
PREDICTION BASED PRIORITY SCHEDULING … WHEN THERE IS CONSISTENCY
“past changes to a page are a good predictor of future changes. This result has practical implications for incremental web crawlers that seek to maximize the freshness of a web page collection or index.” (

@dawnieando from
TO BUILD PROBABILITY & PREDICTABILITY MODELS

@dawnieando from
BASED ON ROLLING AVERAGES / SIGNALS FROM PAST CRAWL VISITS

@dawnieando from
‘Transitive’?? - ‘THE WHOLE TREE IS ROTTEN’
Transitive: if A == B and B == C then A == C
For some types of content more than others – e.g. ecommerce/directories but not news
SAMPLING

@dawnieando from
CRAWL SAMPLES ALSO HELP WITH MODELLING TO MAP DOCS TO TOPIC RELEVANCE & RELATEDNESS

@dawnieando from
TOPICAL DILUTION & URL IMPORTANCE DILUTION
RELATEDNESS HELPS WITH ‘GROUNDING’ (confirmation of other signals)

@dawnieando from
SOME SYMPTOMS
WRONG URL RANKING
‘SWAPPING OUT’ (Especially multiple child nodes)
SHARP & VOLATILE RANKING FLUX

@dawnieando from
SOME SOLUTIONS
• How can you change the hints associated with your site for better rankings and SEO?

@dawnieando from
ACKNOWLEDGE & CALCULATE THE DEBT

@dawnieando from
THE BOSTON MATRIX & SEO
PRIORITIZE THE DEBT
Axes: MARKET GROWTH, MARKET SHARE
Cash Cows (High converting queries & URLs)
Dogs: Low return, low conversion queries / URLs
Question Marks (Jury’s out)
Stars: High potential. Worth more effort

@dawnieando from
ADD EVERYTHING TO GSC FROM THE PAST & PRESENT
THERE MAY STILL BE UNDETECTED ACTIVITY GOING ON THERE

@dawnieando from
IDENTIFY PAGES IN QUERY CLUSTERS (QUERY CLASSES & INTENT MEETING SAME INFORMATION NEED CATEGORY)

@dawnieando from
REVIEW RELATIVE IMPORTANCE SIGNALS OF INTERNAL LINKS
ARE THESE REALLY AMONGST YOUR MOST IMPORTANT URLS?

@dawnieando from
BUT… REVIEW THE ‘RELATEDNESS’ OF INTERNAL LINKS TO PAGES
Domain URL
INTERNALLY LINKING PAGES TO THE TARGET URL
IS THE ‘RELATEDNESS’ HIGHLY RELEVANT TO ASSIST WITH CONTEXTUAL & SEMANTIC SIGNALS?
IS RELATEDNESS HIGH?

@dawnieando from
FIND SITES ON THE SAME SERVER

@dawnieando from
DIAGNOSE: HEAD BACK TO THE SERVER
YOU NEED TO KNOW WHAT’S ON THAT SERVER

@dawnieando from
SOME QUESTIONS TO ASK
HOW MANY MICRO-SITES HAVE YOU HAD?
HOW MANY SUBDOMAINS?
HOW MANY OTHER DOMAINS?
WHO IS RESPONSIBLE FOR DOMAIN REGISTRATION?
WHO KNOWS WITHIN THE ORGANISATION?
WHO REGISTERED THE DOMAINS?
WHO CAN UPDATE DNS RECORDS?
ARE THESE SITES STILL ON SERVERS?
HAVE ANY OF THESE SITES HAD MANUAL ACTIONS?
HOW ARE THESE SITES REDIRECTED?
ARE THEY PARKED DOMAINS?

@dawnieando from
Source: https://www.seroundtable.com/poll-log-files-seo-24523.html
NEARLY 1/3 OF SEOs SAY THEY DON’T NEED LOG FILES

@dawnieando from
DIAGNOSE: SERVER LOG FILE ANALYSIS
ANALYSE THE LOGS FOR ‘ALL’ YOUR SITES AND ‘ALL’ PROTOCOLS TO SEE THE CRAWL PATTERNS EMERGE
BUT WATCH OUT FOR OTHER TOOLS EMULATING GOOGLEBOT AND FILTER THEM OUT
NB: YOU MAY BE LOOKING AT URLS QUEUED LONG AGO
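
A minimal sketch of that filtering step follows. It assumes the server writes the common "combined" log format (adjust the pattern if yours differs) and treats a reverse DNS lookup ending in googlebot.com or google.com, confirmed by a forward lookup, as the verification check. The function names are hypothetical, and `verify` can be swapped out when working offline.

```python
import re
import socket
from collections import Counter

# Assumed "combined" log format; adjust to the server's actual format
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def is_google_hostname(hostname):
    """True if a reverse DNS name sits under a Google crawler domain."""
    return hostname.endswith((".googlebot.com", ".google.com"))


def verify_googlebot(ip):
    """Reverse DNS, then forward DNS back to the same IP (needs network access)."""
    try:
        hostname = socket.gethostbyaddr(ip)[0]
        return is_google_hostname(hostname) and socket.gethostbyname(hostname) == ip
    except OSError:
        return False


def googlebot_hits(lines, verify=verify_googlebot):
    """Count (path, status) pairs for requests that claim to be Googlebot and pass verify."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and "Googlebot" in match["agent"] and verify(match["ip"]):
            hits[(match["path"], match["status"])] += 1
    return hits
```

Running the same count over the logs of every legacy host and protocol is what lets the old crawl patterns show up side by side.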

@dawnieando from
SET UP A PLAN TO REPAY THE DEBT

@dawnieando from
MoSCoW Approach
MUST HAVE
SHOULD HAVE
COULD HAVE
WON’T HAVE THIS TIME

@dawnieando from
MoSCoW Prioritization
Source:
https://www.agilebusiness.org/content/moscow-prioritisation-0

@dawnieando from
SEO Refactoring Away
The Past Whilst Working
Towards The Future

@dawnieando from
TECHNICAL DEBT COMES WITH INTEREST TO BE REPAID VIA REFACTORING

@dawnieando from
Refactoring Definition
“Refactoring …is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior. Its heart is a series of small behavior preserving transformations.”
https://en.wikipedia.org/wiki/Code_refactoring

@dawnieando from
SEO REFACTORING
HOUSE KEEPING
WORKING ON THE PAST, PRESENT & FUTURE SIMULTANEOUSLY
USING APPROACHES LIKE MoSCoW
ONGOING ITERATIVE IMPROVEMENTS
ONGOING ROLLING AUDITS
PAYING OFF DEBT ‘A BIT AT A TIME’
MARGINAL GAINS

@dawnieando from
SKIP & DIVERT THE DEBT

@dawnieando from
HAVE YOUR SAY IN CRAWLING ‘RULES’
Help Google Build ‘Crawling Rules’ for your site rather than wasting time on ‘sampling’ and giving a bad impression
GIVE HELP AND GUIDANCE WITH THE CRAWL RULE AND HINT BUILDING

@dawnieando from
Help Google Build ‘Crawling Rules’ for your site rather than wasting time on ‘sampling’ and giving a bad impression
BE VERY CAREFUL

@dawnieando from
REVISIT ALL PAST .HTACCESS FILES
Can you rewrite the rules to be more efficient with regex or cut out some old rules still firing unnecessarily? (CREATE SHORTCUTS)
REMEMBER .HTACCESS RULES RUN IN ORDER OF THEIR APPEARANCE IN THE FILE. CAN YOU USE WILDCARDS TO OPTIMIZE OR SKIP STEPS?
.HTACCESS SITE 1
.HTACCESS SITE 2
.HTACCESS SITE 3

@dawnieando from
Learn By Heart Regular Expressions & How URLs with Multiple Parameters Are Handled
The most restrictive parameter blocked overrules lesser restrictions

@dawnieando from
FIND & CHOP BACK REDIRECT CHAINS
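
Chopping back a chain amounts to pointing every legacy URL straight at its final destination. A minimal sketch, assuming the redirects from each site generation have been collected into one mapping (the `legacy` dictionary and its URLs are hypothetical):

```python
def resolve(url, redirects, max_hops=10):
    """Follow a redirect map to its final target; return (final_url, hops)."""
    seen = {url}
    hops = 0
    while url in redirects:
        url = redirects[url]
        hops += 1
        if url in seen or hops > max_hops:
            raise ValueError(f"redirect loop or overly long chain at {url}")
        seen.add(url)
    return url, hops


def flatten(redirects):
    """Point every legacy URL straight at its final destination."""
    return {source: resolve(source, redirects)[0] for source in redirects}


# Hypothetical chain built up over three site generations
legacy = {
    "http://example.com/a": "http://www.example.com/a",
    "http://www.example.com/a": "https://www.example.com/a",
    "https://www.example.com/a": "https://www.example.com/new-a",
}
flat = flatten(legacy)
```

The flattened map can then be turned back into one-hop rules, and any loop is reported as an error instead of being silently followed.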

@dawnieando from
REVIEW & REMOVE REDUNDANT FILES

@dawnieando from
WHAT SUPERFLUOUS JAVASCRIPT &
CSS IS THERE UNNECESSARILY?

Prefer absolute URLs over relative URLs (particularly in WordPress)

@dawnieando from
REVIEW & UNDERSTAND - THE CANONICAL LINK RELATION
• 30X redirects
• Canonical tag
• Hreflang
• HTTPS protocol
• Global canonicalization rules
• URL normalization
In ‘ALL’ its forms
RFC 6596

@dawnieando from
CUT THROUGH TO THE DEVELOPERS

@dawnieando from
REBUILD STRONG SEMANTICS & ‘RELATEDNESS’

@dawnieando from
BE CAREFUL WITH THE CONTENT PRUNING ‘CHAINSAW’
Did you just ‘prune away’ your corpus ‘relatedness’?

@dawnieando from
UPCYCLING URLs
RATHER THAN ‘REMOVE’ CONSIDER ‘IMPROVE’
EXPAND, DE-GROUP & RE-GROUP

@dawnieando from
THOUGHTFUL QUERY CLUSTER BASED ‘PRUNING’, ‘CONTENT MORPHING’ AND ‘QUERY CLUSTER RE-GROUPING’

@dawnieando from
Pass Strong Clues - Highly Relevant New
Conceptual Structures
STRONG SEMANTICS & CONCEPTUALLY CO-OCCURRING TERMS

@dawnieando from
WHAT CORRELATES?
https://www.google.com/trends/correlate/search

@dawnieando from
SOLUTION: Wiki Page Redirects on Topics
https://dbpedia.org/sparql
Wikipedia Redirects
thesaurus.com
OR A GOOD OLD FASHIONED THESAURUS

@dawnieando from
TIE YOUR THEMATIC CORPUS BACK TOGETHER

@dawnieando from
USE ‘STRONGLY CONNECTED COMPONENTS’ (TOPICAL HUBS FOR FOCUSED CRAWLING) TO REAFFIRM THE SEMANTIC STRENGTH YOU ONCE HAD

@dawnieando from
BUILD RICH CONTENT HUBS FOR PRIMARY TARGET TOPICS
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. and Wiener, J., 2000. Graph structure in the web. Computer networks, 33(1), pp.309-320.
STRONGLY CONNECTED HUB

@dawnieando from
BUILD WELL CATEGORIZED AND CONCEPTUALLY STRUCTURED SITEMAPS
https://www.slideshare.net/patrickstox/nlp-sitemap-smx-2016-patrick-stox-latest-in-advanced-technical-seo

@dawnieando from
XML Sitemaps Are Your Friend… (Strong Foundations)
They help to pass ‘importance’ signals to URLs
But… never leave them to just autogenerate without periodically checking
‘The foundations’ underneath a site

@dawnieando from
CREATE WELL ORGANISED XML
SITEMAPS WITH IMPORTANT URLS

@dawnieando from
EXTERNALLY HOSTED XML SITEMAPS
• Take back control
• Jump the dev queue
• Allows for custom configuration of optimal canonical click paths
• Allows for consistent signals of importance to included URLs
• Forget about setting priority
• Forget about last modified
• Even a simple list of URLs FTW will do
• Keep them organised for granular analysis of problem site sections

“Increase and decrease
importance via internal
link optimization to
signal key quality
sections”
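
A bare-bones sitemap of important URLs, with no priority or last-modified values, can be produced with the standard library alone. This is a minimal sketch: the helper name `build_sitemap` and the example URLs are hypothetical, while the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (bytes) containing only <loc> entries."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


# Hypothetical list of the URLs that matter most
important_urls = [
    "https://www.example.com/",
    "https://www.example.com/widgets/",
]
xml = build_sitemap(important_urls)
```

Building one file per site section from separate lists keeps the sitemaps organised for section-level analysis.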

@dawnieando from
EXCLUDE LOWER QUALITY SITE SECTIONS (for now)
Excluded sections

@dawnieando from
DON’T LET THE DEBT HOLD YOU BACK
MAKE GREAT CONTENT & BRAND BUZZ

@dawnieando from
A ‘TWO-OARED’ ROWING BOAT GOES FURTHER

@dawnieando from
THEN… MONITOR & BE PATIENT

@dawnieando from
“It’s simple really, the businesses seeing growth in natural search are those implementing technical changes successfully, the most common cause of decline is either ignoring technical or getting it wrong.”
Tim Grice, Branded3, 2017
https://www.branded3.com/blog/link-spam-migration-disasters-penguin-organic-growth-2017/

@dawnieando from
Positive Consistency is KEY
‘ROLLING AVERAGES CAN GO BOTH WAYS’

@dawnieando from
APPENDIX & EDITOR’S CUT

@dawnieando from
BUT WHEN DATA IS INCONSISTENT FUZZY LOGIC MAY FAIL
‘DEGREES OF TRUTH’ MORE BLURRED / VAGUE

1st level relatedness
A measure of words that directly occur together in a text or ‘corpus’ (collection of documents together)
‘TWO WORDS WHICH TEND TO CO-OCCUR MUST BE RELATED’
CO-OCCURRENCE VECTORS
EXAMPLES: car/automobile, coast/shore, furnace/stove (Miller & Charles, 1991)

2nd level relatedness
Share common words they co-occur with aside from directly co-occurring together (both appear in same types of text as each other === related)
EXAMPLE: FURNACE & OVEN BOTH SHARE HEAT, MOTOR & ROAD, CAR & AUTOMOBILE BOTH SHARE PASSENGERS
CO-OCCURRENCE VECTORS
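
The two levels can be sketched with simple document-level co-occurrence counts. This is an illustrative toy, not how any search engine computes relatedness: the three-document corpus is made up, and cosine similarity is assumed here as the measure of shared company.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(documents):
    """Map each word to a Counter of the words it shares a document with."""
    vectors = defaultdict(Counter)
    for doc in documents:
        words = set(doc.lower().split())
        for word in words:
            for other in words - {word}:
                vectors[word][other] += 1
    return vectors


def first_level(vectors, a, b):
    """Number of documents in which a and b occur together."""
    return vectors[a][b]


def second_level(vectors, a, b):
    """Cosine similarity of the two co-occurrence vectors (shared company)."""
    va, vb = vectors[a], vectors[b]
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# Made-up corpus: furnace and oven never appear together but both keep company with heat
corpus = [
    "furnace heat gas",
    "oven heat bake",
    "car road driver",
]
vectors = cooccurrence_vectors(corpus)
```

Here furnace and oven have no 1st level relatedness, yet their 2nd level score is well above zero because they share "heat". Prune the pages carrying that shared context and the 2nd level signal goes with them.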

@dawnieando from
MORE Solutions
• Do a bit of ‘up front’ thinking (AVOID TECHNICAL DEBT IN FIRST PLACE)
• Measure SEO technical debt
• Refactor SEO technical debt away
• Reducing SEO technical debt should be inbuilt
• Accept some SEO (least impactful) technical debt is necessary for agility
• ‘Chip away’ at SEO technical debt

@dawnieando from
’Fuzzy’ URL Targets with Each Site Generation
EVERYTHING GETS A BIT BLURRED
‘Which is the target URL again?’

@dawnieando from
“The URL page importance score can be retrieved from the … URL history log … or it can be obtained by obtaining the historical page importance score for the URL for a predefined number of prior crawls and then performing a predefined filtering function on those values to obtain the URL page importance score.”
Scheduler for Search Engine Crawler
https://www.google.com/patents/US8042112
Importance record per crawl:
DOC ID     CRAWL 1  CRAWL 2  CRAWL 3  CRAWL 4  CRAWL 5  CRAWL 6
DOC ID 1   1        0.8      0.6      0.4      0.2      0
DOC ID 2   0        0.2      0.4      0.6      0.8      1
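
The quoted passage does not say what the filtering function is. As one possible reading, the sketch below assumes a plain moving average over the most recent crawls and applies it to the two example rows; the function name and the window size are assumptions.

```python
def importance_score(history, window=3):
    """Mean of the last `window` recorded importance values (assumed filter)."""
    recent = history[-window:]
    return sum(recent) / len(recent)


doc_1 = [1, 0.8, 0.6, 0.4, 0.2, 0]  # importance falling crawl after crawl
doc_2 = [0, 0.2, 0.4, 0.6, 0.8, 1]  # importance rising crawl after crawl

recent_doc_1 = importance_score(doc_1)
recent_doc_2 = importance_score(doc_2)
overall_doc_1 = importance_score(doc_1, window=6)
overall_doc_2 = importance_score(doc_2, window=6)
```

Over all six crawls the two documents average the same, while the three-crawl window separates them clearly, so the choice of window decides how quickly a change of direction shows up.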

@dawnieando from
Example
MoSCoW Prioritisation
MUST
HAVE SHOULD
HAVE COULD
HAVE WON’T
HAVE
THIS
TIME
Remove
infinite
loops Identify ideal
click

paths
Canonicalize to
superset e.g.
Page
title
rewriting
Redirect
true
dupes Check
server
log files Upcycle
conflict content e.g.
All
meta-‐description
Review
parameter
handling
Analyse queries
on

near-‐dupes
Strengthen
categories &

subcategories
(relevance)
Review
‘added-‐value’

difference
in
’similars’
Add
seasonal
&
TIME
IS
OF
THE
ESSENCE
content
pieces
(topical /

evergreen)
Review internal
link

popularity
of

important
pages
Build
topic
hub
static pages

(Strongly
connected

component)
Add
‘flow’
content
to

amplify
via
social
Check soft
404s Review
crawling
on

near dupes
&
similars
Sectional content
audit Add site
section

properties
in
GSC

Check
server
errors Review queries
on

similars
Add categorized
XML

sitemaps
Add
content
from sub
to

superset
&
canonicalize
PAST
PRESENT
FUTURE

@dawnieando from
URL NORMALIZATION
Can be problematic and ‘crufty’ too
https://en.wikipedia.org/wiki/URL_normalization

@dawnieando from
SOLUTION – Think Carefully About
Creating New Dynamic Parameters
QUEUEING… AGAIN
Waiting for good URLs to be visited… AGAIN

@dawnieando from
SEMANTIC DRIFT

@dawnieando from
A DOG IS NOT ALWAYS A DOG

@dawnieando from
BIG TOPICAL URL FISH IN A SMALL TOPICAL POND

@dawnieando from
TERM-FREQUENCY INVERSE DOCUMENT FREQUENCY
Architectural, URL, software & content cruft can also skew term-frequency inverse document frequency
AND THE QUERY CLUSTERS DOCUMENTS BELONG TO
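
A small sketch of how duplicate URLs can skew the weighting. It uses the textbook tf * log(N / df) form, which is only one of many TF-IDF variants, and the tiny corpora are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(term, doc, corpus):
    """Textbook tf * log(N / df) weight for one term in one document."""
    words = doc.lower().split()
    tf = Counter(words)[term] / len(words)
    df = sum(1 for d in corpus if term in d.lower().split())
    return tf * math.log(len(corpus) / df) if df else 0.0


clean = ["red widgets for sale", "blue gadgets guide", "contact our team"]
# Hypothetical cruft: the same page reachable at three extra parameterised URLs
crufty = clean + ["red widgets for sale"] * 3

before = tf_idf("widgets", clean[0], clean)
after = tf_idf("widgets", crufty[0], crufty)
```

In this toy corpus the duplicates raise the document frequency of "widgets", so the weight of the term on the original page drops even though the page itself has not changed.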

@dawnieando from
YOU INHERITED SEO TECHNICAL DEBT
• Previous content / link manual actions
• Previous algorithmic suppressions
• Past infinite loops
• “We’ll SEO it after launch”
• “SEO is dead… so we won’t optimise”
• Dodgy URL parameters
• SEO is a ‘one time audit’
• Misconfigured URL parameters
• Old URL crawling ‘rules / hints’

@dawnieando from
CRAWLING PATTERNS ARE DEVELOPED FOR EFFICIENCY - CRAWLERS TAKE ‘HINTS’ AND ‘HINT RANGES’ (rules / patterns)
Help Google Build ‘Crawling Rules’ for your site rather than wasting time on ‘sampling’ and giving a bad impression
GIVE HELP AND GUIDANCE WITH THE CRAWL RULE AND HINT BUILDING

@dawnieando from
“REL=NEXT / REL =
PREV” is NOT a form
of canonicalization

@dawnieando from
“301s and 302s are
BOTH forms of
canonicalization”

@dawnieando from
Href Lang is a form of
Canonicalization
(Internationalization)

@dawnieando from
History Log Records Include:
• URL fingerprint
• Timestamp (last crawl or download attempt)
• Crawl status (success or error) (Response code)
• Content checksum (binary code)
• Source ID (accessed from cache or downloaded)
• Segment identifier (Crawl segment assigned to??)
• Page importance (a measure of importance assigned to the URL)

@dawnieando from
INSTEAD OF REMOVE… CONSIDER… DISTRACT & ITERATIVELY IMPROVE

@dawnieando from
SYSTEM & PEOPLE CRUFT

@dawnieando from
PEOPLE CHURN
INTERNAL TEAM CHURN
EXTERNAL AGENCY CHURN

“The average staff turnover rate for agencies is 17% each year” (Drum, 2017)

“We’ve Always Done It
This Way”
HIPPO

“What’s The Business Case?”
Mowarrr Data
Please

@dawnieando from
THINK CAREFULLY ABOUT URL CREATION
Not EVERYTHING is worthy of its own URL
VARIANTS
STEMMINGS
PLURALS
RANDOM TAGS
LONG, LONG, LONG TAIL PARAMETERS

@dawnieando from
‘DANGLY’ NODES AND UNLINKED SITES

@dawnieando from
A CAT IS NOT ALWAYS A CAT

@dawnieando from
ARE THEY CHIPS OR ARE THEY CRISPS?

@dawnieando from
MIXED CONTENT & MULTIPLE SITE
VERSIONS
http://www.itv.com/news/

@dawnieando from
MIXED CONTENT & MULTIPLE SITE VERSIONS
http://www.itv.com/news/
BOTH HTTP & HTTPS FIGHTING EACH OTHER

@dawnieando from
ROGUE INTERNAL LINKS TO PREVIOUS DOMAIN

@dawnieando from
410s DO USE CRAWL BUDGET (MAYBE NOT TOO MUCH ON REVISITS, BUT THESE THINGS ADD UP).
THEY ALSO STILL NEED TO BE DISCOVERED WHICH USES BUDGET
https://twitter.com/dawnieando/status/906465965029969920

@dawnieando from
GENERATIONAL CRUFT CAN SNOWBALL
• Past infinite loops
• Dodgy URL parameters
• Misconfigured URL parameters
• Old URL crawling ‘rules / hints’
• Old ‘importance / quality’ scores
• Filtered dupes & near-dupes
• Mixed messaging canonicals
• 410s still being revisited
• Internal links to old sites / protocols

@dawnieando from
“Failure is simply a few
errors in judgement
repeated every day”
Jim Rohn

@dawnieando from
The Generational ’Snail Trail’
• Old XML sitemaps
• Redirects drop away on old site .htaccess
• DNS issues
• People link to old site but wrong protocol
• Old sites no longer verified in GSC
• Not all protocols redirecting
Leaving its slithery footprint

@dawnieando from
History Log Records Include:
• URL fingerprint
• Timestamp (last crawl or download attempt)
• Crawl status (success or error) (Response code)
• Content checksum (binary code)
• Source ID (accessed from cache or downloaded)
• Segment identifier (Crawl segment assigned to??)
• Page importance (a measure of importance assigned to the URL)
May be calculated by identifying historical importance scores based on past X number of crawls

@dawnieando from
EVERY SINGLE TIME YOU MIGRATE, CHANGE DESIGN, REDIRECT, REINVENT A SITE / URL
A CLEAN START
REDIRECTIONS
ANOTHER STRUCTURE
FIRST SITE STRUCTURE
NEW CRAWLING ‘RULES’ BUILT
CRAWLING ‘RULES’ BUILT
EVERYTHING IS ‘200 OK’
MORE URLs
MIXED RESPONSE CODES
REDIRECTIONS
‘FUZZINESS’ IS EMERGING
NEW CRAWLING ‘RULES’ BUILT
MORE URLs
REDIRECT CHAINS & MIXED RESPONSE CODES
NEW SEOs DON’T KNOW THE ‘HISTORY’
TARGET URLs NOW ‘VERY FUZZY’

@dawnieando from
SOLUTION: Wiki Page Redirects on Topics
https://dbpedia.org/sparql
Wikipedia Redirects

@dawnieando from
Time Seems To Fly… The Older You Get
Your new site URL is just one of very many historical URLs on your IP to be visited periodically
A tiny fish in a very big URL pond queue

@dawnieando from
A New Beginning
• “A new website will solve ALL our problems”
• “Let’s start again”
• “We’ll just migrate… and redirect everything”

@dawnieando from
A LONG, LONG TIME AGO
• You need to go right back to the beginning
• What domains did the organisation EVER register?
• Where do they redirect to?
• Is it via 301, 302 or are they merely parked domains?
• Who would know? Who is responsible?
• Verify them all in Google Search Console
• Some of these may EVEN HAVE PENALTIES HISTORICALLY
• If there are links to any there is likely still crawling activity there
• Analyse logs across multiple subdomains & protocols

@dawnieando from
SOME TYPES OF URL CRUFT
• INCORRECTLY APPLIED CANONICAL TAGS
• CONFLICTING HREF LANG & CANONICAL TAGS
• MIXED CONTENT
• URL SHORTENERS
• SESSION IDS
• UTM TAGGING
• OLD AJAX FRAGMENTS
• PARAMETERS FROM MULTI FACET DROP DOWN CHOICES
• .html, .php, .index.html, .aspx
• LEGACY URL REWRITING & PARAMETERS IN .HTACCESS FILES
• LEGACY FOLDERS WHICH CONTRIBUTE NO MEANING TO SITE ONTOLOGY
UNCRUFTY: www.myeasyurlwillmakeyouwonder.com/resume
CRUFTY: www.myeasyurlwillmakeyouwonder.com/resume.html
CRUFTY: http://nymag.com/scienceofus/2015/07/how-to-recover-from-an-all-nighter.html?om_rid=AAENcg&om_mid=_BTtFa0B869PyJp&utm_content=buffer8fdd1&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

‘RELATEDNESS’ (DISTRIBUTIONAL SIMILARITY)
1st level relatedness
2nd level relatedness

@dawnieando from
IT’S VERY IMPORTANT… YOU STAY OUT OF SERVER ERROR STATUS 500
‘Try again’ intervals likely extended between each failed connection attempt

@dawnieando from
“Forever,
And ever,
And ever,
And ever…
You’ll be a
URL”

@dawnieando from
LEGACY ISSUES VIA CANONICALS OR
REDIRECTION (COMMON MISTAKES)
• PAGE CANONICALIZED TO IS NOT A SUPERSET OR DUPLICATIVE (IT IS NOT RELEVANT ENOUGH)
• 301s TO IRRELEVANT PAGES BECOME SOFT 404s
• FOLDING UP PRODUCT PAGES TO CATEGORIES (PEOPLE WERE LOOKING FOR A SPECIFIC PRODUCT)
• CANONICALIZATION TO PAGES WHICH IN THE FUTURE 301 REDIRECT TO ANOTHER URL, THEREFORE NEGATING THE PAGES CANONICALIZING TO THEM
• CONFLICTS BETWEEN HREF LANG AND CANONICALIZATION

@dawnieando from
SOLUTION: Increase ‘Importance’ quickly of target URLs
• Internal link optimization
• Canonicalise to (if relevant)
• Strengthen up importance signals
• Inclusion in front facing HTML and XML sitemaps
• Improve the content & keep it updated
• 301 redirect to (if relevant redundant content)
• Topical hubs and strong information views to navigate users & add relevance

@dawnieando from
SOLUTION: Reduce ‘Importance’ quickly of old URLs
• Internal link UNOPTIMIZATION
• 410
• Dig out URLs with links to them
• Orphan URLs
• Canonicals to HTTPS
• EXCLUSION from XML sitemaps (even old ones on the server)
• Archiving of content

@dawnieando from
404 NOT FOUND & 410 GONE
• “Of course, we won’t redirect everything…”
• “Not everything will be worth redirecting”

@dawnieando from
“Usually seeing it (410) 1-2 times is enough for us to drop those URLs from the index”
John M on Google+ (https://plus.google.com/u/0/+JohnMueller/posts/NEsqE7Sr4Z4)

@dawnieando from
410s Likely Get Deindexed Quicker
https://plus.google.com/+JohnMueller/posts/NEsqE7Sr4Z4

@dawnieando from
“404 vs 410 doesn't affect the recrawl rate: we'll still occasionally check to see if these pages are still gone, especially when we spot a new link to them”
John Mueller, Google+ 2015
https://plus.google.com/u/0/+JohnMueller/posts/NEsqE7Sr4Z4
410 – DOES THAT PAGE NEED TO BE REINDEXED?

@dawnieando from
The URL Generational ‘Snail Trail’
• Old XML sitemaps
• Badly coded subcategory & attribute parameters
• Redirects drop away on old site .htaccess
• Canonicalizing and then later ‘301ing’ the ‘context’ URL (invalid canonical)
• DNS issues
• People link to old site but wrong protocol
• Old sites not verified in GSC
• Not all protocols redirecting
• Relative WordPress URLs appending /wwws on current viewed pages
• JS-fired URLs on language drop-down (internationalization) crawled
• Legacy Ajax issues with parts of page content pulled
• Canonical URLs NOT a superset or duplicate of the pages pointing canonicals at them
Leaving its slithery footprint
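
Two items on this list, links to the wrong protocol and protocols that do not all redirect, come down to testing every protocol and host variant of a URL. A minimal sketch that only enumerates the variants to test; the function name is hypothetical, and actually requesting each variant to confirm it 301s to the canonical is left out.

```python
from urllib.parse import urlsplit, urlunsplit


def host_protocol_variants(canonical_url: str) -> list:
    """List the http/https and www/non-www variants of a URL, excluding itself."""
    parts = urlsplit(canonical_url)
    host = parts.netloc
    bare_host = host[4:] if host.startswith("www.") else host
    variants = []
    for scheme in ("http", "https"):
        for candidate_host in (bare_host, "www." + bare_host):
            candidate = urlunsplit(
                (scheme, candidate_host, parts.path, parts.query, "")
            )
            if candidate != canonical_url:
                variants.append(candidate)
    return variants
```

Each returned variant should resolve to the canonical URL in a single hop; any that returns 200 or a 404 is part of the trail.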

@dawnieando from
INSTEAD OF REMOVE… CONSIDER… DISTRACT & ITERATIVELY IMPROVE
STRATEGIC USE OF INTERNAL LINK POPULARITY
REDUCE IMPORTANCE SIGNALS TO DIFFERENT PAGES
INCLUDE IMPORTANT PAGES IN XML SITEMAPS
EXCLUDE LOW IMPORTANCE PAGES FROM XML SITEMAPS
INCLUDE IMPORTANT PAGES IN HTML SITEMAPS
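
The include/exclude rule for XML sitemaps can be enforced at generation time. A minimal sketch using Python's standard library, assuming a hypothetical `pages` mapping from URL to an "important" flag decided elsewhere; how importance is judged is outside the scope of this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: dict) -> str:
    """Serialize an XML sitemap containing only URLs flagged as important.

    `pages` maps URL -> bool and is an assumed input for illustration.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, important in sorted(pages.items()):
        if important:
            url_element = ET.SubElement(urlset, "url")
            ET.SubElement(url_element, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

Regenerating the file from a single source of truth also avoids the stale sitemaps left on the server that the earlier slides warn about.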

@dawnieando from
“404 vs 410 doesn't affect the recrawl rate: we'll still occasionally check to see if these pages are still gone, especially when we spot a new link to them”
John Mueller, Google+ 2015
https://plus.google.com/u/0/+JohnMueller/posts/NEsqE7Sr4Z4
ESPECIALLY IF THERE ARE LINKS TO IT

@dawnieando from
Aged ‘Patchwork Quilt’ Sites
A LITTLE BIT OF THIS CMS AND A LITTLE BIT OF THAT CMS
MANY HISTORICAL PARAMETERS CREATED & CRAWLING SAMPLE PATTERNS

@dawnieando from
LACK OF PROCESS OR UNDERSTANDING
• Lack of process or understanding
• No or poor documentation to work to
• Insufficient testing facilities & staging / optimizing environments
• Lack of collaboration between depts
• Parallel development & version control issues (too much happening)
• Small improvements left till last
• Business pressures / business case demands
• Insufficient ‘up front’ definition (scope creep)
LOTS OF REASONS FOR TECHNICAL DEBT

@dawnieando from
A JAGUAR IS NOT ALWAYS A JAGUAR
Disambiguation

TECHNICAL DEBT IS NOT ALWAYS ABOUT BAD CODE
IT OFTEN COMES AS A RESULT OF MINIMUM VIABLE PRODUCT

@dawnieando from
THESE THINGS ADD UP
THEY ALSO STILL NEED TO BE DISCOVERED WHICH REQUIRES INITIAL CRAWLING
https://twitter.com/dawnieando/status/906465965029969920

LEGACY SITES COST BOTH TO MAINTAIN & IMPROVE
DOUBLE DEBT
DOUBLE INTEREST

@dawnieando from
REFERENCES

@dawnieando from
Sources & References
Bar-Yossef, Z., Keidar, I. and Schonfeld, U., 2009. Do not crawl in the dust: different URLs with similar text. ACM Transactions on the Web (TWEB), 3(1), p.3.
Broder, A.Z., Najork, M. and Wiener, J.L., 2003, May. Efficient URL caching for world wide web crawling. In Proceedings of the 12th international conference on World Wide Web (pp. 679-689). ACM.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. and Wiener, J., 2000. Graph structure in the web. Computer networks, 33(1), pp.309-320.
Cambazoglu, B.B. and Baeza-Yates, R., 2011. Scalability challenges in web search engines. In Advanced topics in information retrieval (pp. 27-50). Springer Berlin Heidelberg.
Cho, J., Garcia-Molina, H. and Page, L., 1998. Efficient crawling through URL ordering. Computer Networks and ISDN Systems, 30(1), pp.161-172.
Fetterly, D., Manasse, M., Najork, M. and Wiener, J., 2003, May. A large-scale study of the evolution of web pages. In Proceedings of the 12th international

@dawnieando from
Grice, T., 2017. Link spam, migration disasters and Penguin is nowhere to be seen - Organic growth in 2017. [ONLINE] Available at: https://www.branded3.com/blog/link-spam-migration-disasters-penguin-organic-growth-2017/. [Accessed 08 October 2017].
Olston, C. and Najork, M., 2010. Web crawling. Foundations and Trends® in Information Retrieval, 4(3), pp.175-246.
Pandey, S. and Olston, C., 2008, February. Crawl ordering by search impact. In Proceedings of the 2008 International Conference on Web Search and Data Mining (pp. 3-14). ACM.
Olston, C. and Pandey, S., 2008, April. Recrawl scheduling based on information longevity. In Proceedings of the 17th international conference on World Wide Web (pp. 437-446). ACM.
Pandey, S. and Olston, C., 2005, May. User-centric web crawling. In Proceedings of the 14th international conference on World Wide Web (pp. 401-411). ACM.

@dawnieando from
martinfowler.com. 2009. TechnicalDebtQuadrant. [ONLINE] Available at: https://martinfowler.com/bliki/TechnicalDebtQuadrant.html. [Accessed 03 October 2017].
Malte Ubl on Twitter - https://twitter.com/cramforce/status/897502737268592640
Is Technology Debt Bankrupting Your Competitiveness – Accenture 2017 - https://www.accenture.com/t20170504T221347__w__/ie-en/_acnmedia/PDF-43/Accenture-Strategy-Technology-Debt-PoV.pdf
Project Management Certification. 2015. Project Failure - Why Projects Fail So Often. [ONLINE] Available at: http://4pm.com/2015/09/27/project-failure/. [Accessed 30 September 2017].
Randall, K.H., Google Inc., 2010. Scheduler for search engine crawler. U.S. Patent 7,725,452.
https://patentimages.storage.googleapis.com/US8042112B1/US08042112-20111018-D00000.png

@dawnieando from
The Drum. 2017. On trend? The Wow Company reports on what the average UK agency looks like | The Drum. [ONLINE] Available at: http://www.thedrum.com/opinion/2017/04/12/trend-the-wow-company-reports-what-the-average-uk-agency-looks. [Accessed 28 September 2017].
