The document discusses different types of duplicate content that can exist on websites, including perfect duplicates, near duplicates, partial duplicates, and content inclusion. It explains that search engines like Google have developed techniques to detect and handle different types of duplicate content differently. For example, perfect duplicates are filtered out before being indexed, while near duplicates or those with different URLs but similar text (DUST) may be indexed but not crawled as frequently to save resources. The document also discusses challenges around detecting different types of duplicate content and how search engines aim to return the most relevant result from a cluster of near-duplicate pages for a given query.
Duplicate Content Myths Types and Ways To Make It Work For You
1. @dawnieando from @MoveItMarketing
DUPLICATE CONTENT: MYTHS, TYPES &
WAYS TO MAKE IT WORK FOR YOU
2. @dawnieando from @MoveItMarketing
Duplicate Content Penalty ‘Myth’… It Just Won’t Die
Query Refinement Suggestion
Next Probable Queries on “near
duplicate urls can cause”
2017
3. At least 30% of the
web is a duplicate
of other pages on
the web
5. @dawnieando from @MoveItMarketing
The Duplicate Content ‘Penalty’ Myth
‘Real’ duplicates (matching
content checksum) filtered and
not indexed
“Each content filter sends the
retrieved web pages to Dupserver
to determine if they are duplicates
of other web pages”
http://www.google.ch/patents/US20120317089
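A minimal sketch of checksum-based filtering of exact duplicates, as described on the slide above. The function names, the whitespace normalisation and the choice of SHA-256 are assumptions made for illustration; the patent does not state which checksum Dupserver uses.

```python
import hashlib


def content_checksum(text: str) -> str:
    """Return a checksum of the page text (SHA-256 is an assumed choice)."""
    # Assumption: collapse runs of whitespace so trivial formatting
    # differences do not change the checksum.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_exact_duplicates(pages):
    """Keep the first URL seen for each distinct content checksum.

    `pages` is an iterable of (url, text) pairs.
    """
    seen = set()
    kept = []
    for url, text in pages:
        digest = content_checksum(text)
        if digest not in seen:
            seen.add(digest)
            kept.append(url)
    return kept


pages = [
    ("/a", "A rose is a rose"),
    ("/b", "A rose  is a rose\n"),  # same words, different whitespace
    ("/c", "A tulip is a tulip"),
]
kept_urls = filter_exact_duplicates(pages)
```

Only pages whose text is identical after normalisation collapse here; a single changed word produces a different checksum, which is why near duplicates need a different technique.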
7. @dawnieando from @MoveItMarketing
Handling Near-Duplicate Content Attracted Lots of Research
§ Dennis Fetterly
§ Marc Najork
§ Mark Manasse
§ Ziv Bar-Yossef
§ Monica Henzinger
§ William Pugh
§ Andrei Broder
Some Notable ‘Spot the
Difference’ Researchers
DETECTING DUPLICATES & NEAR-DUPLICATES
EARLY SAVES ON RESOURCES / EFFICIENCY
8. @dawnieando from @MoveItMarketing
Because… Near Duplicate Content is More Difficult to Detect
than Exact Duplicates
‘Detecting Duplicate and
Near-Duplicate Files’
IT’S AN ONGOING REAL
WORLD CHALLENGE
(Henzinger / Pugh, 2003, 2009, 2011,
2012, 2011, 2016)
These Google patents in the series
keep being ‘tweaked’ (A is not the
same as B)
9. @dawnieando from @MoveItMarketing
A lot of busy Googlebots & potential for duplicates
• The web doubled in
size 2010 – 2012
• Another 1/3 by 2015
• Finite search engine
resources
• Processes automated
for scale
“I just never have
any ‘me-time’
any more”
10. @dawnieando from @MoveItMarketing
Near
Duplicates
Do Not
Change
Often
SO… WHY
WASTE
RESOURCES
CRAWLING
THEM?
12. @dawnieando from @MoveItMarketing
The Slow Page Evolution of Near Duplicates
“Clusters of near-duplicate documents
are fairly stable: Two documents that
are near-duplicates of one another are
very likely to still be near-duplicates 10
weeks later”
(Fetterly, Manasse & Najork, 2003)
13. @dawnieando from @MoveItMarketing
… The Raters Guidelines still ask raters to catch ‘dupes’
In Fact… There’s
a whole section
of the guidelines
dedicated to
them
2017
14. @dawnieando from @MoveItMarketing
Mostly Stable For Years… But… The Web is Always Changing
2017
15. @dawnieando from @MoveItMarketing
Near-Dupes are still doing strange things
John Mu at International Search Summit
§ Nearly the same but not
the same still causes
confusion
§ Particularly problematic
on internationalization
§ But applies to all sites
with pages not the same
but ‘nearly-the-same’
2017
22. @dawnieando from @MoveItMarketing
DUSTBUSTER - Do Not Crawl in The Dust… Ziv Bar-Yossef
Reduce crawling and wasted
resources to low importance pages
CAVEAT: IT IS NOT
KNOWN WHETHER
THIS IS BEING USED
AT ALL. RESEARCH
AND THEORY
§ Builds crawling ‘rules’
§ Detects duplicate content
URL patterns
§ From small ‘sampling’ visits
§ Swerves ‘DUST’
§ DUSTBUSTER
§ Saves crawling resources
§ Potentially, popular CMS
configurations and URL
parameters could be used to detect ‘DUST’
2003
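The idea of learning crawl rules from a small sample can be illustrated with a toy sketch. This is not the DustBuster algorithm from the research; it is a simplified stand-in that assumes we already hold a sample of URLs with content checksums, and it flags query parameters that never change the content. The function name `dust_parameters` and the sample data are hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def dust_parameters(samples):
    """Return query parameter names that look like 'DUST'.

    `samples` maps URL -> content checksum. A parameter is flagged when,
    for every group of sampled URLs that differ only in that parameter,
    all pages in the group share one checksum.
    """
    checksums = defaultdict(lambda: defaultdict(set))
    counts = defaultdict(lambda: defaultdict(int))
    for url, checksum in samples.items():
        parts = urlsplit(url)
        pairs = parse_qsl(parts.query, keep_blank_values=True)
        for name in {key for key, _ in pairs}:
            kept = [(k, v) for k, v in pairs if k != name]
            stripped = urlunsplit(parts._replace(query=urlencode(kept)))
            checksums[name][stripped].add(checksum)
            counts[name][stripped] += 1

    dust = set()
    for name, by_url in checksums.items():
        # Only groups with at least two sampled URLs give any evidence.
        groups = [u for u in by_url if counts[name][u] > 1]
        if groups and all(len(by_url[u]) == 1 for u in groups):
            dust.add(name)
    return dust


samples = {
    "http://example.com/d?colour=red&sid=1": "aaa",
    "http://example.com/d?colour=red&sid=2": "aaa",
    "http://example.com/d?colour=blue&sid=1": "bbb",
    "http://example.com/d?colour=blue&sid=3": "bbb",
}
flagged = dust_parameters(samples)
```

In this sample the hypothetical `sid` parameter never changes the page, so a crawler could skip its variants, while `colour` does change the content and is left alone.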
38. @dawnieando from @MoveItMarketing
Shingling
A rose is a rose is a rose
N-Gram
(Where ‘n’ is the number of
words (tokens) in each
snapshot)
[A rose is a]
[rose is a rose]
[is a rose is]
(n = 4)
39. @dawnieando from @MoveItMarketing
SHINGLE
VECTORS
SUPERSHINGLE
MEGASHINGLE
Shingles, Supershingles & Megashingles
WORD ==
TOKEN
41. @dawnieando from @MoveItMarketing
http://corpus.tools/wiki/Onion
N-gram length
(word string)
POTENTIAL EXAMPLE
42. @dawnieando from @MoveItMarketing
http://corpus.tools/wiki/Onion
Dup Content
Threshold
e.g. 0.5
(50%)
POTENTIAL EXAMPLE
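To make the two settings above concrete (n-gram length and duplicate threshold), here is a hedged sketch that compares two texts by the share of n-word shingles they have in common. It uses Jaccard resemblance in the sense of Broder et al. (1997); it is not a description of how the Onion tool computes its own score, and the function names are hypothetical.

```python
def shingles(text: str, n: int = 4) -> set:
    """Return the set of distinct n-word shingles in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def resemblance(a: str, b: str, n: int = 4) -> float:
    """Jaccard resemblance: shared shingles divided by all shingles."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0  # assumption: two shingle-free texts count as identical
    return len(sa & sb) / len(sa | sb)


def is_near_duplicate(a: str, b: str, n: int = 4, threshold: float = 0.5) -> bool:
    """Flag a pair as near-duplicate at or above the threshold (e.g. 0.5)."""
    return resemblance(a, b, n) >= threshold


page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy cat"
page_c = "an entirely different sentence about red dresses"
```

With n = 4, `page_a` and `page_b` share five of seven distinct shingles, so they pass a 0.5 threshold; `page_c` shares none. Raising n or the threshold makes the test stricter.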
43. @dawnieando from @MoveItMarketing
Broder, A.Z., Glassman, S.C., Manasse, M.S. and Zweig, G., 1997. Syntactic clustering of the web. Computer
Networks and ISDN Systems, 29(8-13), pp.1157-1166.
44. @dawnieando from @MoveItMarketing
“We have developed an
efficient way to determine
the syntactic similarity of
files and have applied it to
every document on the
World Wide Web”
(Broder et al, 1997)
46. @dawnieando from @MoveItMarketing
Multiple Title
Candidates For A
Query
DYNAMIC,
CONTEXTUAL
SEARCH
49. @dawnieando from @MoveItMarketing
DUPLICATE CONTENT
TYPE – NEAR DUPE
(QUILTING)
UNIQUE
PARAGRAPH
EXTERNAL
SYNDICATED
EXTERNAL
SYNDICATED
EXTERNAL
SYNDICATED
HEADER - TEMPLATE
FOOTER - TEMPLATE
UNIQUE
PARAGRAPH
A
S
I
D
E
51. @dawnieando from @MoveItMarketing
UNIQUE
MAIN
CONTENT
BUT ITS
CONTENT IS
INCLUDED
ELSEWHERE
TEASER ‘INCLUDED’ ELSEWHERE
53. @dawnieando from @MoveItMarketing
Pages that look very different
but meet the same user
information need equally
55. @dawnieando from @MoveItMarketing
Possible Treatment of Near-Duplicate Query Candidates
“If more than one candidate is
determined to be part of a
’search query cluster’, the most
important one based on factors
such as relevance, freshness,
importance is returned. The
others are eliminated.”
(Henzinger / Pugh, 2012,2016)
Last updated
2016
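A sketch of the selection step quoted above: from each cluster of near-duplicate candidates for a query, return only the strongest one. The `Candidate` fields mirror the factors named in the quote, but the unweighted sum used as a score is purely an assumption; the patent does not publish a formula.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    cluster_id: str
    relevance: float
    freshness: float
    importance: float


def score(candidate: Candidate) -> float:
    # Assumption: an unweighted sum stands in for the real ranking signal.
    return candidate.relevance + candidate.freshness + candidate.importance


def pick_per_cluster(candidates):
    """Keep the highest-scoring candidate in each cluster, drop the rest."""
    best = {}
    for candidate in candidates:
        current = best.get(candidate.cluster_id)
        if current is None or score(candidate) > score(current):
            best[candidate.cluster_id] = candidate
    return list(best.values())


candidates = [
    Candidate("/red-dress?sort=price", "red-dress", 0.9, 0.2, 0.5),
    Candidate("/red-dress", "red-dress", 0.8, 0.9, 0.6),
    Candidate("/blue-dress", "blue-dress", 0.4, 0.1, 0.1),
]
winners = [c.url for c in pick_per_cluster(candidates)]
```

The two red-dress URLs compete inside one cluster and only one survives, which is the behaviour that can look like 'dilution' from the site owner's side.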
58. @dawnieando from @MoveItMarketing
How Are Choosing Strategies Catered For in Ecommerce?
§ FACETED NAVIGATION
& WEBSITE FILTERS ==
Allows for ‘Elimination by
Aspects’
§ PAGINATION == Reduces
‘Too Much Choice’ effects
§ SORTING == Caters for
‘FIRST / BEST’ choosing
strategies
CHOICE-ASSISTING
FUNCTIONALITY
HEURISTICS
59. @dawnieando from @MoveItMarketing
And with these choice-assisting functionalities come…
“Exponentially
multiplicative
URLs”
60. @dawnieando from @MoveItMarketing
Exponentially Multiplicative URLs From Faceted Navigation…
100 DRESSES
5 COLOURS
10 SIZES
2 LENGTHS
4 SUPPLIERS
100 x 5 x 10 x
2 x 4 =
40,000
URLs
61. @dawnieando from @MoveItMarketing
And that’s without HTTPS, WWW/non or internationalization
100 DRESSES
5 COLOURS
10 SIZES
2 LENGTHS
4 SUPPLIERS
100 x 5 x 10 x
2 x 4 =
40,000
URLs
X 2 BECAUSE…
HTTPS VERSION
80,000
URLs
X 2… BECAUSE…
WWW / NON
WWW VERSION 160,000
URLs
X 5…
BECAUSE…
EN / FR / ES /
DE / IT (e.g.)
800,000
URLs
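The arithmetic on these two slides can be checked with a short calculation. The dictionary keys are just labels for the figures given above; the numbers themselves come from the slides.

```python
from math import prod

# Facet counts from the slide: every combination yields its own URL.
facets = {"dresses": 100, "colours": 5, "sizes": 10, "lengths": 2, "suppliers": 4}
faceted_urls = prod(facets.values())

# Each further variant multiplies the total again.
with_https = faceted_urls * 2   # HTTP and HTTPS versions
with_www = with_https * 2       # www and non-www versions
with_languages = with_www * 5   # EN / FR / ES / DE / IT

print(faceted_urls, with_https, with_www, with_languages)
# 40000 80000 160000 800000
```

Because the factors multiply rather than add, each new facet or site variant scales the whole URL space, not just one section of it.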
63. @dawnieando from @MoveItMarketing
THAT’S A LOT
OF URLs FOR
100 DRESSES
Bored Googlebot
(Unrelated to speed)
67. @dawnieando from @MoveItMarketing
The Canonical Tag - Otherwise Known As… RFC 6596
‘THE
CANONICAL
LINK
RELATION’
2012
68. @dawnieando from @MoveItMarketing
‘The Canonical Link Relation’ – RFC6596 Is Still Adhered To
2017
69. @dawnieando from @MoveItMarketing
50% OF SEOs
“SEARCH ENGINES HAVE
IGNORED CANONICAL TAGS
THEY HAD IMPLEMENTED”
2017
71. @dawnieando from @MoveItMarketing
There Are Many Signals To Consider In Canonicalization
Manual Action: SUPER STRONG
404 & 410: STRONG - DIRECTIVE
301: STRONG - DIRECTIVE
302, 303, 307: STRONG - DIRECTIVE
Valid canonical from ‘context’ URL to valid target: STRONG - HINT
Valid hreflang (if present and applicable): STRONG - HINT
Fall back to default pre-‘Canonical Link Relation’ duplicate handling signals: DEFAULT
ALL NEED TO BE IN UNISON
HTTPS (Google Specific)
72. @dawnieando from @MoveItMarketing
“REL=NEXT / REL=PREV”
IS NOT A FORM
OF CANONICALIZATION
2017
77. @dawnieando from @MoveItMarketing
If a canonical is not deemed to be valid
there is a likelihood the pre-RFC6596
Canonical Link Relation treatment of
duplicates and near-duplicates will be
applied:
Such as ‘internal links’
COMMON CANONICAL MISTAKES
78. @dawnieando from @MoveItMarketing
“301s AND 302s ARE
BOTH A FORM OF
CANONICALIZATION”
2017
79. @dawnieando from @MoveItMarketing
Don’t canonicalize from an
“index” to a “noindex” or vice-versa
because this means the
pages are NOT the same.
The canonical will likely be
ignored
COMMON CANONICAL MISTAKES
If “hreflang” references an
alternative which does not
match a canonical link, the
canonical will likely be
ignored
84. @dawnieando from @MoveItMarketing
Hubs &
Authorities
BOWTIE OF THE WEB
Build Strongly Connected Components
87. @dawnieando from @MoveItMarketing
Focused
Crawling
CRAWLING
CONTENT ON A
SPECIFIC
TOPIC FOR
EFFICIENCY
88. @dawnieando from @MoveItMarketing
The ‘Mere’ Categorization Effect (Phenomenon) FTW
Simply labelling products /
items as being part of a
category, regardless of the label,
appears to increase
perception of variety &
positive experience (Mogilner
et al., 2008)
HUMANS LOVE CATEGORIES
TOO… IT IS A PHENOMENON
89. @dawnieando from @MoveItMarketing
Homonyms contribute to need for query refinement
HOMONYMS – WORDS THAT ARE SPELT OR PRONOUNCED
THE SAME BUT HAVE DIFFERENT MEANINGS
ROSE
EVENING WATCH
SINK
BACK
ARMS
BOW
CHECK
STRENGTHEN DIFFERENTIAL
CONTEXT
91. @dawnieando from @MoveItMarketing
LOCAL NAVIGATION RELEVANCE
TABLE OF CONTENTS STYLE IN
PAGE NAVIGATIONAL HEURISTIC
FOR SEARCH ENGINE AND
HUMAN
PAGINATED TAB THROUGH ON
SECTIONS OF REVIEW
GRANULAR
RELEVANCE
92. @dawnieando from @MoveItMarketing
Parameter Handling
“IS PARAMETER-HANDLING A
WAY TO HELP GOOGLE BUILD A
SET OF ‘DUSTBUSTER
CRAWLING RULES’ EARLY?”
MAKE THE
RULES
93. @dawnieando from @MoveItMarketing
ADD VALUE TO NEAR DUPES
(INFORMATIONAL VIEWS / INFORMATION
ARCHITECTURE)
94. @dawnieando from @MoveItMarketing
INFORMATION VIEWS
ADDING VALUE AND
PASSING STRENGTH TO
CANONICAL TARGETS
97. @dawnieando from @MoveItMarketing
BUILD STRONG SECTIONS
REVIEWS
BLOG
BUYING
GUIDES
COST
CALCULATORS
COMMERCE
MAIN SITE THEME (ONTOLOGY
SEMANTICS RULE)
UGC
99. @dawnieando from @MoveItMarketing
Related Content Mostly Adds Value To Other Content
That content is ‘stitched’ from elsewhere
But it is VERY useful overall & helps with
searcher ‘foraging’
To create context for what it links out to
100. @dawnieando from @MoveItMarketing
NOT Filtered before indexing
Doc IDs meeting
contextual
information needs
- 1 or 2 pages
(max) chosen at
query run-time
101. @dawnieando from @MoveItMarketing
NOT Filtered before indexing
Fighting with each
other to be ‘THE’
result
Seems like
‘dilution’
103. @dawnieando from @MoveItMarketing
CAN YOU ‘IMPROVE’, ‘DE-GROUP’ OR
‘REMORPH’ … RATHER THAN ‘REMOVE’?
104. @dawnieando from @MoveItMarketing
THAT’S A LOT
OF URLs FOR
100 DRESSES
Is The Difference Substantively Different To Queries?
106. @dawnieando from @MoveItMarketing
Content
meeting
informational
needs equally
is treated
differently TO
DUPLICATES
113. @dawnieando from @MoveItMarketing
Understand The Canonical Link Relation Rules – RFC6596
The target (canonical) IRI
MUST identify content that
is either duplicative or a
superset of the content at
the context (referring) IRI.
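One rough way to sanity-check the "duplicative or a superset" rule before adding a canonical is to measure how much of the referring page's text also appears on the target. The sketch below uses shingle containment for that. It is an illustrative heuristic only: RFC 6596 does not define a measurement, and the function names and the 3-word shingle size are assumptions.

```python
def shingles(text: str, n: int = 3) -> set:
    """Return the set of distinct n-word shingles in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(context_text: str, target_text: str, n: int = 3) -> float:
    """Share of the context page's shingles that also occur on the target."""
    context = shingles(context_text, n)
    target = shingles(target_text, n)
    if not context:
        return 1.0  # assumption: nothing to contain counts as contained
    return len(context & target) / len(context)


context_page = "red dresses in size 10"
target_page = "red dresses in size 10 with long sleeves and free delivery"

forward = containment(context_page, target_page)   # context inside target
backward = containment(target_page, context_page)  # target inside context
```

Here the target fully contains the context page (containment 1.0), so a canonical from context to target is consistent with the rule, while the reverse direction is not, since only a third of the target's shingles appear on the context page.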
115. @dawnieando from @MoveItMarketing
Without &filter=0 Appended to end of Query
https://www.google.co.uk/search?q=red+dresses+size+10+long+sleeves&oq=red+dresses+size+10+long+sleeves&aqs=chrome.0.69i59.12570j0j7&sourceid=chrome&ie=UTF-8
NOBODY HAS MORE
THAN ONE LISTING
116. @dawnieando from @MoveItMarketing
With filter=0 Appended to end of Query
https://www.google.co.uk/search?q=red+size+10+dresses+long+sleeves&oq=red+size+10+dresses+long+sleeves&aqs=chrome..69i57.13605j0j7&sourceid=chrome&ie=UTF-8&filter=0
ALL SITES HAVE AT
LEAST 2 LISTINGS
MISSED
OPPORTUNITIES
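For repeating this comparison across many queries, a small helper can append the parameter to a search URL. The helper name `with_filter_off` is hypothetical, and the effect of `filter=0` is as described on these slides; it is a search engine behaviour that may change.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_filter_off(url: str) -> str:
    """Return `url` with a single filter=0 query parameter at the end."""
    parts = urlsplit(url)
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "filter"  # drop any existing filter value first
    ]
    params.append(("filter", "0"))
    return urlunsplit(parts._replace(query=urlencode(params)))


filtered = "https://www.google.co.uk/search?q=red+dresses+size+10+long+sleeves"
unfiltered = with_filter_off(filtered)
# https://www.google.co.uk/search?q=red+dresses+size+10+long+sleeves&filter=0
```

Comparing the results for the two URLs shows which of a site's pages were clustered away from the default listing.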
120. @dawnieando from @MoveItMarketing
Similar Content – Query Refinement SERPs
NOT FILTERED
NOT NEAR-DUPES
Does the searcher
want ‘gas
engineers, heating
engineers, central
heating?’
121. @dawnieando from @MoveItMarketing
CONFUSING DUPLICATE, NEAR-DUPLICATE
(DUST) AND SIMILAR
CONTENT COULD COST YOU
DEARLY
Maybe a lot of people are confused by duplicates?
§ Be careful about canonicalizing
when unnecessary
§ True duplicate content &
near-dupes are query and category
agnostic
§ Similar is not duplicate
§ You may still have the answers
to different queries based on a
small important difference
§ AT LEAST 4 TYPES OF
DUPLICATE CONTENT
2017
124. @dawnieando from @MoveItMarketing
Problems With The Many ‘Faces’ of Faceted Navigation
https://webmasters.googleblog.com/2014/02/faceted-navigation-best-and-5-of-worst.html - Wednesday, February 12, 2014
Example of faceted navigation:
http://www.example.com/category.php?category=gummy-candies&price=5-10&price=over-10
Facet means ‘little faces’ (USEFUL TRIVIA)
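A short sketch of how the example URL above breaks down into facets, and of one way to compare faceted URLs regardless of parameter order. The helper name `facet_signature` is hypothetical, and treating reordered parameters as the same view is an assumption about the site, not a rule from the Google post.

```python
from urllib.parse import parse_qs, parse_qsl, urlsplit

url = "http://www.example.com/category.php?category=gummy-candies&price=5-10&price=over-10"

# Repeated parameters are collected into a list per facet name.
facets = parse_qs(urlsplit(url).query)
# {'category': ['gummy-candies'], 'price': ['5-10', 'over-10']}


def facet_signature(faceted_url: str) -> tuple:
    """Return the path plus sorted (name, value) pairs for comparison."""
    parts = urlsplit(faceted_url)
    return (parts.path, tuple(sorted(parse_qsl(parts.query))))


reordered = "http://www.example.com/category.php?price=over-10&price=5-10&category=gummy-candies"
same_view = facet_signature(url) == facet_signature(reordered)
```

Two URLs with the same signature show the same facet selection, which is one source of the duplicate URLs discussed earlier in the deck.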
125. @dawnieando from @MoveItMarketing
Relation Links – ‘Web Linking’
https://tools.ietf.org/html/rfc5988
Web LINKING – RFC 5988
INTERNET ENGINEERING TASK FORCE
126. @dawnieando from @MoveItMarketing
Internationalization – An Additional Layer of Complexity
‘TAGS FOR IDENTIFYING
LANGUAGES’ – RFC 5646
https://tools.ietf.org/html/rfc5646
INTERNET ENGINEERING TASK
FORCE
127. @dawnieando from @MoveItMarketing
A Solution - The Introduction of Hreflang
Wikipedia page on hreflang
Rules on hreflang
https://support.google.com/webmasters/answer/182192?hl=en&ref_topic=2370587 - MULTINATIONAL & MULTILINGUAL SITES AND HREFLANG
https://support.google.com/webmasters/topic/2370587?hl=en&ref_topic=4598733 - HREFLANG Google
https://support.google.com/webmasters/answer/2620865?hl=en&ref_topic=2370587 - USE A SITEMAP FOR HREFLANG
https://support.google.com/webmasters/answer/6144055?hl=en&ref_topic=2370587 - LOCALE-AWARE CRAWLING WITH GOOGLEBOT
128. @dawnieando from @MoveItMarketing
INTERNATIONALIZED RESOURCE IDENTIFIER
IRI
Internationalized Resource Identifiers (IRIs)
RFC 3987
https://tools.ietf.org/html/rfc3987
INTERNET ENGINEERING TASK FORCE
130. @dawnieando from @MoveItMarketing
References & Sources
Fetterly, D., Manasse, M. and Najork, M., 2003. On the evolution of clusters
of near-duplicate web pages. Journal of Web Engineering, 2(4), pp.228-246.
Broder, A.Z., Glassman, S.C., Manasse, M.S. and Zweig, G., 1997. Syntactic
clustering of the web. Computer Networks and ISDN Systems, 29(8-13),
pp.1157-1166.
Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R.,
Tomkins, A. and Wiener, J., 2000. Graph structure in the web. Computer
networks, 33(1), pp.309-320.
Mogilner, C., Rudnick, T. and Iyengar, S.S., 2008. The mere categorization
effect: How the presence of categories increases choosers' perceptions of
assortment variety and outcome satisfaction. Journal of Consumer
Research, 35(2), pp.202-215.
131. @dawnieando from @MoveItMarketing
References & Sources
http://www.seobythesea.com/2008/02/new-google-process-for-detecting-near-duplicate-content/
Pugh, W. and Henzinger, M.H., Google Inc., 2016. Detecting duplicate and
near-duplicate files. U.S. Patent 9,275,143.
Alonso, O., Fetterly, D. and Manasse, M., 2013, December. Duplicate news
story detection revisited. In Asia Information Retrieval Symposium (pp. 203-214).
Springer Berlin Heidelberg.
RFC 5988 – Web Linking - https://tools.ietf.org/html/rfc5988
132. @dawnieando from @MoveItMarketing
References & Sources
Najork, M., 2012, August. Detecting quilted web pages at scale.
In Proceedings of the 35th international ACM SIGIR conference on
Research and development in information retrieval (pp. 385-394). ACM.