Exploring how you can harness the huge amounts of data available to build an effective, empirically led SEO strategy using machine learning resources such as natural language processing (NLP). Includes useful, practical tips on areas such as topic modelling, categorisation and clustering, so you can start using NLP in your own SEO strategy right away.
Croud Presents: How to Build a Data-driven SEO Strategy Using NLP
1. How to build a data-driven SEO strategy using NLP
Daniel Liddle, Croud - BrightonSEO Spring 2021
2. Croud in numbers.
● 2011 - Croud founded in London
● 222 - Internal staff, all shareholders
● 2,400 - In-market specialists, or ‘Croudies’
● 118 - Markets covered
● 22 - In-house developers building our tech stack
● 86 - Languages spoken
3. A full suite of digital growth services
Developing strategies across a suite of services for sustainable growth, driving immediate business impact:
Performance, Content, Creative, CRO, SEO, Experience, Paid social, Paid search, Programmatic, Amazon, Shopping, Feed management, Analytics & measurement, Reporting & data visualisation, Data science, Localisation, Strategy & planning, Marketing in China & Japan, Commerce, Data solutions, International.
4. About me
Pros:
● 6 years of digital marketing experience
● Worked extensively across content, technical,
SEO strategy, data/analytics etc.
● Passionate about machine learning & Python.
Cons:
● Nottingham Forest supporter
● Hair has become outrageous during the
pandemic
● Unimpressed by tapestry
5. What is an SEO strategy?
“Creating a transparent narrative combined with a strategic direction & plan to improve organic search traffic and drive relevant users.”
6. 4 pillars of a successful SEO strategy
● Growth - Identifying opportunities & new territories to target.
● Optimisation - Maximising your current site’s organic search footprint.
● Utilisation - What your activities are going to look like and how you prioritise.
● Measurement - How you’re going to measure success & forecast an ROI in terms of effort.
7. ...but there are difficulties
● Growth - Keyword research is prone to the biases of a brief, vanity metrics, the time needed etc. There is lots of data to contend with, which can be hard to prioritise.
● Optimisation - Opportunities are hard to quantify, and data can be conflicting.
● Utilisation - Direction from data can be misinterpreted.
● Measurement - Forecasting for SEO can be difficult because of algorithm updates and trends, and requires creating upper and lower yields for when SEO activity commences.
8. It’s more than just keywords
1. Intent - What is the user looking for when they type in a keyword? This can broadly be split into three areas: transactional, informational and navigational.
2. Accessibility - Can users access your pages quickly, and is the readability adequate? This is where we start to think about the more technical aspects like page speed or Core Web Vitals, but also the text which is on the page.
3. Format - What does the structure of the page need to look like? How best can you meet user needs in accordance with the SERP? These formats can come in the form of: a product page, a how-to, a listicle, an FAQ.
4. Authority - How trustworthy is your site? Is your content/products/brand receiving citations & links from authoritative sites?
Caveats & nuance apply throughout.
9. The “story” is important
“Creating a transparent narrative combined with a strategic direction & plan to improve organic search traffic and drive relevant users.”
With so much data available, we need to be able to distil it into something that anyone can understand.
10. We need narrative
Not everyone is a data scientist, statistician or even working
in SEO.
Being able to clearly communicate SEO activities to peers, clients and the wider masses is crucial. If someone asks you what the SEO strategy is, you shouldn’t have to pull out a spreadsheet or a deck with hundreds of slides.
Having that summary in your head is almost like having a traditional three
act structure:
● State of play - Where you are now.
● Action - What you are doing.
● Climax - Where you're going to be.
(Image: the Gilgamesh tablet, which is over 4,000 years old.)
11. Search is ever changing
With core changes to search algorithms like BERT, and other elements such as voice search becoming more powerful, Google and other search engines want us to talk more conversationally with machines.
This means that search queries are becoming broader, and with that comes more data. Broad keywords that get lots of search volume are now being siphoned into longer-tail keywords as they better match user intent.
For SEO it’s getting harder.
12. Why is intent important?
Search engines figure out the intent behind search queries in order to serve relevant results. It works - according to the American Customer Satisfaction Index, 79% of Google’s users were satisfied with their results.
13. How they’re doing it
“These improvements are oriented around improving language understanding, particularly for more natural language/conversational queries, as BERT is able to help Search better understand the nuance and context of words in Searches and better match those queries with helpful results.
Particularly for longer, more conversational queries, or searches where prepositions like “for” and “to” matter a lot to the meaning, Search will be able to understand the context of the words in your query. You can search in a way that feels natural for you.”
- Understanding searches better than ever before | Google
14. Natural Language Processing (NLP)
“Natural Language Processing or NLP is a field of Artificial Intelligence that gives the machines the ability to read, understand and derive meaning from human languages.”
15. Sorting the chaos
The way our language is constructed is what is called “unstructured data”. It’s difficult for a computer to dissect words into the columns and rows of a relational database, which is a lot like a spreadsheet.
There are also aspects of language that can be difficult to interpret:
● Spelling errors
● Spacing
● Symbols
● Colloquialisms
● Words that could be a noun or a verb
16. 4 reasons why SEOs need to utilise NLP
Considering that there are now tonnes of rich, easily accessible resources on how to use Natural Language Processing, we can all utilise disciplines once restricted to data scientists. But why?
● Lots of data - Now more than ever, we have access to huge data sets, from Google Search Console to Semrush.
● Search is getting messy - With all this data at hand, getting the actionable insights we need at a top level, like keyword intent, is rather time-consuming.
● Search is changing - Post-BERT, Google is tapping into more natural, conversational searches that will become harder to quantify.
● Critical insights - You can save your brain some computing power and gain core strategy recommendations plus actions.
17. Python is essential
Python is a highly dexterous programming language which can help you automate & process objectives. The benefits of using Python are its simplicity & support, especially within the SEO community.
● You don’t have to be a developer - Just search for “python for SEO” and you will see there’s a plethora of resources online.
● Collaborate - Share your projects and build on others’; the communal aspect is brilliant and helps everyone.
● Google Colab is your friend - A Google Research product which emulates a Jupyter Notebook, easily shareable and usable in the browser.
19. Topic modelling, clustering, taxonomy etc.
Why? When doing keyword research you have a lot of keywords to contend with; categorising them gives you a top-level view which can show you the areas with the most volume, and can also inform things like IA. Lots of keywords are semantically similar, and variations of a keyword may bring near-identical results, so you want to be able to cluster these queries together.
How? Using Python!
(Chart: sum of volume by category)
20. github.com/jroakes/querycat
Querycat is a demo repository created by JR Oakes which is helpful for categorising keywords, plus the ability to use BERT for visualisation. It’s simple(ish) to use; the GitHub page has a Google Colab notebook with the scripts already set up.
21. Apriori algorithm
‘A priori’ is Latin for ‘from the former’ - generally meaning knowledge that is self-evident, not based on experience. But in this case we’re talking about the Apriori algorithm, which uses Association Rule Mining (ARM), a machine learning method that is able to build correlations between itemsets.
ARM is used across the web for things like recommending attributes for purchases or content.
Using this on a set of keywords allows you to categorise keywords within a strict limit.
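The frequent-itemset idea behind Apriori can be sketched in plain Python. This is a minimal illustration of the technique (not the querycat implementation): treat each keyword as a "basket" of tokens, keep tokens and token pairs that clear a minimum support threshold, and label each keyword with the largest frequent itemset it contains.

```python
from collections import Counter
from itertools import combinations

def apriori_categories(keywords, min_support=2):
    # Treat each keyword as a 'basket' of tokens
    baskets = [set(kw.lower().split()) for kw in keywords]

    # Pass 1: single tokens meeting minimum support
    counts = Counter(tok for b in baskets for tok in b)
    frequent = {frozenset([t]) for t, c in counts.items() if c >= min_support}

    # Pass 2: token pairs built only from frequent singles (the Apriori pruning step)
    singles = {next(iter(s)) for s in frequent}
    pair_counts = Counter(
        frozenset(p)
        for b in baskets
        for p in combinations(sorted(b & singles), 2)
    )
    frequent |= {p for p, c in pair_counts.items() if c >= min_support}

    # Label each keyword with the largest frequent itemset it contains
    labels = {}
    for kw, b in zip(keywords, baskets):
        matches = [s for s in frequent if s <= b]
        labels[kw] = " ".join(sorted(max(matches, key=len))) if matches else "other"
    return labels

keywords = ["bucket hat", "bucket hat mens", "straw hat", "straw hat womens", "beanie"]
print(apriori_categories(keywords))
```

For a real 50,000-row Semrush export you would use a library implementation (e.g. querycat or mlxtend), but the categorisation principle is the same.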
22. Example
In this instance I looked at all the keywords associated with ‘hats’ in Semrush (around 50,000) and ran the CSV through Apriori, which was able to topic-model and categorise the keywords.
23. Creating your search universe
By quantifying all that data you can start to see patterns and the areas you want to target. You can even use this information for building out your information architecture and taxonomy.
(Chart: ‘Hats’ search universe - sum of volume by category)
24. Intent classification
Knowing which keywords to target doesn’t tell you the full picture; you’ll want to be able to extract the user intent for a better understanding of what type of content or page you need to create.
By using a pre-trained dataset combined with Ludwig (created by Uber engineering) and powerful Google machine learning products like TensorFlow, you can map each keyword to a specific intent. RankSense have a pretty nifty guide on how to do this which shouldn’t take more than an hour to complete.
(Diagram: keywords mapped to transactional, navigational or informational intent)
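The deck's approach uses Ludwig and TensorFlow with a pre-trained dataset; as a far simpler baseline, intent can be approximated with rule-based modifier lists. The word lists below are hypothetical and would need tuning for your own vertical:

```python
# Hypothetical modifier lists - tune these for your own vertical
TRANSACTIONAL = {"buy", "cheap", "price", "deal", "sale", "order"}
NAVIGATIONAL = {"login", "asos", "amazon", "ebay"}   # brand/site names
INFORMATIONAL = {"how", "what", "why", "guide", "tips"}

def classify_intent(keyword):
    """Assign one of the three broad intents based on modifier words."""
    tokens = set(keyword.lower().split())
    if tokens & TRANSACTIONAL:
        return "transactional"
    if tokens & NAVIGATIONAL:
        return "navigational"
    if tokens & INFORMATIONAL:
        return "informational"
    return "informational"  # default for ambiguous queries

for kw in ["buy bucket hat", "asos hats", "how to clean a straw hat"]:
    print(kw, "->", classify_intent(kw))
```

A trained classifier will generalise far better than keyword lists, but a baseline like this is useful for sanity-checking the model's output.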
26. Quantifying footprint
Approach - Using the same topic-modelling method mentioned earlier (the Apriori algorithm), you can create a quantified footprint. Using data based on current rankings, either via paid tools or even Google Search Console, you can see the key areas to tackle.
Go further - Mix that with intent classification and you can get key insights into where your activities should focus.
(Chart: clicks and impressions)
28. Entities
According to a Google patent called ‘Question answering using entity references in unstructured data’:
“...an entity may be a person, place, item, idea, abstract concept, concrete element, other suitable thing, or any combination thereof. Generally, entities include things or concepts represented linguistically by nouns.”
● Google uses entities to find broad relationships with keywords - e.g. James Joyce matched to being a writer - which in turn helps with matching intent and contextual search.
● Using this, we’re able to see entities possibly known to Google, which is powerful for knowing how Google might interpret the text.
29. Salience
You’ll see a salience score under each entity within the panel, which indicates the NLP model’s idea of the relevancy of that entity within the entire document.
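Real salience scores come from Google's Cloud Natural Language API. As a rough intuition for what the score captures, here is a toy scorer (not Google's algorithm) that weights entity mentions by frequency and position, then normalises so the scores sum to 1, as the API's do:

```python
from collections import defaultdict

def toy_salience(text, entities):
    """Toy salience: weight each entity mention by frequency and position
    (earlier mentions count more), then normalise so scores sum to 1.
    This is an intuition aid, not Google's actual algorithm."""
    scores = defaultdict(float)
    for i, word in enumerate(text.lower().split()):
        word = word.strip(".,!?")
        if word in entities:
            scores[word] += 1.0 + 1.0 / (i + 1)
    total = sum(scores.values()) or 1.0
    return {e: round(s / total, 3) for e, s in scores.items()}

doc = "Braids are a protective hairstyle. Braids suit natural hair and braids last weeks."
print(toy_salience(doc, {"braids", "hairstyle", "hair"}))
```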
30. Advanced: entity matching
By combining scraped content from your competitors with your own content, and comparing the two via Google’s natural language tool, you can see whether your content is aligned with what surfaces on the first page of the SERP.
Approach:
1. Map your URLs to your target keyword.
2. Scrape content from your site using Trafilatura (a web scraping tool).
3. Use Google’s natural language tool to extract the main entity on the page with its salience score.
4. Then, using your target keyword as the query, plug this into Querycat to gather the URLs on the first page of Google.
5. Scrape this content using Trafilatura and extract the main entity/salience score.
What you should be able to see is whether your content matches the competitor URLs, which should help you optimise accordingly. It will give you an overall idea of how much work needs to be done to pages in order to align with user intent.
(Table: competitor content entity vs primary content entity - match? - salience score)
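Once the entities and salience scores have been extracted (steps 1-5 above), assembling the final comparison table is a few lines of Python. The field names below are illustrative, not a fixed schema:

```python
def entity_match(primary, competitors):
    """Flag whether each competitor page's main entity matches your own.
    Inputs are dicts with url / entity / salience keys (illustrative names)."""
    target = primary["entity"].lower()
    return [
        {
            "url": page["url"],
            "entity": page["entity"],
            "salience": page["salience"],
            "match": page["entity"].lower() == target,
        }
        for page in competitors
    ]

primary = {"url": "example.com/hats", "entity": "Bucket hat", "salience": 0.61}
competitors = [
    {"url": "rival-a.com/hats", "entity": "bucket hat", "salience": 0.72},
    {"url": "rival-b.com/hats", "entity": "sun hat", "salience": 0.55},
]
for row in entity_match(primary, competitors):
    print(row["url"], "match:", row["match"])
```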
32. Key takeaways
To start actioning insights from Google’s Natural Language demo, there’s key research you will need to look at:
1. What does the tool say is the most relevant entity in your text? Does it relate to the keyword(s) targeted?
2. Is there a main entity in the text that isn’t being picked up?
3. Study the structure of competitor URLs and compare their relation to your own copy.
4. Look at the key attributes of language use, such as passive, assertive, positive, neutral and negative.
5. Is it able to semantically relate the text back to a relevant category?
33. Sentiment
Google describes sentiment analysis as
“... inspects the given text and identifies
the prevailing emotional opinion within
the text, especially to determine a writer's
attitude as positive, negative, or neutral.”
Why is this important?
● You can analyse whether the top results
are positive, negative or neutral.
● This can help with understanding the
intent behind search and how machines
are interpreting the consensus.
● Sentiment scores range from -1.0 (very negative) to 1.0 (very positive).
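Bucketing the -1.0 to 1.0 score into positive/negative/neutral labels for a set of SERP results might look like this (the 0.25 cut-offs are illustrative, not Google's):

```python
def sentiment_label(score):
    """Bucket a sentiment score in [-1.0, 1.0] into a label.
    The 0.25 cut-offs are illustrative; tune them for your corpus."""
    if score >= 0.25:
        return "positive"
    if score <= -0.25:
        return "negative"
    return "neutral"

# Hypothetical scores for the top three SERP results
serp_scores = {"result_1": 0.8, "result_2": -0.4, "result_3": 0.1}
print({url: sentiment_label(s) for url, s in serp_scores.items()})
```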
34. Syntax
Syntax looks at how copy is structured and the relationships it shares within the context of the corpus. It can give really good insight into how NLP is able to distinguish attributes within the dataset and how it finds non-linear connections in sentences.
35. Categories
It is also able to classify the content, based on the text, with a confidence score. There are over 600 categories in the database.
36. Catering for SERP features
● Keyword - Focus keyword for the copy.
● Literal Explainer - Articulate in text form a vivid description of the keyword.
● Relevancy - Origins and context.
37. Adding value to content
[Keyword] [Literal Explainer]. [Relevancy]. [Practicality]. [Resolve].
Keyword - Focus keyword for the copy.
Literal Explainer - Articulate in text form a vivid description of the keyword.
Relevancy - Origins and context.
Practicality - Its use, its look, longevity or benefit.
Resolve - Interest
“Braid or plait is typically a flat, solid, three-stranded patterned hairstyle, but there are many variations of complex decoration which can be formed. Braided hairstyles have been around for centuries, starting in the Bronze Age, and are now mostly associated with women with natural hair. If proper care is taken, this distinctive, unique look can last up to 12 weeks and provides excellent protection for healthy hair. Worn by such celebrities as Beyoncé, Naomi Campbell and Taylor Swift, you’re sure to find a plethora of inspiration from our articles below.”
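Assembling the boilerplate from its five parts is a simple templating job. A sketch, with the example values paraphrased from the braid copy above:

```python
def intro_paragraph(keyword, literal_explainer, relevancy, practicality, resolve):
    """Fill the [Keyword] [Literal Explainer]. [Relevancy]. [Practicality]. [Resolve]. boilerplate."""
    return f"{keyword} {literal_explainer}. {relevancy}. {practicality}. {resolve}."

# Values paraphrased from the braid example (illustrative only)
print(intro_paragraph(
    "Braid",
    "is typically a flat, solid, three-stranded patterned hairstyle",
    "Braided hairstyles have been around since the Bronze Age",
    "With proper care the look can last up to 12 weeks",
    "You're sure to find a plethora of inspiration in our articles below",
))
```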
38. Informing content via Google’s tool works
Large beauty publication ranking with 60k+ keywords on page 1.
Issue - Strong rankings, but key retail space increasingly being owned by SERP features and position zero.
Approach - Optimise existing articles in top-10 positions to follow the boilerplate in the first paragraph. Implemented in Q1/Q2 2020.
Results - 153% increase in gained featured snippets; 146% increase in URLs in positions 1-3.
39. Summarising text with TensorFlow
TensorFlow is an open-source platform for machine learning created by Google and is extremely powerful. We’re able to utilise it quite easily by using the transformers created by the Hugging Face team.
The one we are interested in is the SummarizationPipeline, where we can use T5 (the text-to-text transfer transformer) with the TensorFlow framework.
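The T5 SummarizationPipeline itself needs the transformers library and a model download; as a dependency-free illustration of the underlying idea, here is a tiny extractive summariser (a frequency-based stand-in, not T5) that returns the document's highest-scoring sentences:

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Score each sentence by the average document-frequency of its words
    and return the top-scoring sentence(s), in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(s):
        toks = re.findall(r"[a-z']+", s.lower())
        return sum(freq[t] for t in toks) / (len(toks) or 1)

    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    return " ".join(s for s in sentences if s in top)

text = ("3C hair is a curly hair type made up of tight coils. "
        "It can be prone to frizz in changing weather. "
        "Regular conditioning keeps 3C hair healthy and full of volume.")
print(extractive_summary(text))
```

T5 is abstractive (it generates new sentences rather than selecting existing ones), which is why the generated-text examples on the next slide read so naturally.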
42. How it looks with evergreen content
Topic of URL → Generated text
● Itchy Scalp → “Having an itchy scalp after hair dye is a fairly common issue. If you do have an itchy scalp after hair dye, make sure you wash your hair and scalp gently, but thoroughly.”
● Washing Hair → “Is there a better feeling than walking out of the salon with a freshly-dyed ‘do? Whatever color you’ve opted for, it’s important to think about your hair texture when creating a washing schedule.”
● 3c Hair → “3C hair is a curly hair type that is made up of tight coils with volume and lots of strands that are packed together to create this texture. 3C hair is first and foremost curly hair, so all the litanies hold true: It has a fuller volume and it’s prone to frizz and changes in weather and climate.”
43. Create meta descriptions using BERT
Alternatively, if you’re new to Python, Andrea Volpini has created an excellent article on how to do this, with a step-by-step Google Colab template that allows you to create meta descriptions from your website in minutes - well worth checking out.
45. Benchmarking
If you’ve followed the method for topic modelling, you should be able to find the key areas you want to target.
● For penetrating new territories, you can set a KPI to gain a footprint within the “bucket hat” sphere, for instance.
● For optimisations, you could look to improve CTR for branded terms.
Essentially it’s about setting the scene and creating that narrative.
(Chart: clicks and impressions)
46. Topic modelling + forecasting
Utilising an automated time-series model like Facebook Prophet can give you powerful insight into where specific categories of your SERP footprint are going.
1. You’ll need historical data, with dates, for each category, which is easy to pull from GSC. It may need formatting, but can be pulled via Data Studio.
2. Plug this into Facebook Prophet via Python and plot with matplotlib.
3. Set an upper or lower yield for certain dates based on SEO activity.
If you’re new to Facebook Prophet, Ahrefs has an excellent article titled ‘How to Use Data Forecasting for SEO’, with the scripts included.
48. What this approach will tell you
● State of play - Target areas for optimisation, and where you need to expand to reach the maximum number of users.
● Actions - Deliverables that need to be completed in order to capitalise on the targets identified.
● Climax - What these actions are going to drive, and the ROI in terms of effort.
50. What we’re doing using NLP
Natural Language Processing is not limited to the examples in this deck; there’s a huge amount of potential for SEOs to harness the quality that NLP can bring to any site. Here are some of the tools we’re currently working with:
● Ideation tool - Not just scraping the search results, but combining that with video, social media, forums and Reddit, plus exploring a time-series forecasting model with Google Trends to create meaningful content strategies at scale.
● Automated content briefs - Assisting content writers with detailed SERP analysis beyond keywords, looking at intent, TOV, structure and sentiment.
● Hreflang - For large international sites with minimal consistent transcreation, looking to spot similar content across multiple languages.
● Schema opportunity spotting - Relatively straightforward for things like FAQ schema, but using NLP with a pre-trained dataset to find on-site text that can utilise the many other schemas available.
51. Who to follow
● Everyone at BrightonSEO
● Andrea Volpini (WordLift)
● JR Oakes (Locomotive)
● Tiago Conclaves (Valtech)
● Ruth Burr Reedy (UpBuild)
And a big shout out to Hamlet Batista who is such an
influence within this space, and whose insight, talent
and personality will be sorely missed.
52. Tools & resources
● Python for Beginners
● Python for SEO by JC Chouinard
● Google NLP API Tool: Optimize Your Content to the Next Level (Semrush)
● Google Colab
● 6 SEO Tasks to Automate with Python by Winston Burton (Acronym)
● NLP in SEO: Is it worth your time? (OnCrawl)
● Advanced SEO Strategies using Natural Language Processing (WordLift)