Using topic modelling frameworks for NLP and semantic search

As the volume of content continues to grow exponentially, helping search engines understand the context and topical themes within your site is increasingly important. This deck covers some of the underlying concepts and ways to utilise them in your marketing strategy.

Published in: Marketing

  1. 1. Using Topic Modelling To Win Big with NLP & Semantic Search Dawn Anderson @DawnieAndo from @MoveItMarketing
  2. 2. If I said to you… “I’ve got a new jaguar”
  3. 3. “It’s in the garage” (sidenote: this is not my garage)
  4. 4. You probably wouldn’t expect to see this
  5. 5. Who said anything about cars
  6. 6. “I’ve got a new jag”
  7. 7. “Jag is neither a car nor a cat”
  8. 8. Polysemy in linguistics is problematic
  9. 9. DISCLAIMER I am NOT a data scientist
  10. 10. But I will be talking about some concepts covering: Data Science, Information Retrieval, Algorithms, Linguistics, Information Architecture, Library Science
  11. 11. Which are areas relevant to the search industry
  12. 12. These are all connected to how search engines find the right information, for the right informational need at the right time
  13. 13. ‘information retrieval’ To extract informational resources to meet a search engine user’s information need at time of query.
  15. 15. Crawl (and render) the haystack = Crawling frontier
  16. 16. Organise the straw into bales – indexing (2 wave)
  17. 17. Chunking and Tokenization
  18. 18. Inverted Index: Text to Doc ID Mapping
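The tokenization and inverted-index slides above can be sketched in a few lines. This is a minimal illustration, not how any particular search engine does it: the three documents, the regex tokenizer and the function names are all assumptions made up for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_inverted_index(docs):
    """Map each token to the sorted list of doc IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

# Hypothetical mini-corpus: doc ID -> text
docs = {
    1: "I've got a new jaguar in the garage",
    2: "The jaguar is a big cat",
    3: "My new car is in the garage",
}

index = build_inverted_index(docs)
print(index["jaguar"])  # doc IDs whose text contains "jaguar"
print(index["garage"])  # doc IDs whose text contains "garage"
```

A query then only has to look up its tokens in the index instead of scanning every document.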
  19. 19. To then…
  20. 20. But there is so much hay
  21. 21. 8 YEARS AGO
  22. 22. Every day there are huge volumes of new indexable data
  23. 23. But we only want to return one (or a few) needle (s) of hay
  24. 24. Accelerating technological developments have made search even more complicated
  25. 25. One example … Google’s mobile-first indexing plans
  27. 27. Time and space, distance, speed of movement come into play
  28. 28. Contextual Search Exacerbates everything further
  29. 29. A lot of users might be interested in topical foraging too (information foraging theory)
  30. 30. They might want to learn about the whole topic of hay or straw
  31. 31. Or they may be researching to buy a car and want lots of different types of information on cars
  32. 32. You think there are several types of SEO? Local SEO Technical SEO Schema Specialist Content Marketer Outreach Specialist Digital PR
  33. 33. There are at least as many niche areas of Information Retrieval Mobile IR Contextual Search Natural Language Processing Conversational Search Similarity Search Recommender Systems Library Science
  34. 34. The problem is… words are hard
  35. 35. Every other word in the English language has multiple meanings
  36. 36. But…If we understood a topic is about cats we would recognize a jaguar
  37. 37. On their own single words have no semantic meaning
  38. 38. How can we understand these word meanings?
  39. 39. Using structured data is an obvious way to disambiguate
  40. 40. Structured versus unstructured data • Structured data – high degree of organization • Readily searchable by simple search engine algorithms or known search operators (e.g. SQL) • Logically organized • Often stored in a relational database
  41. 41. Relational database systems
  42. 42. Knowledge Graphs
  43. 43. Mapping RDF Triples
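As a rough sketch of how subject / predicate / object triples help disambiguate, the snippet below stores a handful of triples as Python tuples and looks them up. The entity names, predicates and helper functions are invented for illustration; a real knowledge graph would use URIs and a triple store.

```python
# Each fact is a (subject, predicate, object) triple, as in RDF.
# All names below are hypothetical and exist only for this example.
triples = [
    ("Jaguar_(animal)", "type", "BigCat"),
    ("Jaguar_(animal)", "livesIn", "Rainforest"),
    ("Jaguar_(car)", "type", "Car"),
    ("Jaguar_(car)", "keptIn", "Garage"),
]

def objects(subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def subjects_of_type(type_name):
    """Return every subject declared to be of the given type."""
    return [s for s, p, o in triples if p == "type" and o == type_name]

print(subjects_of_type("BigCat"))         # entities typed as big cats
print(objects("Jaguar_(car)", "keptIn"))  # where the car entity is kept
```

Because the two "jaguar" entities carry different types and relations, a mention of a garage points towards the car rather than the cat.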
  44. 44. Entities
  45. 45. Conversational search The knowledge graph is checked first
  46. 46. Ontology Driven Natural Language Processing Image credit: IBM https://www.ibm.com/developerworks/community/blogs/nlp/entry/ontology_driven_nlp
  47. 47. Even named entities can be ambiguous / polysemic • Amadeus Mozart (composer) • Mozart Street • Mozart Cafe
  48. 48. Australian towns using English names (NSW example only)
  49. 49. An area of IR dedicated to understanding the ambiguous needs for queries with multiple meanings
  50. 50. How can we fill in the gaps between named entities?
  51. 51. When there is so much noise
  52. 52. There are still many open challenges in natural language processing
  53. 53. Text cohesion • Cohesion is the grammatical and lexical linking within a text or sentence that holds a text together and gives it meaning. • It is related to the broader concept of coherence. (Wikipedia)
  54. 54. ‘Topic Modelling’ According to Wikipedia: “In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body”.
  55. 55. A collection of text based web pages (corpus)
  56. 56. A collection of text based documents (corpus)
  57. 57. Term Clarification Machine learning -> dataset Information retrieval -> corpora / corpus
  58. 58. We can disambiguate through co-occurrence
  59. 59. “You shall know a word by the company it keeps” (John Rupert Firth, 1957)
  60. 60. Using similarity & relatedness
  61. 61. Relatedness is NOT about structured data
  62. 62. 2 words are similar if they co-occur with similar words
  63. 63. First Level Relatedness – Words that appear together in the same sentence
  64. 64. 2 words are similar if they occur in a given grammatical relation with the same words Harvest Peel Eat Slice
  65. 65. Second Level Relatedness – words that co-occur with the same ‘other’ words
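First and second level relatedness can be illustrated with a toy co-occurrence count. This is only a sketch under simple assumptions: the four sentences, the one-word stop list and the function names are made up, and whole sentences are used as the co-occurrence unit.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical mini-corpus, one sentence per string
sentences = [
    "harvest the apple",
    "peel the apple",
    "peel the orange",
    "slice the orange",
]
STOP_WORDS = {"the"}

def first_level(sentences):
    """First level: words that appear together in the same sentence."""
    cooc = defaultdict(set)
    for sentence in sentences:
        words = {w for w in sentence.lower().split() if w not in STOP_WORDS}
        for a, b in combinations(sorted(words), 2):
            cooc[a].add(b)
            cooc[b].add(a)
    return dict(cooc)

def second_level(cooc, word):
    """Second level: other words that co-occur with the same words as `word`."""
    own = cooc.get(word, set())
    related = {}
    for other, neighbours in cooc.items():
        shared = own & neighbours
        if other != word and shared:
            related[other] = shared
    return related

cooc = first_level(sentences)
print(cooc["apple"])                # first level neighbours of "apple"
print(second_level(cooc, "apple"))  # words sharing a neighbour with "apple"
```

Here "apple" and "orange" never share a sentence, yet both co-occur with "peel", which is the second level signal the slides describe.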
  66. 66. We need more words
  67. 67. WordSim353 Dataset: some words with high similarity or relatedness (Word 1 / Word 2: human mean score)
      tiger / tiger: 10; fuck / sex: 9.44; journey / voyage: 9.29; midday / noon: 9.29; dollar / buck: 9.22; money / cash: 9.15; coast / shore: 9.1; money / cash: 9.08; money / currency: 9.04; football / soccer: 9.03
      magician / wizard: 9.02; type / kind: 8.97; gem / jewel: 8.96; car / automobile: 8.94; street / avenue: 8.88; asylum / madhouse: 8.87; boy / lad: 8.83; environment / ecology: 8.81; furnace / stove: 8.79; seafood / lobster: 8.7
      mile / kilometer: 8.66; Maradona / football: 8.62; OPEC / oil: 8.59; king / queen: 8.58; murder / manslaughter: 8.53; money / bank: 8.5; computer / software: 8.5; Jerusalem / Israel: 8.46; vodka / gin: 8.46; planet / star: 8.45
  68. 68. A Moving Word ‘Context Window’
  69. 69. Typical window size might be 5. Source text (repeated on the slide as the window moves along it): “Writing a list of random sentences is harder than I initially thought it would be”. 11 words (5 left and 5 right of the moving target word)
  70. 70. To build vector space models
  71. 71. Vector representations of words (Word Vectors)
  72. 72. Vector space models
  73. 73. Word embeddings example
  74. 74. Nearest Neighbours (Similarity) Evaluations KNN – K-Nearest-Neighbour
  75. 75. Tensorflow & Word2Vec
  76. 76. Continuous Bag of Words (CBOW) Takes a continuous bag of words (word order is ignored) and uses a context window of size n (an n-gram) to ascertain which words are similar or related, using Euclidean distances to create vector models and word embeddings
  77. 77. The opposite of CBOW (continuous bag of words) Skip-gram model
  78. 78. Feed it WordPairs
  79. 79. Both models learn the weights of the similarity and relatedness distances
  80. 80. Vector space models are being expanded beyond Word2Vec Word2Vec Doc2Vec Sentence2Vec Paragraph2Vec
  81. 81. 01 Word2Vec: single words; word embeddings • 02 Doc2Vec: words & metadata; word embeddings; document embeddings • 03 Sentence2Vec: chunks of words; more context available • 04 Paragraph2Vec: full paragraphs; even more context and semantics
  82. 82. In order to understand what words in documents constitute ‘relevance’ to a query
  83. 83. Testing Similarity and Relatedness http://ws4jdemo.appspot.com
  84. 84. GloVe: Global Vectors for Word Representation • What is GloVe? • “GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.” • (https://nlp.stanford.edu/projects/glove/)
  85. 85. Linear Substructures in GloVe • Sometimes more than one word pair is needed to understand meaning • Particularly when the words are opposites of each other (e.g. man and woman) • By adding additional word pairs, further semantic hints provide context to understand the meaning of concepts
  86. 86. GloVe: Nearest Neighbour Cosine Similarity • Nearest words to Frog • https://nlp.stanford.edu/projects/glove/ Glove2Vec
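To make the nearest-neighbour idea concrete, here is a minimal cosine similarity sketch in plain Python. The three-dimensional vectors are invented toy values, not real GloVe or Word2Vec embeddings (which typically have tens or hundreds of dimensions), and `nearest` is a hypothetical helper name.

```python
import math

# Toy word vectors; the numbers are made up purely for illustration.
vectors = {
    "frog":   [0.90, 0.80, 0.10],
    "toad":   [0.85, 0.75, 0.15],
    "lizard": [0.70, 0.60, 0.30],
    "car":    [0.10, 0.20, 0.90],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word, k=2):
    """Return the k words whose vectors are most similar to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

for neighbour, score in nearest("frog"):
    print(neighbour, round(score, 3))
```

With these toy values "toad" and "lizard" come out closest to "frog", while "car" points in a very different direction.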
  87. 87. Concept2Vec
  88. 88. Concept2Vec Ontological concepts
  89. 89. Concept Graphs Using Relatedness
  90. 90. Wikipedia is a gold mine for IR researchers – each page is considered a concept
  91. 91. Similarity and Relatedness Similarity – words that mean the same or nearly the same Relatedness - Words that live together within a topic / co-occur in the same corpora / collection / sub-section of a collection
  92. 92. Part of speech tagging (POS)
  93. 93. ‘Part of Speech’ (POS) tagging
  94. 94. A website is NOT unstructured data It has a hierarchy It has weighted sections It has metadata It (often) has a tree like structure
  95. 95. • BM25 • BM25+ • BM25L • OKAPI BM25
  96. 96. BM = BEST MATCH
  97. 97. On long documents BM25 fails
  98. 98. Probably BM25F is used for web pages • BM25F allows for web pages which have structure (additional fields), compared with normal flat text output (e.g. from text files) • Takes into consideration elements such as page title, metadata, sections, footers, headers, anchor text • Adds weights for different elements on a page
  99. 99. Anchor text is included in BM25F
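For orientation, the sketch below scores documents with plain Okapi BM25, not BM25F; a BM25F variant would additionally weight term frequencies per field (title, anchor text and so on). The documents, the whitespace tokenizer and the `k1` and `b` values (commonly quoted defaults) are assumptions for the example.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with plain Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Length normalisation: longer-than-average documents are damped
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Hypothetical documents
docs = [
    "jaguar big cat rainforest",
    "jaguar car garage",
    "new car in the garage today",
]
scores = bm25_scores("jaguar cat", docs)
print(scores)
```

The first document matches both query terms and scores highest; the third matches neither and scores zero.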
  100. 100. Semi-structured data • Hierarchical nature of a website • Tree structure • Well sectioned and including clear containers and meta headings • An ontology map between semi and structured
  101. 101. Lexical ‘nyms’ Antonym – The opposite meaning Synonym – The same meaning Meronym – Part of something else (whole) – e.g. finger (hand) (Part / whole relations) Hyponym – A subset of something else – e.g. fork (cutlery) Hypernym – A superset (superordinate) – e.g. colour hypernym
  102. 102. TF:IDF Term frequency: Inverse document frequency
  103. 103. TF:IDF LOCAL v GLOBAL? (The whole document collection) Across your site?? Across all documents relevant for the topic??
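A small TF:IDF calculation shows why the local v global question matters: the IDF part depends entirely on which collection you count document frequencies over. This sketch uses one common formulation (relative term frequency times the natural log of N over document frequency); other weighting variants exist, and the three documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document in the collection."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency is counted over whichever collection is passed in
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical collection; swap it for "your site" or "all docs on the topic"
docs = ["hay bale hay", "straw bale", "needle hay"]
weights = tf_idf(docs)
print(weights[1])  # "straw" is rarer in the collection than "bale"
```

Running the same function over a different collection changes every IDF value, so a term that looks distinctive on your own site may be unremarkable across the whole topic.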
  104. 104. A website is NOT unstructured data It has a hierarchy It has weighted sections It has metadata It (often) has a tree like structure
  105. 105. Keyword Stuffing or TF:IDF Weights
  106. 106. Query Intent Shift
  107. 107. “Easter” Query Intent Shift
  108. 108. Predicting the future with Web Dynamics • The journey to predict the future: Kira Radinsky at TEDxHiriya
  109. 109. Find out what correlates and when
  110. 110. How can we improve our topical relatedness?
  111. 111. Tell Me About Your Haystack
  112. 112. Cancel your noise • Unstructured data is voluminous • Filled with irrelevance • Lacks focus • Riddled with stopwords • Lots of meaningless text and further ambiguating jabber
  113. 113. Disambiguate lean content with powerful structured data
  114. 114. And strong linking nearest neighbour topically rich pages
  115. 115. Use well organised hyponyms (Hyponymy and Hypernymy): a simple unordered list with children • Cutlery (hypernym) • Spoons (hyponym + hypernym): Dessert, Tea, Table (co-hyponyms) • Forks (hyponym + hypernym) • Knives (hyponym + hypernym): Carving, Steak, Butchers (co-hyponyms)
  116. 116. Image alt tags (and image title tags) help with disambiguation too
  117. 117. Stemming & Lemmatization Both aim to take a word back to its common base form Avoid keyword stuffing… be aware of stemming and lemmatization
  118. 118. Tables are relational databases too – use liberally (with headers)
      ID | Event Name         | Event Type | Event City | Event Country
      1  | Ungagged Las Vegas | Conference | Las Vegas  | US
      2  | Ungagged London    | Conference | London     | UK
      3  | State of Digital   | Conference | London     | UK
      4  | Brighton SEO       | Conference | Brighton   | UK
  119. 119. Widget Logic / Widget Context
  120. 120. Stay in your topical lane Topical drift / dilution is a big problem
  121. 121. Explore topical siloes
  122. 122. Merge content but watch out for topical dilution – what did Wikipedia redirect? The whole is greater than the sum of its parts
  123. 123. Check Wikipedia redirects for your niche
  124. 124. Wikipedia redirects • Dbo:wikiPageRedirects
  125. 125. Ludwig Van Beethoven
  126. 126. Ludwig Van Beethoven
  127. 127. In theory… the consolidated page should rank higher… but…
  128. 128. Extract the conversations
  129. 129. Throw the words into a word cloud
  130. 130. So the most prominent topics & nuances appear
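A word cloud is essentially a term-frequency count with the stop words removed. The sketch below produces the counts that would drive one; the comments, the short stop list and the `top_terms` helper are invented for the example, and drawing the actual cloud is left to whatever tool you prefer.

```python
import re
from collections import Counter

# Small illustrative stop list; real stop word libraries are far larger
STOP_WORDS = {"the", "is", "in", "a", "or", "my", "has"}

def top_terms(texts, n=5):
    """Count non-stop-word terms across all texts and return the top n."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Hypothetical extracted conversations
comments = [
    "The jaguar is in the garage",
    "Is the jaguar a cat or a car?",
    "My garage has a new jaguar",
]
print(top_terms(comments, n=2))
```

The terms with the highest counts are the ones a word cloud would render largest.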
  131. 131. Watch out for topic dilution / drift in user generated content
  132. 132. Educate crazy taggers but not before you’ve used their topic tags to fix dilution
  133. 133. All the anchors & contextual & navigational internal linking
  134. 134. Even if it is just a breadcrumb trail
  135. 135. Sources and References • Kira Radinsky Tedx Talk - https://www.youtube.com/watch?v=gAifa_CVGCY • Stop Word Library Example - https://sites.google.com/site/kevinbouge/stopwords-lists • Image credit: Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc. • The work of - Radinsky, K., 2012, December. Learning to predict the future using Web knowledge and dynamics. In ACM SIGIR Forum (Vol. 46, No. 2, pp. 114-115). ACM. • http://9ol.es/porter_js_demo.html
  136. 136. Sources and References • Lohar, P., Ganguly, D., Afli, H., Way, A. and Jones, G.J., 2016. FaDA: Fast document aligner using word embedding. The Prague Bulletin of Mathematical Linguistics, 106(1), pp.169-179. • https://en.wikipedia.org/wiki/List_of_locations_in_Australia_with_an_English_name • Barbara Plank | Keynote - Natural Language Processing: - https://www.youtube.com/watch?v=Wl6c0OpF6Ho
  137. 137. Further Reading • https://github.com/Hironsan/awesome-embedding-models • https://nlp.stanford.edu/IR-book/html/htmledition/document-representations-and-measures-of-relatedness-in-vector-spaces-1.html • https://www.youtube.com/watch?time_continue=790&v=wI5O-lYLBCw • https://en.wikipedia.org/wiki/Euclidean_distance • Ibrahim, O.A.S. and Landa-Silva, D., 2016. Term frequency with average term occurrences for textual information retrieval. Soft Computing, 20(8), pp.3045-3061.
  138. 138. Further Reading • Lotfi, A., Bouchachia, H., Gegov, A., Langensiepen, C. and McGinnity, M., 2018. Advances in Computational Intelligence Systems. Intelligence. • https://www.researchgate.net/post/What_is_the_difference_between_TFIDF_and_term_distribution_for_feature_selection • https://radimrehurek.com/gensim/models/word2vec.html • McDonald, R., Brokos, G.I. and Androutsopoulos, I., 2018. Deep relevance ranking using enhanced document-query interactions. arXiv preprint arXiv:1809.01682.
  139. 139. Further Reading • Boyd-Graber, J., Hu, Y. and Mimno, D., 2017. Applications of topic models. Foundations and Trends® in Information Retrieval, 11(2-3), pp.143-296. • https://nlp.stanford.edu/projects/glove/ • https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html • Sherkat, E. and Milios, E.E., 2017, June. Vector embedding of wikipedia concepts and entities. In International conference on applications of natural language to information systems (pp. 418-428). Springer, Cham.
  140. 140. Appendix
  141. 141. Precision and Recall (figure annotations, measured against a GOLD STANDARD) • Lots of results inaccurately deemed highly relevant retrieved • Lots of results inaccurately deemed irrelevant not retrieved • Maybe not enough relevant documents to fetch much here • Lots of documents came back but not enough highly relevant • Many highly relevant docs to meet informational need returned • Results were highly relevant but not enough came back • Maybe being ‘too picky’ • A wide net was cast but not many of the right type of fishes caught • Sources: https://medium.com/@klintcho/explaining-precision-and-recall-c770eb9c69e9 • Manning, C.D. and Schütze, H., 1999. Foundations of statistical natural language processing. MIT press
  142. 142. Canadian towns using English names
  143. 143. US towns using English names
  144. 144. You should be aware ‘indexing’ and ‘ranking’ are two very separate things
  145. 145. Example context window size 3. Source text: “The quick brown fox jumps over the lazy dog”. Training samples • Target ‘the’: (the, quick) (the, brown) (the, fox) • Target ‘quick’: (quick, the) (quick, brown) (quick, fox) (quick, jumps) • Target ‘brown’: etcetera • Target ‘fox’: etcetera
  146. 146. Stemming (Popular stemmer is PorterStemmer) • https://github.com/johnpcarty/mysql-porter-stemmer/blob/master/porterstemmer.sql
      Conditions               | Suffix | Replacement | Examples
      (m>1)                    | al     | NULL        | revival -> reviv
      (m>1)                    | ance   | NULL        | allowance -> allow
      (m>1)                    | ence   | NULL        | inference -> infer
      (m>1)                    | er     | NULL        | airliner -> airlin
      (m>1)                    | ic     | NULL        | gyroscopic -> gyroscop
      (m>1)                    | able   | NULL        | adjustable -> adjust
      (m>1)                    | ible   | NULL        | defensible -> defens
      (m>1)                    | ant    | NULL        | irritant -> irrit
      (m>1)                    | ement  | NULL        | replacement -> replac
      (m>1)                    | ment   | NULL        | adjustment -> adjust
      (m>1)                    | ent    | NULL        | dependent -> depend
      (m>1 and (*<S> or *<T>)) | ion    | NULL        | adoption -> adopt
      (m>1)                    | ou     | NULL        | homologou -> homolog
      (m>1)                    | ism    | NULL        | communism -> commun
      (m>1)                    | ate    | NULL        | activate -> activ
      (m>1)                    | iti    | NULL        | angulariti -> angular
      (m>1)                    | ous    | NULL        | homologous -> homolog
      (m>1)                    | ive    | NULL        | effective -> effect
      (m>1)                    | ize    | NULL        | bowdlerize -> bowdler
  147. 147. WordSim353 Dataset: some words with low similarity or relatedness (Word 1 / Word 2: human mean score)
      king / cabbage: 0.23; professor / cucumber: 0.31; chord / smile: 0.54; noon / string: 0.54; rooster / voyage: 0.62; sugar / approach: 0.88; stock / jaguar: 0.92; stock / life: 0.92; monk / slave: 0.92; lad / wizard: 0.92
      delay / racism: 1.19; stock / CD: 1.31; drink / ear: 1.31; stock / phone: 1.62; holy / sex: 1.62; production / hike: 1.75; precedent / group: 1.77; stock / egg: 1.81; energy / secretary: 1.81; month / hotel: 1.81
      forest / graveyard: 1.85; cup / substance: 1.92; possibility / girl: 1.94; cemetery / woodland: 2.08; glass / magician: 2.08; cup / entity: 2.15; Wednesday / news: 2.22; direction / combination: 2.25
  148. 148. Coast and Shore Example • Coast and shore have a similar meaning • They co-occur in first and second level relatedness documents in a collection • They would receive a high score in similarity
  149. 149. SPARQL Editor & WikiPageRedirects
  150. 150. There are 68 variants for mobile phone redirected on Wikipedia
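To check redirects for your own niche, a query along these lines could be pasted into a SPARQL editor. The sketch only builds the query text and a request URL; it does not call the network. The `dbo:wikiPageRedirects` property comes from the slides, while the public DBpedia endpoint address, the `format` parameter and the `redirects_query` helper are assumptions that should be checked against the service you actually use.

```python
from urllib.parse import urlencode

def redirects_query(resource):
    """Build a SPARQL query listing pages that redirect to a DBpedia resource."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "PREFIX dbr: <http://dbpedia.org/resource/>\n"
        "SELECT ?redirect WHERE {\n"
        f"  ?redirect dbo:wikiPageRedirects dbr:{resource} .\n"
        "}"
    )

query = redirects_query("Mobile_phone")

# Assumed public endpoint and result format parameter; verify before relying on them
url = "https://dbpedia.org/sparql?" + urlencode(
    {"query": query, "format": "application/sparql-results+json"}
)

print(query)
print(url)
```

Each `?redirect` row returned is a term variant that Wikipedia has consolidated into the main page, which is a useful guide for consistent naming.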
  151. 151. Word’s Context
  152. 152. Context window (n = window size = 2) (2 words either side). Source text: “The quick brown fox jumps over the lazy dog”. Training samples • Target ‘the’: (the, quick) (the, brown) • Target ‘quick’: (quick, the) (quick, brown) (quick, fox) • Target ‘brown’: (brown, the) (brown, quick) (brown, fox) (brown, jumps) • Target ‘fox’: (fox, quick) (fox, brown) (fox, jumps) (fox, over)
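The training samples on this slide can be reproduced with a short function that slides a window over the tokens and pairs each target word with its neighbours. It is a minimal sketch of the pair-generation step only (no model is trained), and `skipgram_pairs` is a hypothetical name.

```python
def skipgram_pairs(text, window=2):
    """Pair each target word with every word up to `window` positions away."""
    tokens = text.lower().split()
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not paired with itself
                pairs.append((target, tokens[j]))
    return pairs

pairs = skipgram_pairs("The quick brown fox jumps over the lazy dog", window=2)

# Pairs whose target word is "fox"
print([pair for pair in pairs if pair[0] == "fox"])
```

Changing `window` to 3 or 5 gives the wider context windows shown on the other slides.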
  153. 153. Other areas of IR Including… but not limited to Mobile IR Contextual IR Natural language processing Conversational search Similarity search Recommender systems Image IR Music IR
  154. 154. Recall and Precision in IR • Precision is the share of the results returned for the query that are actually relevant • Recall is the share of all the relevant documents in the collection that were returned for the query
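A worked example may help separate the two measures. In the usual IR definitions, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The doc IDs below are invented and `precision_recall` is a hypothetical helper.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall from retrieved and relevant doc IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 docs returned, 6 docs are relevant per the gold standard
retrieved = [1, 2, 3, 4]
relevant = [2, 4, 5, 6, 7, 8]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here half of what came back is relevant, but only a third of the relevant documents were found, which is the "too picky" situation.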
  155. 155. Query Expansion Example • Likely based on co-occurrence data • Increases recall (more results) but may reduce precision • Source: https://slideplayer.com/slide/13138343/ - Query Expansion and Relevance Feedback
  156. 156. Increase Recall – 2 Main Methods. Query Relaxation (take bits away from the query): • Ignore stop words in query • Relax specificity (remove specific) • Use lexical database (a “knowledge graph”, e.g. Wordnet) to find a more general term (hypernym - superset) • Use Part of Speech Tagger to identify structure of query & expand nouns (things) • Preserve head noun and strip modifiers (e.g. dog) • Use Word2Vec to identify semantics from a vector space using word embeddings from the query – find semantics (related and similar). Query Expansion / Query Rewriting (add bits to the query): • Broaden the query • Expand abbreviations • Stemming and lemmatization (in reverse) • Use Word2Vec to find abbreviations in a vector space using semantic similarity • Use synonyms (same / very similar meanings) • Use minimum cosine similarity from Word2Vec as safety net
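The two methods can be sketched as a pair of small functions: one removes terms from the query (relaxation) and one adds terms to it (expansion). The stop list and the synonym table are hypothetical stand-ins; in practice the synonyms might come from WordNet or from Word2Vec neighbours above a minimum cosine similarity.

```python
# Illustrative stop list and synonym table (both made up for this sketch)
STOP_WORDS = {"a", "an", "the", "for", "in", "of"}
SYNONYMS = {
    "mobile": ["cell"],
    "phone": ["telephone", "handset"],
}

def relax(query):
    """Query relaxation: take bits away by dropping stop words."""
    return [term for term in query.lower().split() if term not in STOP_WORDS]

def expand(terms):
    """Query expansion: add bits by appending known synonyms after each term."""
    expanded = []
    for term in terms:
        if term not in expanded:
            expanded.append(term)
        for synonym in SYNONYMS.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded

terms = relax("the best mobile phone for a student")
print(terms)
print(expand(terms))
```

Both steps tend to raise recall; the expanded query matches more documents, so precision may drop unless the added terms are kept close in meaning.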
  157. 157. Wikipedia redirects • Dbo:wikiPageRedirects
  158. 158. Most popular word embedding tool – probably Word2Vec (new ones are emerging)
  159. 159. Stemming & Lemmatization Stemming • Runs a series of rules to chop known ‘stems’ off the end of words • Suffix stripping algorithm • Often leaves incorrect endings/ crude performance (understemming) • Popular – PorterStemmer (Martin Porter) • Example: “alumnus” -> “alumnu” Lemmatization • Aims to do things properly • Tool from natural language processing • Needs a complete vocabulary and morphological analysis to work well • Also not perfect • Relies on lexical knowledge base like WordNet to correct base form
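The “alumnus” -> “alumnu” example falls out of the very first rule group of the Porter algorithm (step 1a), which only deals with plural-style endings. The sketch below implements that single step and nothing else, so it is not a full Porter stemmer; the function name is made up.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter stemmer (plural-style suffixes)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat, but also alumnus -> alumnu
    return word

for word in ["caresses", "ponies", "caress", "cats", "alumnus"]:
    print(word, "->", porter_step_1a(word))
```

The rule has no idea that “alumnus” is a singular noun, which is exactly the crude behaviour a lemmatizer with a proper vocabulary is meant to avoid.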
  160. 160. Gensim
  161. 161. Part of Speech Tags (Python NLTK Library) • CC coordinating conjunction • CD cardinal digit • DT determiner • EX existential there (like: “there is” … think of it like “there exists”) • FW foreign word • IN preposition/subordinating conjunction • JJ adjective ‘big’ • JJR adjective, comparative ‘bigger’ • JJS adjective, superlative ‘biggest’ • LS list marker 1) • MD modal could, will • NN noun, singular ‘desk’ • NNS noun, plural ‘desks’ • NNP proper noun, singular ‘Harrison’ • NNPS proper noun, plural ‘Americans’ • PDT predeterminer ‘all the kids’ • POS possessive ending parent’s • PRP personal pronoun I, he, she • PRP$ possessive pronoun my, his, hers • RB adverb very, silently • RBR adverb, comparative better • RBS adverb, superlative best • RP particle give up • TO to, go ‘to’ the store • UH interjection errrrrrrrm • VB verb, base form take • VBD verb, past tense took • VBG verb, gerund/present participle taking • VBN verb, past participle taken • VBP verb, sing. present, non-3rd person take • VBZ verb, 3rd person sing. present takes • WDT wh-determiner which • WP wh-pronoun who, what • WP$ possessive wh-pronoun whose • WRB wh-adverb where, when
  162. 162. Stop Word Libraries are Huge
  163. 163. ‘A’ words in an EN stop word list: a a's able about above according accordingly across actually after afterwards again against ain't all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere apart appear appreciate appropriate are aren't around as aside ask asking associated at available away awfully
  164. 164. Anaphora • Anaphora resolution (AR), which most commonly appears as pronoun resolution, is the problem of resolving references to earlier or later items in the discourse. • Example: "John found the love of his life", where 'his' refers to 'John' (easily understood by humans but not so much by machines) • Example and definition from: https://nlp.stanford.edu/courses/cs224n/2003/fp/iqsayed/project_report.pdf
  165. 165. Cataphora – According to Wikipedia In linguistics cataphora is the use of an expression or word that co-refers with a later, more specific, expression in the discourse. EXAMPLE: “When he arrived home, John went to sleep” HE IS JOHN, BUT JOHN WAS NOT KNOWN WHEN ‘HE’ WAS REFERRED TO – SO CAUSES CONFUSION REGARDING WHO ‘HE’ IS
  166. 166. Anaphora and Coreference Resolution • There are some algorithms in place to handle anaphora resolution • In conversational search this still struggles after a few multi-turn questions
  167. 167. NLTK Toolkit Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.
  168. 168. Chunk = Usually chunks of words Token = Usually a word
  169. 169. Digital Marketing Events – Using Tables • https://research.google.com/tables
  170. 170. Information Architecture for the World Wide Web, 3rd Edition by Louis Rosenfeld, Peter Morville
  171. 171. How can all this help you as an SEO? • Consider exploring your topical models by using noindex admin only tag clouds to visualise the topics you see • Utilise relatedness considering words likely in Sim353 or other data sources • Pass supporting topical hints to emphasise the meanings in 1st and 2nd level relatedness • Consider query intent shift and mobile IR / contextual search as niche fields of IR • Utilise semi-structured elements to strengthen unstructured pages in noisy (particularly longer) pages
  172. 172. How can all this help you as an SEO? • Further disambiguation measures on locations used in different countries with same name • Be consistent in your naming conventions. Refer to Wikipedia if in doubt – check their redirects on terms • Other semantic clues for entities with same name – e.g. gender / role / location / geographic clues • Utilise anchors to emphasise further from 2nd level relatedness pages • Utilise co-occurring terms from databases considered similar / connected / related