1. Mining text data for topics
Aka: Unsupervised clustering
mathieu.lacage@alcmeon.com
2. The objective
Input: corpus of text documents
Output:
● List of topics (max 10 to 40)
● Human description for each topic
● Size of each topic
What this talk is about:
● Help you quickly get a rough idea of what this content is about
● No requirement to be a master of deep learning concepts or fancy maths
3. How does this work ?
Text documents → Tokenized documents → Vectorized documents → Document/topic mapping
4. How does this work ?
Text documents → Tokenized documents → Vectorized documents → Document/topic mapping
Input: “hey, how are you?”
Output: [“hey”, “how”, “are”, “you”, “?”]
N documents
5. How does this work ?
Text documents → Tokenized documents → Vectorized documents → Document/topic mapping
Input: [“hey”, “how”, “are”, “you”, “?”]
Output: M-sized vector [0, 0, …, 0, 1, 0, ..., 0, 1, 0, ...]
N documents, M distinct tokens (dictionary size)
6. How does this work ?
Text documents → Tokenized documents → Vectorized documents → Document/topic mapping
Input: N * M matrix: [[0, 0, …, 0, 1, 0, ..., 0, 1, 0, ...], …]
Output: N vector: [2, 4, 2, 1, 8, …, 9, 3, 0]
N documents, M distinct tokens (dictionary size), K topics
7. The code
On github: https://github.com/mathieu-lacage/sophiaconf2018
1. Collect a dataset: do-collect.py -k france
2. Tokenize text: do-tokenize.py --lang=fr
3. Calculate document frequencies: do-df.py --min-df=1
4. Generate document vectors: do-bag-of-words.py --model=boolean
5. Cluster vectors: do-kmeans.py -k 10
6. Visualize the clusters: do-summarize.py
8. Step 1: collect a dataset
do-collect.py -k france
“Sample” Twitter stream:
● 1% of all tweets which contain the word “france”
● run for a couple of hours on June 25th
Be careful:
● Hardcoded Twitter app ids
● Generate your own app ids: https://apps.twitter.com/ !
9. Step 2: tokenize the text input
do-tokenize.py --lang=fr
Depends on language
● “Easy” for English: spaces and hyphens are word boundaries.
● CJKV languages: no spaces (tough).
→ We focus on a “simple” language and an open-source library (NLTK) to sidestep the problem
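A minimal tokenization sketch (an illustration, not the repository script; it assumes NLTK with its Punkt tokenizer data installed):

import nltk

nltk.download('punkt', quiet=True)

def tokenize(text, lang='french'):
    # Punkt-based word_tokenize splits on spaces and punctuation
    return [t.lower() for t in nltk.word_tokenize(text, language=lang)]

print(tokenize("hey, how are you?", lang='english'))
# ['hey', ',', 'how', 'are', 'you', '?']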
10. Step 3: calculate document frequencies
do-df.py --min-df=1
Number of documents which contain each token at least once
Eliminate all tokens which appear only once
Store the total number of documents under a special zero-length string token:
[-1, "", 10842]
[0, "https://t.co/lzpNXIe2if", 1]
11. Step 4: generate document vectors
do-bag-of-words.py --model=boolean
Models
● boolean: the simplest model: 1 if the token is present in the document, 0 otherwise
● tf-idf: more weight for tokens which appear rarely in the corpus
→ we start with the simplest option !
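A sketch of the boolean model (an assumption about what do-bag-of-words.py does, not a copy of it): each document becomes an M-sized 0/1 vector over the dictionary kept by the document-frequency step:

import numpy as np

def boolean_vectors(tokenized_docs, dictionary):
    index = {tok: i for i, tok in enumerate(sorted(dictionary))}
    vectors = np.zeros((len(tokenized_docs), len(index)), dtype=np.float32)
    for row, tokens in enumerate(tokenized_docs):
        for tok in tokens:
            col = index.get(tok)
            if col is not None:
                vectors[row, col] = 1.0   # presence/absence only, no counts
    return vectors

For the tf-idf variant, scikit-learn's TfidfTransformer (on top of counts) or TfidfVectorizer (on raw text) is a common drop-in.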
12. Step 5: Cluster document vectors
do-kmeans.py -k=10
Search for 10 clusters:
● Complexity = O(n^(mk+1)) → hurts
● The MiniBatch option is much faster but less stable numerically
● What you really want is to reduce M (curse of dimensionality)
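A clustering sketch with scikit-learn (the repository script may differ); k is the number of clusters, and the MiniBatch variant trades some stability for speed:

from sklearn.cluster import KMeans, MiniBatchKMeans

def cluster(vectors, k=10, mini_batch=False):
    algo = MiniBatchKMeans if mini_batch else KMeans
    model = algo(n_clusters=k, random_state=0, n_init=10)
    return model.fit_predict(vectors)    # one cluster id per document: the N vector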
13. Step 6: visualize the clusters
do-summarize.py
Keep the tokens where the difference between:
● the frequency of the token in the cluster
● the frequency of the token in the corpus
is highest
→ Inspired by KL divergence
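A summarization sketch under the same assumptions as the earlier snippets (boolean document vectors, a sorted token list): rank tokens by how much more frequent they are inside the cluster than in the whole corpus:

import numpy as np

def summarize(vectors, labels, tokens, top_n=10):
    corpus_freq = vectors.mean(axis=0)              # fraction of documents containing each token
    for cluster_id in np.unique(labels):
        members = vectors[labels == cluster_id]
        cluster_freq = members.mean(axis=0)         # same fraction, within the cluster
        best = np.argsort(cluster_freq - corpus_freq)[::-1][:top_n]
        print(cluster_id, len(members), [tokens[i] for i in best])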
15. Comments
Small clusters are pretty coherent
Big clusters are a mix of lots of small clusters
→ Choosing a good K is crucial !
● Too small: mishmash of topics
● Too big: many small clusters which are all about the same topic
16. Things you could do
1. More/different data
2. Compare accuracy loss of MiniBatchKMeans against kMeans
3. Test other clustering algorithms
4. Better summarization
5. Visualize topic relationships
6. Compare LSA and LDA to Clustering output
7. Automatically pick the number of topics by optimizing the silhouette coefficient (see the sketch after this list)
8. Decrease dimensionality with doc2vec, LSA or LDA to replace bag-of-words
9. ...
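For item 7, a minimal sketch (an assumption, not part of the repo): try several values of K and keep the one with the highest silhouette coefficient:

from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

def pick_k(vectors, candidates=range(5, 41, 5)):
    scores = {}
    for k in candidates:
        labels = MiniBatchKMeans(n_clusters=k, random_state=0).fit_predict(vectors)
        scores[k] = silhouette_score(vectors, labels)
    return max(scores, key=scores.get)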
18. Dimensionality reduction: word2vec
python ./do-word-vector-model.py -d sample-big
mv sample-big-word-vector-model sample-word-vector-model
python ./do-doc2vec.py
“Distributed Representations of Words and Phrases and their Compositionality”, Mikolov et al., 2013
Open source implementation: gensim
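A doc2vec sketch with gensim 4.x (illustrative; do-doc2vec.py may differ): learn a small dense vector per document to replace the sparse M-sized bag-of-words before clustering:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def doc_vectors(tokenized_docs, size=100):
    tagged = [TaggedDocument(words=tokens, tags=[i])
              for i, tokens in enumerate(tokenized_docs)]
    model = Doc2Vec(tagged, vector_size=size, min_count=2, epochs=20)
    return [model.dv[i] for i in range(len(tokenized_docs))]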