1. Mining text data for topics
Aka: Unsupervised clustering
mathieu.lacage@alcmeon.com
2. The objective
Input: corpus of text documents
Output:
● List of topics (max 10 to 40)
● Human description for each topic
● Size of each topic
What this talk is about:
● Help you quickly get a rough idea of what this content is about
● No requirement to be a master of deep learning concepts or fancy maths
3. How does this work ?
Text documents → Tokenized documents → Vectorized documents → Document/topic mapping
4. How does this work ?
Text documents → Tokenized documents → Vectorized documents → Document/topic mapping
Input: “hey, how are you?”
Output: [“hey”, “how”, “are”, “you”, “?”]
N documents
5. How does this work ?
Text documents → Tokenized documents → Vectorized documents → Document/topic mapping
Input: [“hey”, “how”, “are”, “you”, “?”]
Output: M-sized vector [0, 0, …, 0, 1, 0, ..., 0, 1, 0, ...]
N documents, M distinct tokens (dictionary size)
6. How does this work ?
Text documents → Tokenized documents → Vectorized documents → Document/topic mapping
Input: N * M matrix: [[0, 0, …, 0, 1, 0, ..., 0, 1, 0, ...], …]
Output: N vector: [2, 4, 2, 1, 8, …, 9, 3, 0]
N documents, M distinct tokens (dictionary size), K topics
7. The code
On github: https://github.com/mathieu-lacage/sophiaconf2018
1. Collect a dataset: do-collect.py -k france
2. Tokenize text: do-tokenize.py --lang=fr
3. Calculate document frequencies: do-df.py --min-df=1
4. Generate document vectors: do-bag-of-words.py --model=boolean
5. Cluster vectors: do-kmeans.py -k 10
6. Visualize the clusters: do-summarize.py
8. Step 1: collect a dataset
do-collect.py -k france
“Sample” Twitter stream:
● 1% of all tweets which contain the word “france”
● run for a couple of hours on June 25th
Be careful:
● Hardcoded Twitter app ids
● Generate your own app ids: https://apps.twitter.com/ !
9. Step 2: tokenize the text input
do-tokenize.py --lang=fr
Depends on language
● “Easy” for English: spaces and hyphens are word boundaries.
● CJKV languages: no spaces (tough).
→ We focus on a “simple” language and an open-source library (NLTK) to sidestep the problem
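A minimal tokenization sketch (an illustration, not the repository script; it assumes NLTK with its Punkt tokenizer data installed):

import nltk

nltk.download('punkt', quiet=True)

def tokenize(text, lang='french'):
    # Punkt-based word_tokenize splits on spaces and punctuation
    return [t.lower() for t in nltk.word_tokenize(text, language=lang)]

print(tokenize("hey, how are you?", lang='english'))
# ['hey', ',', 'how', 'are', 'you', '?']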
10. Step 3: calculate document frequencies
do-df.py --min-df=1
Number of documents which contain each token at least once
Eliminate all tokens which appear only once
Store the total number of documents under a special zero-length string token:
[-1, "", 10842]
[0, "https://t.co/lzpNXIe2if", 1]
11. Step 4: generate document vectors
do-bag-of-words.py --model=boolean
Models
● boolean: the simplest model: 1 if the token is present in the document, 0 otherwise
● tf-idf: more weight for tokens which appear rarely in the corpus
→ we start with the simplest option !
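A sketch of the boolean model (an assumption about what do-bag-of-words.py does, not a copy of it): each document becomes an M-sized 0/1 vector over the dictionary kept by the document-frequency step:

import numpy as np

def boolean_vectors(tokenized_docs, dictionary):
    index = {tok: i for i, tok in enumerate(sorted(dictionary))}
    vectors = np.zeros((len(tokenized_docs), len(index)), dtype=np.float32)
    for row, tokens in enumerate(tokenized_docs):
        for tok in tokens:
            col = index.get(tok)
            if col is not None:
                vectors[row, col] = 1.0   # presence/absence only, no counts
    return vectors

For the tf-idf variant, scikit-learn's TfidfTransformer (on top of counts) or TfidfVectorizer (on raw text) is a common drop-in.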
12. Step 5: Cluster document vectors
do-kmeans.py -k=10
Search for 10 clusters:
● Complexity = O(n^(mk+1)) → hurts
● The MiniBatch option is much faster but less stable numerically
● What you really want is to reduce M (curse of dimensionality)
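A clustering sketch with scikit-learn (the repository script may differ); k is the number of clusters, and the MiniBatch variant trades some stability for speed:

from sklearn.cluster import KMeans, MiniBatchKMeans

def cluster(vectors, k=10, mini_batch=False):
    algo = MiniBatchKMeans if mini_batch else KMeans
    model = algo(n_clusters=k, random_state=0, n_init=10)
    return model.fit_predict(vectors)    # one cluster id per document: the N vector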
13. Step 6: visualize the clusters
do-summarize.py
Keep the tokens where the difference between:
● the frequency of the token in the cluster
● the frequency of the token in the corpus
is highest
→ Inspired by KL divergence
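A summarization sketch under the same assumptions as the earlier snippets (boolean document vectors, a sorted token list): rank tokens by how much more frequent they are inside the cluster than in the whole corpus:

import numpy as np

def summarize(vectors, labels, tokens, top_n=10):
    corpus_freq = vectors.mean(axis=0)              # fraction of documents containing each token
    for cluster_id in np.unique(labels):
        members = vectors[labels == cluster_id]
        cluster_freq = members.mean(axis=0)         # same fraction, within the cluster
        best = np.argsort(cluster_freq - corpus_freq)[::-1][:top_n]
        print(cluster_id, len(members), [tokens[i] for i in best])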
15. Comments
Small clusters are pretty coherent
Big clusters are a mix of lots of small clusters
→ Choosing a good K is crucial !
● Too small: mishmash of topics
● Too big: many small clusters which are all about the same topic
16. Things you could do
1. More/different data
2. Compare accuracy loss of MiniBatchKMeans against kMeans
3. Test other clustering algorithms
4. Better summarization
5. Visualize topic relationships
6. Compare LSA and LDA to Clustering output
7. Automatically pick the number of topics by optimizing the silhouette coefficient (see the sketch after this list)
8. Decrease dimensionality with doc2vec, LSA or LDA to replace bag-of-words
9. ...
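For item 7, a minimal sketch (an assumption, not part of the repo): try several values of K and keep the one with the highest silhouette coefficient:

from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

def pick_k(vectors, candidates=range(5, 41, 5)):
    scores = {}
    for k in candidates:
        labels = MiniBatchKMeans(n_clusters=k, random_state=0).fit_predict(vectors)
        scores[k] = silhouette_score(vectors, labels)
    return max(scores, key=scores.get)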
18. Dimensionality reduction: word2vec
python ./do-word-vector-model.py -d sample-big
mv sample-big-word-vector-model sample-word-vector-model
python ./do-doc2vec.py
“Distributed Representations of Words and Phrases and their Compositionality”, Mikolov et al., 2013
Open source implementation: gensim
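A doc2vec sketch with gensim 4.x (illustrative; do-doc2vec.py may differ): learn a small dense vector per document to replace the sparse M-sized bag-of-words before clustering:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def doc_vectors(tokenized_docs, size=100):
    tagged = [TaggedDocument(words=tokens, tags=[i])
              for i, tokens in enumerate(tokenized_docs)]
    model = Doc2Vec(tagged, vector_size=size, min_count=2, epochs=20)
    return [model.dv[i] for i in range(len(tokenized_docs))]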