
SophiaConf 2018 - M. LACAGE (ALCMEON)

Presentation slides: Mining Text data for Topics
Aka: Unsupervised clustering


  1. 1. Mining text data for topics Aka: Unsupervised clustering mathieu.lacage@alcmeon.com
  2. 2. The objective Input: a corpus of text documents Output: ● List of topics (max 10 to 40) ● Human-readable description for each topic ● Size of each topic What this talk is about: ● Help you quickly get a rough idea of what this content is about ● No requirement to be a master of deep learning concepts or fancy maths
  3. 3. How does this work ? Text documents Tokenized documents Vectorized documents Document/topic mapping
  4. 4. How does this work ? Text documents Tokenized documents Vectorized documents Document/topic mapping Input: “hey, how are you?” Output: [“hey”, “how”, “are”, “you”, “?”] N documents
  5. 5. How does this work ? Text documents Tokenized documents Vectorized documents Document/topic mapping Input: [“hey”, “how”, “are”, “you”, “?”] Output: M-sized vector [0, 0, …, 0, 1, 0, ... , 0, 1, 0, ...] N documents, M distinct tokens (dictionary size)
  6. 6. How does this work ? Text documents Tokenized documents Vectorized documents Document/topic mapping Input: N * M matrix: [[0, 0, …, 0, 1, 0, ... , 0, 1, 0, ...], …] Output: N-sized vector of topic indices: [2, 4, 2, 1, 8, …, 9, 3, 0] N documents, M distinct tokens (dictionary size), K topics
  7. 7. The code On github: https://github.com/mathieu-lacage/sophiaconf2018 1. Collect a dataset do-collect.py -k france 2. Tokenize text do-tokenize.py --lang=fr 3. Calculate document frequencies do-df.py --min-df=1 4. Generate document vectors do-bag-of-words.py --model=boolean 5. Cluster vectors do-kmeans.py -k 10 6. Visualize the clusters do-summarize.py
  8. 8. Step 1: collect a dataset do-collect.py -k france “Sample” Twitter stream: ● 1% of all tweets which contain the word “france” ● ran for a couple of hours on June 25th Be careful: ● Hardcoded Twitter app ids ● Generate your own app ids: https://apps.twitter.com/ ! (see the collection sketch after the slide list)
  9. 9. Step 2: tokenize the text input do-tokenize.py --lang=fr Depends on the language ● “Easy” for English: spaces and hyphens are word boundaries. ● CJKV languages: no spaces (tough) → We focus on a “simple” language and an open-source library (NLTK) to sidestep the problem (see the tokenization sketch after the slide list)
  10. 10. Step 3: calculate document frequencies do-df.py --min-df=1 Count the number of documents which contain each token at least once Eliminate all tokens which appear in only one document Store the number of documents as a special zero-length string token [-1, "", 10842] [0, "https://t.co/lzpNXIe2if", 1] (see the document-frequency sketch after the slide list)
  11. 11. Step 4: generate document vectors do-bag-of-words.py --model=boolean Models ● boolean: the simplest model: 1 if the token is present in the document, 0 otherwise ● tf-idf: more weight for tokens which appear rarely in the corpus → we start with the simplest option ! (see the vectorization sketch after the slide list)
  12. 12. Step 5: cluster document vectors do-kmeans.py -k=10 Search for 10 clusters: ● Complexity = O(n^(mk+1)) → hurts ● The MiniBatch option is much faster but numerically less stable ● What you really want is to reduce M (curse of dimensionality) (see the clustering sketch after the slide list)
  13. 13. Step 6: visualize the clusters do-summarize.py Keep the tokens where the difference between: ● the frequency of the token in the cluster ● the frequency of the token in the corpus is highest → Inspired by KL divergence (see the summarization sketch after the slide list)
  14. 14. Results 0. 3165 MAIS PERSONNE https://t.co/Xg4fOi9Q1c ACCOSTER #TraduisonsLes 1. 2407 prenne égalera battra protéger entrer 2. 255 bousiller travaillé aies gar jaloux 3. 372 262 légaux 3A https://t.co/WyunDG4wLs optim 4. 896 tchadien Tchad zénith annonçons lor 5. 110 traiter https://t.co/zCAlZJjzfX rt pute met 6. 326 GAGNANTS https://t.co/1XGv3j526K PASSE PayPal 7. 2598 Mauvais marquage Archives-Verrerie chuter Générosités 8. 242 https://t.co/byRBwkSa3U Faire l'île 9. 471 altitude giflée bled baisser Francais
  15. 15. Comments Small clusters are pretty coherent Big clusters are a mix of lots of small clusters → Choosing a good K is crucial ! ● Too small: mishmash of topics ● Too big: many small clusters which are all about the same topic
  16. 16. Things you could do 1. More/different data 2. Compare the accuracy loss of MiniBatchKMeans against KMeans 3. Test other clustering algorithms 4. Better summarization 5. Visualize topic relationships 6. Compare LSA and LDA to the clustering output 7. Automatically pick the number of topics by optimizing the silhouette coefficient (see the silhouette sketch after the slide list) 8. Decrease dimensionality with doc2vec, LSA or LDA to replace bag-of-words 9. ...
  17. 17. Questions ?
  18. 18. Dimensionality reduction: word2vec python ./do-word-vector-model.py -d sample-big mv sample-big-word-vector-mode sample-word-vector-model python ./do-doc2vec.py “Distributed Representations of Words and Phrases and their Compositionality” (Mikolov et al., 2013) Open-source implementation: gensim (see the doc2vec sketch after the slide list)
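
The sketches below illustrate the pipeline steps from the slides above. They are minimal approximations written for this transcript, not the code from the github repository, and all helper names are illustrative. Collection sketch (slide 8): one way to pull tweets containing a keyword, assuming tweepy 3.x; the credentials are placeholders and do-collect.py may store richer metadata.

```python
import tweepy

# Placeholder credentials: generate your own app ids at https://apps.twitter.com/
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

class SaveTweets(tweepy.StreamListener):
    def on_status(self, status):
        # One raw text document per line of output
        print(status.text.replace("\n", " "))

# Stream tweets containing the keyword "france" (cf. do-collect.py -k france)
stream = tweepy.Stream(auth=auth, listener=SaveTweets())
stream.filter(track=["france"])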
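
Tokenization sketch (slide 9), assuming NLTK's word_tokenize is the tokenizer behind do-tokenize.py --lang=fr; the repository may apply extra cleanup.

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # one-time download of the tokenizer models

def tokenize(text, lang="french"):
    # Punctuation comes out as separate tokens, e.g.
    # "hey, how are you?" -> ['hey', ',', 'how', 'are', 'you', '?']
    return word_tokenize(text, language=lang)

raw_docs = ["La France joue ce soir !", "hey, how are you?"]  # toy corpus
tokenized_docs = [tokenize(doc) for doc in raw_docs]
```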
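
Document-frequency sketch (slide 10): count, for each token, the number of documents containing it at least once, drop tokens seen in a single document, and keep the corpus size under the special zero-length token. The exact threshold handling in do-df.py is an assumption.

```python
from collections import Counter

def document_frequencies(tokenized_docs, min_df=2):
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # each document counts a token at most once
    # Keep tokens present in at least min_df documents
    kept = {tok: n for tok, n in df.items() if n >= min_df}
    kept[""] = len(tokenized_docs)  # special zero-length token: number of documents
    return kept
```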
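
Vectorization sketch (slide 11), continuing from the previous sketches: the boolean bag-of-words model gives every kept token a column and sets it to 1 where the token is present. The tf-idf variant mentioned on the slide would replace the 1s with weights; scikit-learn's TfidfVectorizer is one off-the-shelf option, though the repository may do it differently.

```python
def build_vocabulary(df):
    # token -> column index, ignoring the special "" corpus-size entry
    return {tok: i for i, tok in enumerate(sorted(t for t in df if t))}

def to_boolean_vector(tokens, vocabulary):
    vec = [0] * len(vocabulary)          # M-sized vector
    for tok in set(tokens):
        idx = vocabulary.get(tok)
        if idx is not None:
            vec[idx] = 1                 # boolean model: presence/absence only
    return vec

vocabulary = build_vocabulary(document_frequencies(tokenized_docs))
vectors = [to_boolean_vector(tokens, vocabulary) for tokens in tokenized_docs]  # N x M
```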
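
Clustering sketch (slide 12), using scikit-learn on the vectors built above; do-kmeans.py may configure things differently. The commented-out MiniBatchKMeans line is the faster, numerically less stable variant mentioned on the slide.

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

X = np.array(vectors)                                    # N x M document-term matrix
labels = KMeans(n_clusters=10, random_state=0).fit_predict(X)
# labels = MiniBatchKMeans(n_clusters=10, random_state=0).fit_predict(X)
# labels is the N-sized document -> topic mapping from slide 6
```

For a real corpus a sparse matrix (scipy.sparse) would be preferable to a dense array, since most of the N x M entries are zero.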
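
Summarization sketch (slide 13): score each token by the difference between its document frequency inside a cluster and in the whole corpus, and keep the highest-scoring tokens as the cluster's description. The exact scoring in do-summarize.py may differ.

```python
from collections import Counter

def describe_cluster(tokenized_docs, labels, cluster_id, top=5):
    corpus_freq = Counter(tok for tokens in tokenized_docs for tok in set(tokens))
    cluster_docs = [toks for toks, lab in zip(tokenized_docs, labels) if lab == cluster_id]
    cluster_freq = Counter(tok for tokens in cluster_docs for tok in set(tokens))
    n, n_c = len(tokenized_docs), len(cluster_docs)
    # Difference between in-cluster frequency and corpus frequency (KL-inspired)
    score = {tok: cluster_freq[tok] / n_c - corpus_freq[tok] / n for tok in cluster_freq}
    top_tokens = sorted(score, key=score.get, reverse=True)[:top]
    return n_c, top_tokens   # cluster size and description, as on the results slide
```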
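
Silhouette sketch (slide 16, item 7): one possible way to pick the number of topics automatically with scikit-learn's silhouette coefficient. This is a suggestion from the slide, not part of the repository.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

def pick_k(X, candidates=range(5, 41, 5)):
    scores = {}
    for k in candidates:
        labels = MiniBatchKMeans(n_clusters=k, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)   # higher is better, in [-1, 1]
    return max(scores, key=scores.get)
```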
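
Doc2vec sketch (slide 18), using gensim, the open-source implementation named on the slide. Parameter values are illustrative, and older gensim versions expose model.docvecs instead of model.dv.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(tokenized_docs)]
model = Doc2Vec(tagged, vector_size=100, min_count=2, epochs=20)
# Dense, low-dimensional document vectors replacing the M-sized bag-of-words vectors
dense_vectors = [model.dv[i] for i in range(len(tagged))]
```

Clustering these dense vectors instead of the bag-of-words matrix is one way to attack the curse-of-dimensionality problem mentioned on slide 12.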
