Mining text data for topics
Aka: Unsupervised clustering
mathieu.lacage@alcmeon.com
The objective
Input: a corpus of text documents
Output:
● List of topics (max 10 to 40)
● Human description for each topic
● Size of each topic
What this talk is about:
● Help you quickly get a rough idea of what a body of text is about
● No need to master deep learning concepts or fancy maths
How does this work ?
Text documents → Tokenized documents → Vectorized documents → Document/topic mapping
Tokenization: text documents → tokenized documents
Input: “hey, how are you?”
Output: [“hey”, “how”, “are”, “you”, “?”]
N documents
Vectorization: tokenized documents → vectorized documents
Input: [“hey”, “how”, “are”, “you”, “?”]
Output: M-sized vector [0, 0, …, 0, 1, 0, ... , 0, 1, 0, ...]
N documents, M distinct tokens (dictionary size)
Clustering: vectorized documents → document/topic mapping
Input: N * M matrix: [[0, 0, …, 0, 1, 0, ... , 0, 1, 0, ...], …]
Output: N vector: [2, 4, 2, 1, 8, …, 9, 3, 0]
N documents, M distinct tokens (dictionary size), K topics
The code
On github: https://github.com/mathieu-lacage/sophiaconf2018
1. Collect a dataset: do-collect.py -k france
2. Tokenize text: do-tokenize.py --lang=fr
3. Calculate document frequencies: do-df.py --min-df=1
4. Generate document vectors: do-bag-of-words.py --model=boolean
5. Cluster vectors: do-kmeans.py -k 10
6. Visualize the clusters: do-summarize.py
Step 1: collect a dataset
do-collect.py -k france
“Sample” Twitter stream:
● 1% of all tweets which contain the word “france”
● ran a couple of hours on June 25th
Be careful:
● Hardcoded twitter app ids
● Generate your own app ids: https://apps.twitter.com/ !
Step 2: tokenize the text input
do-tokenize.py --lang=fr
Depends on language
● “Easy” for English: spaces and hyphens are word boundaries.
● CJKV languages: no spaces between words (tough).
→ We focus on a “simple” language and an open-source library (NLTK) to sidestep the problem
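A minimal sketch of the tokenization step, assuming a regex-based splitter rather than the NLTK tokenizer the talk relies on. The `TOKEN_RE` pattern and the `tokenize` name are hypothetical, not taken from do-tokenize.py.

```python
import re

# Hypothetical stand-in for the NLTK tokenizer used in the talk:
# a word is a run of letters/digits (apostrophes and hyphens allowed inside),
# and every other non-space character becomes its own token.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")

def tokenize(text):
    """Split one document into a list of tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("hey, how are you?"))
```

Unlike the example on the slide, this sketch keeps the comma as a token, and it would split URLs into pieces. A tweet-aware tokenizer avoids both, which is one reason to prefer a library here.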
Step 3: calculate document frequencies
do-df.py --min-df=1
Number of documents which contain each token at least once
Eliminate all tokens which appear only once
Store the total number of documents under a special zero-length string token
[-1, "", 10842]
[0, "https://t.co/lzpNXIe2if", 1]
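The counting itself can be sketched as follows. This is an illustration, not the code of do-df.py: the `document_frequencies` name is hypothetical, and `min_df` is assumed to mean "keep tokens found in at least `min_df` documents", which may differ from the script's `--min-df` flag.

```python
from collections import Counter

def document_frequencies(tokenized_docs, min_df=1):
    """Count, for each token, the number of documents that contain it."""
    df = Counter()
    for tokens in tokenized_docs:
        # set() so a token repeated inside one document counts only once
        df.update(set(tokens))
    # assumption: keep tokens seen in at least min_df documents
    kept = {tok: n for tok, n in df.items() if n >= min_df}
    # the empty-string key stores the corpus size, like the "" entry on the slide
    kept[""] = len(tokenized_docs)
    return kept

docs = [["vive", "la", "france"], ["la", "france", "gagne"], ["bonjour", "la", "la"]]
df = document_frequencies(docs, min_df=2)
print(df)
```

Storing the corpus size under the empty string keeps everything in one table, since no real token can be empty.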
Step 4: generate document vectors
do-bag-of-words.py --model=boolean
Models
● boolean: the simplest model: 1 if token is present in document, 0 otherwise
● tf-idf: More weight for tokens which appear rarely in corpus
→ we start with the simplest option !
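Both models can be sketched in a few lines. This assumes a document-frequency table shaped like the one from step 3. The function names are hypothetical, and the tf-idf weighting shown (term count times log(N / df)) is one common variant, not necessarily the one in do-bag-of-words.py.

```python
import math

def build_vocabulary(df):
    """Map each kept token to a column index (the "" corpus-size entry is skipped)."""
    return {tok: i for i, tok in enumerate(sorted(t for t in df if t != ""))}

def boolean_vector(tokens, vocab):
    """1 if the token is present in the document, 0 otherwise."""
    vec = [0] * len(vocab)
    for tok in set(tokens):
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec

def tfidf_vector(tokens, vocab, df, n_docs):
    """Term count times log(n_docs / df): rare tokens weigh more."""
    vec = [0.0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += math.log(n_docs / df[tok])
    return vec

# toy document-frequency table; "" holds the number of documents
df = {"france": 2, "gagne": 1, "la": 3, "": 3}
vocab = build_vocabulary(df)
doc = ["la", "france", "la"]
print(boolean_vector(doc, vocab))
print(tfidf_vector(doc, vocab, df, n_docs=df[""]))
```

In the toy corpus "la" appears in every document, so its tf-idf weight is zero even though it occurs twice. The boolean model cannot make that distinction.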
Step 5: Cluster document vectors
do-kmeans.py -k=10
Search 10 clusters:
● Complexity = O(n^(mk+1)) for an exact solution (n documents, m dimensions, k clusters) → hurts
● MiniBatch option is much faster but less stable numerically
● What you really want is to reduce M (curse of dimensionality)
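In practice k-means is run with Lloyd's iterative heuristic rather than an exact solver. Below is a minimal pure-Python sketch of that heuristic, not the code of do-kmeans.py. The farthest-point initialisation is an assumption chosen to keep the example deterministic, and all names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, max_iterations=50):
    """Lloyd's algorithm: returns one cluster label per input vector."""
    # assumption: deterministic farthest-point initialisation
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors,
                       key=lambda v: min(squared_distance(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = None
    for _ in range(max_iterations):
        # assignment step: nearest centroid for every vector
        new_labels = [min(range(k), key=lambda c: squared_distance(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:
            break  # converged: no vector changed cluster
        labels = new_labels
        # update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# six boolean document vectors over a four-token dictionary
vectors = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0],
           [0, 0, 1, 1], [0, 0, 1, 0], [0, 0, 0, 1]]
print(kmeans(vectors, k=2))  # → [0, 0, 0, 1, 1, 1]
```

Each iteration costs on the order of N * K * M operations, which is why shrinking M pays off directly.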
Step 6: visualize the clusters
do-summarize.py
Keep the tokens for which the difference between these two values is highest:
● Frequency of the token in the cluster
● Frequency of the token in the corpus
→ Inspired by KL divergence
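A sketch of this scoring, assuming "frequency" means the fraction of documents that contain the token. The slide does not say which definition do-summarize.py uses, and the function names are hypothetical.

```python
from collections import Counter

def token_frequencies(docs):
    """Fraction of the documents in `docs` that contain each token."""
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))
    return {tok: n / len(docs) for tok, n in counts.items()}

def summarize(docs, labels, cluster, top=5):
    """Tokens most over-represented in `cluster` compared with the whole corpus."""
    corpus_freq = token_frequencies(docs)
    members = [d for d, label in zip(docs, labels) if label == cluster]
    cluster_freq = token_frequencies(members)
    # score = frequency in cluster minus frequency in corpus
    scores = {tok: f - corpus_freq[tok] for tok, f in cluster_freq.items()}
    # highest score first; ties broken alphabetically for stable output
    return sorted(scores, key=lambda t: (-scores[t], t))[:top]

docs = [["france", "foot"], ["france", "foot", "but"],
        ["france", "tchad"], ["france", "tchad", "zénith"]]
labels = [0, 0, 1, 1]
print(summarize(docs, labels, cluster=1, top=2))  # → ['tchad', 'zénith']
```

"france" appears in every document, so it scores zero in every cluster and never reaches a summary, which is the desired effect for a corpus collected on that keyword.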
Results
0. 3165 MAIS PERSONNE https://t.co/Xg4fOi9Q1c ACCOSTER #TraduisonsLes
1. 2407 prenne égalera battra protéger entrer
2. 255 bousiller travaillé aies gar jaloux
3. 372 262 légaux 3A https://t.co/WyunDG4wLs optim
4. 896 tchadien Tchad zénith annonçons lor
5. 110 traiter https://t.co/zCAlZJjzfX rt pute met
6. 326 GAGNANTS https://t.co/1XGv3j526K PASSE PayPal
7. 2598 Mauvais marquage Archives-Verrerie chuter Générosités
8. 242 https://t.co/byRBwkSa3U Faire l'île
9. 471 altitude giflée bled baisser Francais
Comments
Small clusters are pretty coherent
Big clusters are a mix of lots of small clusters
→ Choosing a good K is crucial !
● Too small: mishmash of topics
● Too big: many small clusters which are all about the same topic
Things you could do
1. More/different data
2. Compare accuracy loss of MiniBatchKMeans against kMeans
3. Test other clustering algorithms
4. Better summarization
5. Visualize topic relationships
6. Compare LSA and LDA to Clustering output
7. Automatically pick number of topics by optimizing for silhouette coefficient
8. Decrease dimensionality with doc2vec, LSA or LDA to replace bag-of-words
9. ...
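For item 7, the silhouette coefficient can be computed directly from the vectors and their cluster labels. The sketch below is a plain-Python illustration under stated assumptions (Euclidean distance, at least two clusters, hypothetical function names). It is quadratic in the number of documents, so a library implementation is preferable on a real corpus.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette(vectors, labels):
    """Mean silhouette coefficient; assumes at least two distinct labels."""
    clusters = {}
    for v, label in zip(vectors, labels):
        clusters.setdefault(label, []).append(v)
    total = 0.0
    for v, label in zip(vectors, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(v, w) for w in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(euclidean(v, w) for w in other) / len(other)
                for other_label, other in clusters.items() if other_label != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(vectors)

points = [[0.0], [1.0], [10.0], [11.0]]
good = silhouette(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette(points, [0, 1, 0, 1])   # clusters that interleave
print(good, bad)
```

Picking K would then mean running the clustering for several values of K and keeping the one with the highest score.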
Questions ?
Dimensionality reduction: word2vec
python ./do-word-vector-model.py -d sample-big
mv sample-big-word-vector-mode sample-word-vector-model
python ./do-doc2vec.py
“Distributed Representations of Words and Phrases and their Compositionality”, 2013
Open source implementation: gensim

SophiaConf 2018 - M. LACAGE (ALCMEON)
