Explore detailed Topic Modeling via LDA Laten Dirichlet Allocation and their steps.
Thanks, for your time, if you enjoyed this short video there are tons of topics in advanced analytics, data science, and machine learning available in my medium repo. https://medium.com/@bobrupakroy
2. Topic Modeling
Topic modeling: is technique to uncover the underlying topic from the
document, in simple words it helps to identify what the document is
talking about, the important topics in the article.
Types of Topic Models
1) Latent Semantic Indexing (LSI)
2) Laten Dirichlet Allocation (LDA)
3) Probalistic Latent Semantic Indexing (PLSI)
Document topic words
Rupak Roy
3. Topic Modeling - LDA
Topics Technology Healthcare Business
%topics in the
documents 30 % 60% 17%
Bag of words Google, Dell Radiology, Transactions,
Apple, Microsoft Diagnose Bank, Cost
DOCUMENT
Behind LDA
Topic 1: Technology: Google, Dell, Apple, Microsoft
Topic 2: Healthcare: Radiology, Diagnose, Ct Scan
Topic 3: Business: Transactions, Banks, Cost.
Rupak Roy
4. Topic Modeling
How often does “Diagnose appear in topic Healthcare ?
If the ‘Diagnose’ word often occurs in the Topic Healthcare, then this
instance of ‘Diagnose’ might belong to the topic Healthcare.
Now how common is the topic healthcare in the rest of the document?
This is actually similar to Bayes theorem.
To find the probability of possible topic T
Multiply the frequency of the word type W in T by the number of other
words in document D that already belong to T
Therefore the output is
The probability that this word came from topic T=>
=> P(TW,D) = )words W in the topic T/words in the document )* words
in D that belong to T
Rupak Roy
6. Topic Modeling - LDA
#Create a Document Term Matrix
matrix= create_matrix(cbind(as.vector(tweets1$airline),as.vector(tweets1$tweets)),
language="english",removeNumbers=TRUE, removePunctuation=TRUE,
removeSparseTerms=0,
removeStopwords=TRUE, stripWhitespace=TRUE, toLower=TRUE)
inspect(tweets.corpus[1:5])
#Choose the number of topics
k<- 15
#Split the Data into training and testing
#We will take a small subset of data
train <- matrix[1:500,]
test <- matrix[501:750,]
#train <- matrix[1:10248,]
#test <- matrix[10249:1460,]
Rupak Roy
7. Topic Modeling - LDA
#Build the model on train data
train.lda <- LDA(train,k)
topics<-get_topics(train.lda,5)
View(topics)
#by default it gives the highest topic with the document
terms<-get_terms(train.lda,5)
View(terms)
#by default it gives the most highly probable word in each topic
#Get the top topics
train.topics <- topics(train.lda)
#Test the model
test.topics <- posterior(train.lda,test)
test.topics$topics[1:10,1:15]
#[row, number of topics(upto 15topics)that is the value of K =15]
test.topics <- apply(test.topics$topics, 1, which.max)
#gives topic with highest probability
Rupak Roy
8. Topic Modeling - LDA
#Join the predicted Topic number to the original test Data
test1<-tweets[501:750,]
final<-data.frame(Title=test1$airline,Subject=test1$text,
Pred.topic=test.topics)
View(final)
table(final$Pred.topic)
#View each topic
View(final[final$Pred.topic==10,])
Rupak Roy
9. Topic Modeling - LDA
#---------------Another method to get the optimal number of topics ---------#
library(topicmodel)
best.model <- lapply(seq(2,20, by=1), function(k){LDA(matrix,k)})
#seq(2,20) refers range of K values
best_model<- as.data.frame(as.matrix(lapply(best.model, logLik)))
#one of the methods to measure the performance is loglikehood & to find out
#whether a model is good model or average model or bad model based on the
parameter model uses.
final_best_model <- data.frame(topics=c(seq(2,20, by=1)),
log_likelihood=as.numeric(as.matrix(best_model)))
#The higher the loglikelihood the better the model.
#finds out ideal topic for every doc
head(final_best_model)
library(ggplot2)
with(final_best_model,qplot(topics,log_likelihood,color="red"))
#the higher the likelihood value in the graph better the topics are.
Rupak Roy
10. Topic Modeling - LDA
#Get the best value from the graph
k=final_best_model[which.max(final_best_model$log_likelihood),1]
cat("Best topic number k=",k)
Rupak Roy
11. Steps Topic Modeling
1) Data
2) Create TDM
3) Choose number of topics (K)
4) Divide the data into train & test
5) Building model on train data
6) Get the topic
7) Test the model
8) Joining the predicted Topic Number to the original dataset
9) Analyize
Rupak Roy