This document proposes a new variational autoencoder (VAE) approach for topic modeling that addresses the issue of latent variable collapse. The proposed VAE models each word token separately using a context-dependent sampling approach. It minimizes a KL divergence term not considered in previous VAEs for topic modeling. An experiment on four large datasets found the proposed VAE improved over existing VAEs for about half the datasets in terms of perplexity or normalized pairwise mutual information.
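As a rough illustration of the token-wise, context-dependent sampling idea described above, the sketch below draws one relaxed (Gumbel-softmax) topic sample per word token from logits that depend on a document-level context vector. This is not the paper's actual architecture: the layer sizes, the way the context vector is built, and all names are assumptions made purely for illustration.

```python
# Illustrative sketch only (assumed architecture, not the paper's model):
# per-token topic logits conditioned on a document context vector, sampled
# with the Gumbel-softmax relaxation.
import torch
import torch.nn.functional as F

K, H, E = 50, 128, 100                     # topics, context dim, embedding dim (assumed)
embed = torch.nn.Embedding(5000, E)        # word embeddings (toy vocabulary of 5000)
to_context = torch.nn.Linear(E, H)         # document context from the mean embedding
to_logits = torch.nn.Linear(H + E, K)      # per-token topic logits

def token_wise_samples(word_ids, temperature=0.5):
    vecs = embed(word_ids)                          # (n_tokens, E)
    context = torch.tanh(to_context(vecs.mean(0)))  # document-level context vector
    logits = to_logits(torch.cat([context.expand(vecs.size(0), -1), vecs], -1))
    # one relaxed (Gumbel-softmax) topic sample per token
    return F.gumbel_softmax(logits, tau=temperature, hard=False)

samples = token_wise_samples(torch.tensor([3, 17, 250, 3]))
print(samples.shape)                       # (4, K): one soft topic sample per token
```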
1. The document proposes Granulated LDA (GLDA), a regularized version of LDA, to improve topic modeling stability.
2. It introduces measures like Kullback-Leibler divergence and Jaccard coefficient to evaluate topic similarity and modeling stability across runs.
3. An experiment applies LDA, SLDA, and GLDA to a large Russian text corpus, finding that GLDA produces more stable topics across multiple runs according to these measures.
Topic models are probabilistic models for discovering the underlying semantic structure of a document collection based on a hierarchical Bayesian analysis. Latent Dirichlet allocation (LDA) is a commonly used topic model that represents documents as mixtures of topics and topics as distributions over words. LDA uses Gibbs sampling to estimate the posterior distribution over topic assignments given the words in each document.
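For readers unfamiliar with the Gibbs sampling mentioned here, the following is a minimal collapsed Gibbs sampler for LDA. The interface, hyperparameter values, and variable names are illustrative assumptions, not anything taken from the summarized slides.

```python
# Minimal collapsed Gibbs sampling sketch for LDA (illustrative only).
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of lists of word ids in [0, V); returns topic-word counts."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))        # document-topic counts
    n_kw = np.zeros((K, V))                # topic-word counts
    n_k = np.zeros(K)                      # topic totals
    z = [rng.integers(K, size=len(d)) for d in docs]
    for d, doc in enumerate(docs):         # initialize counts from random assignments
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # full conditional p(z_i = k | everything else)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                # add the new assignment
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_kw
```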
The document describes an interactive Latent Dirichlet Allocation (LDA) model that allows users to provide feedback to guide the topic modeling process. It summarizes previous work using constraints to encode feedback. It then introduces an approach using variational EM for LDA that allows modifying the topic distributions between epochs based on user feedback, such as removing words from topics, deleting topics, merging topics, or splitting topics. The interactive LDA approach alternates between running LDA to convergence and applying user updates to the topic distributions.
This is an introduction to Topic Modeling, covering tf-idf, LSA, pLSA, LDA, EM, and some other related material. There are surely some mistakes; corrections are welcome. Thank you~
This document provides an overview of topic modeling. It defines topic modeling as discovering the thematic structure of a corpus by modeling relationships between words and documents through learned topics. The document introduces Latent Dirichlet Allocation (LDA) as a widely used topic modeling technique. It outlines LDA's generative process and inference methods like Gibbs sampling and variational inference. The document also discusses extensions to LDA, evaluation strategies, open questions, and applications like topic labeling and browsing.
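The generative process outlined above can be made concrete with a short simulation: draw topics, then per-document topic proportions, then words. The number of topics, vocabulary size, document lengths, and Dirichlet parameters below are arbitrary toy values.

```python
# Toy simulation of LDA's generative process (all constants are assumptions).
import numpy as np

rng = np.random.default_rng(0)
K, V, D, doc_len, alpha, eta = 3, 20, 5, 30, 0.5, 0.1

# 1. Draw a word distribution (topic) beta_k ~ Dirichlet(eta) for each topic.
beta = rng.dirichlet([eta] * V, size=K)
documents = []
for _ in range(D):
    # 2. Draw topic proportions theta_d ~ Dirichlet(alpha) for the document.
    theta = rng.dirichlet([alpha] * K)
    words = []
    for _ in range(doc_len):
        z = rng.choice(K, p=theta)          # 3a. draw a topic assignment
        w = rng.choice(V, p=beta[z])        # 3b. draw a word from that topic
        words.append(w)
    documents.append(words)
print(documents[0])
```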
Probabilistic information retrieval models & systems - Selman Bozkır
The document discusses probabilistic information retrieval and Bayesian approaches. It introduces concepts like conditional probability, Bayes' theorem, and the probability ranking principle. It explains how probabilistic models estimate the probability of relevance between a document and query by representing them as term sets and making probabilistic assumptions. The goal is to rank documents by the probability of relevance to present the most likely relevant documents first.
This document discusses topic models, LDA, and related concepts. It begins with an overview of LDA and how it uses a graphical model approach for unsupervised learning. Various inference methods for LDA are discussed, including variational inference and Gibbs sampling. The document also covers extensions like correlated topic models and dynamic topic models, as well as applications and researchers in the field. Key concepts covered include posterior approximation, sampling, variational methods, and optimization.
The document discusses various techniques for information retrieval and language modeling approaches to IR, including:
- Clustering documents into similar groups to aid in retrieval
- Using term frequency-inverse document frequency (TF-IDF) to measure word importance in documents
- Language models that represent documents and queries as probability distributions over words
- Smoothing language models to address data sparsity issues (a small query-likelihood smoothing sketch follows this list)
- Cluster-based scoring methods that incorporate information from query-relevant document clusters
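As a small illustration of the language-modeling and smoothing items above, the sketch below scores a query against a document with query likelihood and Jelinek-Mercer smoothing. The interpolation weight and the function interface are assumptions made for illustration.

```python
# Minimal query-likelihood scoring with Jelinek-Mercer smoothing (illustrative).
from collections import Counter
import math

def ql_score(query_terms, doc_terms, collection_terms, lam=0.5):
    """log p(query | doc) with JM smoothing: (1-lam)*p_ml(w|d) + lam*p(w|C)."""
    d_counts, c_counts = Counter(doc_terms), Counter(collection_terms)
    d_len, c_len = len(doc_terms), len(collection_terms)
    score = 0.0
    for w in query_terms:
        p_doc = d_counts[w] / d_len if d_len else 0.0
        p_col = c_counts[w] / c_len if c_len else 0.0
        p = (1 - lam) * p_doc + lam * p_col
        score += math.log(p) if p > 0 else float("-inf")
    return score
```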
The document discusses probabilistic retrieval models in information retrieval. It provides an overview of older models like Boolean retrieval and vector space models. The main focus is on probabilistic models like BM25 and language models. It explains key concepts in probabilistic IR like the probability ranking principle, using Bayes' rule to estimate the probability that a document is relevant given features of the document, and estimating probabilities based on the frequencies of terms in relevant documents. The goal is to rank documents based on the probability of relevance to the query.
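To make the BM25 model mentioned above concrete, here is a minimal scoring sketch using a common parameterization (k1 = 1.2, b = 0.75). The exact idf variant and the default parameter values are assumptions, since the document does not specify them.

```python
# Minimal BM25 scoring sketch (parameter values and idf variant are assumed).
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """doc_freq[w]: number of documents containing w."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for w in query_terms:
        df = doc_freq.get(w, 0)
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        denom = tf[w] + k1 * (1 - b + b * dl / avg_doc_len)
        score += idf * tf[w] * (k1 + 1) / denom if denom > 0 else 0.0
    return score
```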
This document discusses incorporating probabilistic retrieval knowledge into TFIDF-based search engines. It provides an overview of different retrieval models such as Boolean, vector space, probabilistic, and language models. It then describes using a probabilistic model that estimates the probability of a document being relevant or non-relevant given its terms. This model can be combined with the BM25 ranking algorithm. The document proposes applying probabilistic knowledge to different document fields during ranking to improve relevance.
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red... - Aaron Li
Video (2014): http://videolectures.net/kdd2014_li_sampling_complexity/
This paper presents an approximate sampler for topic models that theoretically and experimentally outperforms existing samplers thereby allowing topic models to scale to industry-scale datasets.
This document provides an overview of the Introduction to Algorithms course, including the course modules and motivating problems. It introduces the Document Distance problem, which aims to define metrics to measure the similarity between documents based on word frequencies. It discusses an initial Python program ("docdist1.py") to calculate document distance that runs inefficiently due to quadratic time list concatenation. Profiling identifies this as the bottleneck. The solution is to use list extension, resulting in "docdist3.py". Further optimizations include using a dictionary to count word frequencies in constant time, creating "docdist4.py". The document outlines remaining opportunities like improving the word extraction and sorting algorithms.
This document discusses vector space retrieval models. It describes how documents and queries are represented as vectors in a common vector space based on terms. Terms are weighted using metrics like term frequency (TF) and inverse document frequency (IDF) to determine importance. The cosine similarity measure is used to calculate similarity between document and query vectors and rank results by relevance. While simple and effective in practice, vector space models have limitations like missing semantic and syntactic information.
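A compact example of the vector space ranking described above, using TF-IDF weighting and cosine similarity. The toy corpus and the use of scikit-learn are illustrative assumptions.

```python
# TF-IDF vectors plus cosine similarity ranking (toy corpus, illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat", "the dog sat on the log"]
query = ["cat on a mat"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)       # one row per document
query_vector = vectorizer.transform(query)         # reuse the same vocabulary

# Rank documents by cosine similarity to the query.
scores = cosine_similarity(query_vector, doc_vectors).ravel()
ranking = scores.argsort()[::-1]
print(list(zip(ranking, scores[ranking])))
```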
In this natural language understanding (NLU) project, we implemented and compared various approaches for predicting the topics of paragraph-length texts. This paper explains our methodology and results for the following approaches: Naive Bayes, One-vs-Rest Support Vector Machine (OvR SVM) with GloVe vectors, Latent Dirichlet Allocation (LDA) with OvR SVM, Convolutional Neural Networks (CNN), and Long Short Term Memory networks (LSTM).
This document summarizes a presentation on using string kernels for text classification. It introduces text classification and the challenge of representing text documents as feature vectors. It then discusses how kernel methods can be used as an alternative, by mapping documents into a feature space without explicitly extracting features. Different string kernel algorithms are described that measure similarity between documents based on common subsequences of characters. The document evaluates the performance of these kernels on a text dataset and explores ways to improve efficiency, such as through kernel approximation.
Brute force searching, the typical set and guesswork - Lisandro Mierez
Consider the problem of identifying the value of a discrete random variable by only asking questions of the sort: is its value X? That this is a time-consuming task is a cornerstone of computationally secure ciphers.
This document provides an overview of topic modeling and Latent Dirichlet Allocation (LDA). It begins by discussing Aristotle's definition of topics as headings under which arguments fall. It then explains LDA's view of topics as distributions of co-occurring words. The document outlines the parameters and process of LDA, including variational inference using the Expectation-Maximization algorithm to estimate topic distributions and document-topic distributions. It concludes by describing how to compute the likelihood of LDA models.
(1) The document describes probabilistic methods for structured document classification that were submitted to the INEX'07 Document Mining track.
(2) It evaluates five runs using naive Bayes and OR gate Bayesian network classifiers with different document representations and feature selection techniques.
(3) The best run used an OR gate classifier with a better weight approximation on only text, achieving a microaverage of 0.78998 and macroaverage of 0.76054.
The document discusses different techniques for topic modeling of documents, including TF-IDF weighting and cosine similarity. It proposes a semi-supervised approach that uses predefined topics from Prismatic to train an LDA model on Wikipedia articles. This model classifies news articles into topics. The accuracy is improved by redistributing term weights based on their relevance within topic clusters rather than just document frequency. An experiment on over 5000 news articles found that the combined weighting approach outperformed TF-IDF alone on articles with multiple topics or limited content.
The document describes a data structure called a Compact Dynamic Rewritable Array (CDRW) that compactly stores arrays where each entry can be dynamically rewritten. It supports creating an array of size N where each entry is initially 0 bits, setting an entry to a value of at most k bits, and getting an entry's value. The goal is to use close to the minimum possible space of the sum of each entry's length while supporting these operations in O(1) time. The document presents solutions using compact hashing that achieve O(1) time for get and set using (1+ε) times the minimum space plus O(N) bits, for any constant ε > 0. Experimental results show these perform well in terms
This document summarizes text classification in PHP. It discusses what text classification is, common natural language processing terminology like tokenization and stemming, Bayes' theorem and how it relates to naive Bayes classification. It provides examples of tokenizing, stemming, stopping words, and building a naive Bayes classifier in PHP using the NlpTools library. Key steps like training and testing a classifier on sample text data are demonstrated.
Local Closed World Semantics - DL 2011 Poster - Adila Krisnadhi
The document proposes a local closed world semantics called grounded circumscription (GC) for description logics. GC simplifies classical circumscription by restricting the extensions of minimized predicates to named individuals, ensuring decidability. It defines GC models and logical consequences for a knowledge base under a set of minimized concepts and roles. The document proves that GC knowledge base satisfiability remains decidable if the underlying description logic is decidable, and discusses future work on tableau algorithms and other inference problems under GC semantics.
This document provides an overview of communication complexity and deterministic communication complexity. It begins with defining the problem setup, which involves two parties (Alice and Bob) trying to compute a function f based on their private inputs using the minimum communication. It then discusses protocol trees, which model communication protocols, and shows they partition the input space into rectangles. It introduces combinatorial rectangles and shows how the minimum number of rectangles needed to partition the space can provide a lower bound on communication complexity. The document also discusses fooling sets, another technique to lower bound communication complexity, and previews upcoming topics on nondeterministic and randomized communication complexity.
Research Summary: Hidden Topic Markov Models, Gruber - Alex Klibisz
This document summarizes the Hidden Topic Markov Model (HTMM) proposed by Gruber et al. HTMM extends latent Dirichlet allocation (LDA) by adding a Markovian assumption that topics only transition at sentence boundaries. HTMM outperforms LDA on a dataset of NIPS papers by learning more coherent topics that better disambiguate words belonging to multiple topics. The HTMM model is trained using an expectation-maximization algorithm, and experiments show HTMM has lower perplexity than LDA, indicating it better predicts unseen documents.
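Since perplexity is the metric used here to compare HTMM and LDA, the following shows the usual way perplexity is computed from held-out per-token log-likelihoods. The sample values in the usage line are made up.

```python
# Perplexity from per-token log-likelihoods (standard definition; toy values).
import math

def perplexity(per_token_log_likelihoods):
    """exp(-average log-likelihood per token); lower is better."""
    n = len(per_token_log_likelihoods)
    return math.exp(-sum(per_token_log_likelihoods) / n)

print(perplexity([-6.2, -5.8, -7.1]))
```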
Extractive Document Summarization - An Unsupervised Approach - Findwise
1. This paper presents and evaluates an unsupervised extractive document summarization system that uses TextRank, K-means clustering, and one-class SVM algorithms for sentence ranking.
2. The system achieves state-of-the-art performance on the DUC 2002 English dataset with a ROUGE score of 0.4797 and can also summarize Swedish documents.
3. Domain knowledge is added through sentence boosting to improve summarization of news articles, and similarities between sentences are calculated to avoid redundancy for multi-document summarization.
Patch models and sparse decompositions of image patches. Dictionary learning and the k-SVD algorithm. Collaborative filtering and BM3D. Non-local sparse based models. Expected patch log-likelihood. Other applications of patch models in inpainting, super-resolution and deblurring.
Coclustering Base Classification For Out Of Domain Documents - lau
This document presents a co-clustering based classification algorithm (CoCC) for classifying documents from a related but different domain (out-of-domain documents) by utilizing labeled documents from another domain (in-domain documents). CoCC aims to simultaneously cluster out-of-domain documents and words to minimize the loss of mutual information, outperforming traditional supervised and semi-supervised algorithms. While CoCC achieved good performance, its time complexity can be inefficient due to the large number of word clusters. Future work will focus on speeding up the algorithm.
Topic Models - LDA and Correlated Topic Models - Claudia Wagner
The document discusses topic models like Latent Dirichlet Allocation (LDA) and Correlated Topic Models (CTM). LDA is a generative probabilistic model that can discover topics in a collection of documents and represent documents as mixtures over latent topics. CTM extends LDA by allowing topic correlations, since LDA assumes topic independence. CTM models topic proportions using a logistic normal distribution rather than LDA's Dirichlet distribution, allowing dependencies between topics.
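The logistic normal distribution that CTM uses in place of the Dirichlet can be illustrated in a few lines: draw a Gaussian vector with a full covariance matrix and map it onto the simplex with a softmax. The mean and covariance below are toy values chosen so that two topics are positively coupled.

```python
# Drawing topic proportions from a logistic-normal distribution (toy values).
import numpy as np

rng = np.random.default_rng(0)
K = 4
mu = np.zeros(K)
Sigma = np.array([[1.0, 0.6, 0.0, 0.0],     # positive covariance couples
                  [0.6, 1.0, 0.0, 0.0],     # topics 0 and 1
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])

eta = rng.multivariate_normal(mu, Sigma)     # Gaussian draw in R^K
theta = np.exp(eta) / np.exp(eta).sum()      # softmax maps it onto the simplex
print(theta)
```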
This document discusses probabilistic models used for text mining. It introduces mixture models, Bayesian nonparametric models, and graphical models including Bayesian networks, hidden Markov models, Markov random fields, and conditional random fields. It provides details on the general framework of mixture models and examples like topic models PLSA and LDA. It also discusses learning algorithms for probabilistic models like EM algorithm and Gibbs sampling.
Mini-batch Variational Inference for Time-Aware Topic Modeling - Tomonari Masada
This paper proposes a time-aware topic model that uses two vector embeddings: one for latent topics and one for document timestamps. By combining these embeddings, the model extracts time-dependent word probability distributions for each topic. The paper also proposes a mini-batch variational inference for the model that does not require knowing the total number of documents, allowing efficient processing of large datasets. Evaluation on a paper title dataset showed the model could improve perplexity by using timestamps and was comparable to collapsed Gibbs sampling while being more memory efficient.
The document describes the Correlated Topic Model (CTM), which addresses a limitation of LDA and other topic models by directly modeling correlations between topics. CTM uses a logistic normal distribution over topic proportions instead of a Dirichlet, allowing for covariance structure between topics. This provides a more realistic model of latent topic structure where presence of one topic may be correlated with another. Variational inference is used to approximate posterior inference in CTM. The model is shown to provide a better fit than LDA on a corpus of journal articles.
This document proposes online inference algorithms for topic models as an alternative to traditional batch algorithms. It introduces two related online algorithms: incremental Gibbs samplers and particle filters. These algorithms update estimates of topics incrementally as each new document is observed, making them suitable for applications where the document collection grows over time. The algorithms are evaluated in comparison to existing batch algorithms to analyze their runtime and performance.
Latent Dirichlet Allocation presentation - Soojung Hong
Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. LDA represents documents as random mixtures over latent topics, characterized by probability distributions over words. LDA addresses limitations of previous topic models like pLSI by treating topic mixtures as random variables rather than document-specific parameters. Variational inference and EM algorithms are used for parameter estimation in LDA. Empirical results show LDA outperforms other models on tasks like document modeling, classification, and collaborative filtering.
This document describes LSI text clustering. It discusses vector space models, term weighting using TF-IDF, similarity measures, latent semantic indexing using singular value decomposition, suffix arrays and longest common prefix arrays for phrase discovery. The clustering algorithm involves preprocessing text, feature extraction to find terms and phrases, applying LSI to discover concepts and determine cluster labels, assigning documents to clusters, and calculating cluster scores. Parameters and issues with the algorithm are also outlined. A demo clusters a set of question and answer documents.
This document discusses Latent Dirichlet Allocation (LDA), a probabilistic topic modeling technique. It begins with an introduction to topic models and their use in understanding large collections of documents. It then describes LDA's generative process using Dirichlet distributions to represent document-topic and topic-term distributions. Approximate inference methods for LDA like Gibbs sampling are also summarized. The document concludes by outlining the implementation of an LDA model, including preprocessing of documents and collapsed Gibbs sampling.
This document is a thesis submitted by Sihan Chen for a Master's degree in Statistics at the University of Chicago. It compares two topic models - Latent Dirichlet Allocation (LDA) and Von Mises-Fisher (vMF) clustering. LDA uses variational inference to approximate the posterior distribution of topics, while vMF clustering incorporates word embeddings. The thesis experiments with topic assignments, word co-occurrence, and pointwise mutual information to compare the two models.
Representing Documents and Queries as Sets of Word Embedded Vectors for Infor... - Dwaipayan Roy
1) The document presents a method to represent documents and queries as sets of word embeddings for information retrieval. It uses word embeddings to create a "Bag of Vectors" representation of documents and queries.
2) Documents are modeled as mixtures of Gaussian distributions centered around the word embeddings. Queries are represented as posterior likelihoods over these Gaussian mixtures.
3) The method is evaluated on several TREC datasets, showing improved retrieval performance over the standard language modeling approach on some datasets, particularly when using k-means clustering to assign words to Gaussian mixtures. The best performance was achieved with 100 Gaussian mixtures (a small k-means clustering sketch follows this list).
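A rough sketch of the k-means step mentioned in item 3: cluster word embeddings into a fixed number of centroids that can then serve as mixture centers. The embedding matrix here is random stand-in data, and the cluster count is an assumption.

```python
# Clustering word embeddings with k-means (stand-in data, illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(1000, 100))   # stand-in for trained embeddings

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(word_vectors)
cluster_of_word = kmeans.labels_              # cluster id for each word
centroids = kmeans.cluster_centers_           # one centroid per mixture component
```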
This document provides an overview of various techniques for text categorization, including decision trees, maximum entropy modeling, perceptrons, and K-nearest neighbor classification. It discusses the data representation, model class, and training procedure for each technique. Key aspects covered include feature selection, parameter estimation, convergence criteria, and the advantages/limitations of each approach.
The document describes Dedalo, a system that automatically explains clusters of data by traversing linked data to find explanations. It evaluates different heuristics for guiding the traversal, finding that entropy and conditional entropy outperform other measures by reducing redundancy and search time. Experiments on authorship clusters, publication clusters, and library book borrowings demonstrate Dedalo's ability to discover explanatory linked data patterns within a limited domain. Future work includes extending Dedalo to handle more complex datasets by addressing issues such as sameAs linking and use of literals.
The document describes several alternative models for information retrieval, including fuzzy set models, extended Boolean models, generalized vector space models, latent semantic indexing models, neural network models, and Bayesian network models. It provides details on fuzzy set models that allow gradual membership in sets, extended Boolean models that combine Boolean queries with vector space characteristics, and Bayesian networks that use directed acyclic graphs and conditional probabilities.
The document describes several alternative models for information retrieval, including fuzzy set models, extended Boolean models, generalized vector space models, latent semantic indexing models, neural network models, and Bayesian network models. It provides details on fuzzy set models that allow gradual membership in sets, extended Boolean models that combine Boolean queries with vector space characteristics, and Bayesian networks that use directed acyclic graphs and conditional probabilities.
This document proposes a construction for quantum hash functions based on classical universal hashing. It defines the concept of a quantum hash generator as a family of functions that maps elements to quantum states. The construction combines the robustness of classical error-correcting codes with the compact representation of quantum systems. It presents two examples of quantum hash functions: one based on binary error-correcting codes and one based on a field of elements. The construction is proved to be optimal in terms of the number of qubits needed to represent the quantum states.
A Simple Stochastic Gradient Variational Bayes for Latent Dirichlet Allocation - Tomonari Masada
This document proposes a new inference method for latent Dirichlet allocation (LDA) based on stochastic gradient variational Bayes (SGVB). The proposed method approximates the true posterior using a logistic normal distribution, rather than the Dirichlet distribution used in standard variational Bayes for LDA. Through experiments, the proposed method achieved better predictive performance than standard variational Bayes and collapsed Gibbs sampling on many datasets, demonstrating the effectiveness of SGVB for devising new variational inferences.
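The logistic-normal reparameterization at the heart of SGVB-style inference can be sketched in a few lines: sample standard Gaussian noise, shift and scale it by the variational parameters, and softmax the result onto the simplex. Shapes and names are assumptions for illustration.

```python
# Reparameterized logistic-normal sample of topic proportions (illustrative).
import numpy as np

def sample_topic_proportions(mu, log_sigma, rng):
    """mu, log_sigma: variational parameters of shape (K,)."""
    eps = rng.standard_normal(mu.shape)          # noise, independent of the parameters
    eta = mu + np.exp(log_sigma) * eps           # reparameterized Gaussian draw
    return np.exp(eta) / np.exp(eta).sum()       # softmax onto the topic simplex

rng = np.random.default_rng(0)
theta = sample_topic_proportions(np.zeros(10), np.zeros(10), rng)
print(theta)
```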
Discovering Novel Information with sentence Level clustering From Multi-docu... - irjes
The document presents a novel fuzzy clustering algorithm called FRECCA that clusters sentences from multi-documents to discover new information. FRECCA uses fuzzy relational eigenvector centrality to calculate page rank scores for sentences within clusters, treating the scores as likelihoods. It uses expectation maximization to optimize cluster membership values and mixing coefficients without a parameterized likelihood function. An evaluation shows FRECCA achieves superior performance to other clustering algorithms on a quotations dataset, identifying overlapping clusters of semantically related sentences.
This project uses Latent Dirichlet Allocation (LDA) to build topic models from Twitter data related to two 2014 events - the Ferguson unrest and the killings of two NY police officers. LDA was used to cluster tweets by topic into 6 groups. The trends of these topics and prominent "meme words" within each topic were then tracked over time to analyze how public discussion evolved. Key findings included identifying topics focused on the initial reactions/news reports, increasing public reaction on the second day, and fragmented discussion on the third day as more details emerged.
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING - cscpconf
A large amount of digital text information is generated every day, and effectively searching, managing, and exploring the text data has become a main task. In this paper, we first present an introduction to text mining and the probabilistic topic model Latent Dirichlet allocation. Then two experiments are proposed: Wikipedia article topic modelling and user tweet topic modelling. The former builds a document topic model, aiming at a topic-perspective solution for searching, exploring, and recommending articles. The latter sets up a user topic model, providing a full research and analysis of Twitter users' interests. The experiment process, including data collection, data pre-processing, and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could be a useful computational tool for social and business research.
Similar to Context-dependent Token-wise Variational Autoencoder for Topic Modeling (20)
This note explicates some details of the discussion given in Appendix B of E. Jang, S. Gu, and B. Poole, Categorical Reparameterization with Gumbel-Softmax, ICLR 2017.
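For reference, a minimal NumPy version of the Gumbel-softmax sample that the note discusses: perturb log-probabilities with Gumbel(0, 1) noise and apply a temperature softmax. The temperature value here is an arbitrary choice.

```python
# Gumbel-softmax sampling sketch (temperature and inputs are toy assumptions).
import numpy as np

def gumbel_softmax_sample(logits, temperature=0.5, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    u = rng.uniform(size=np.shape(logits))
    gumbel = -np.log(-np.log(u))                 # Gumbel(0, 1) noise
    y = (np.asarray(logits) + gumbel) / temperature
    y = y - y.max()                              # for numerical stability
    return np.exp(y) / np.exp(y).sum()           # soft, differentiable one-hot sample

print(gumbel_softmax_sample(np.log([0.7, 0.2, 0.1])))
```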
A note on variational inference for the univariate Gaussian - Tomonari Masada
1. The document summarizes variational inference for the univariate Gaussian model.
2. It derives an approximate posterior distribution q(μ, τ) that factorizes into q(μ) and q(τ).
3. q(μ) is Gaussian and q(τ) is Gamma, where their parameters are updated through an iterative procedure that maximizes a lower bound on the log evidence (a rough sketch of these updates follows below).
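A rough coordinate-ascent sketch of these updates, assuming the standard Normal-Gamma prior parameterization from Bishop's textbook treatment (x_i ~ N(μ, τ⁻¹), μ ~ N(μ0, (λ0 τ)⁻¹), τ ~ Gamma(a0, b0)). The prior values and the interface are toy assumptions, and the note's own notation may differ.

```python
# Coordinate-ascent variational inference for the univariate Gaussian
# (Normal-Gamma prior; prior values and interface are illustrative assumptions).
import numpy as np

def cavi_gaussian(x, mu0=0.0, lam0=1.0, a0=1.0, b0=1.0, iters=50):
    x = np.asarray(x, dtype=float)
    N, xbar = len(x), x.mean()
    e_tau = a0 / b0                                  # initial E[tau]
    for _ in range(iters):
        # update q(mu) = N(mu_n, 1 / lam_n)
        mu_n = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_n = (lam0 + N) * e_tau
        e_mu, e_mu2 = mu_n, 1.0 / lam_n + mu_n ** 2  # E[mu], E[mu^2]
        # update q(tau) = Gamma(a_n, b_n)
        a_n = a0 + (N + 1) / 2.0
        b_n = b0 + 0.5 * (np.sum(x ** 2) - 2 * e_mu * np.sum(x) + N * e_mu2
                          + lam0 * (e_mu2 - 2 * mu0 * e_mu + mu0 ** 2))
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n
```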
Document Modeling with Implicit Approximate Posterior Distributions - Tomonari Masada
This document proposes a Bayesian probabilistic model for document modeling that uses an implicit approximate posterior distribution for variational inference. The model generates documents by drawing a noise vector from a standard normal distribution, passing it through a neural network to get parameters for a multinomial distribution, and drawing word counts. Unlike previous models, it uses an implicit distribution approximated by an adversarial training framework to flexibly model the posterior. Evaluation shows the model is comparable to LDA in terms of perplexity.
LDA-Based Scoring of Sequences Generated by RNN for Automatic Tanka Composition - Tomonari Masada
This document proposes a method to score sequences generated by RNN for automatic poetry composition using LDA-based topic modeling. It trains an RNN like GRU or LSTM on a corpus of Japanese tanka poems represented as sequences of character bigrams. It then uses LDA to infer topics in the training corpus and assign words in generated sequences to topics. Sequences containing many words assigned to the same topic receive a higher score, promoting diversity in top-ranked sequences compared to scoring based solely on RNN output probabilities. An experiment on tanka generation found the proposed method selected more varied poems.
This document presents the equations and process for deriving the evidence lower bound (ELBO) for a zero-inflated negative binomial variational autoencoder (ZINB-VAE) model. It defines the probability distributions for the ZINB-VAE and breaks the log likelihood function into individual terms. It then uses a variational approximation to the intractable posterior distribution to obtain the ELBO that can be optimized. Monte Carlo sampling is used to approximate the expectation with respect to the variational distribution.
This document summarizes the derivation of an evidence lower bound (ELBO) for latent LSTM allocation, a model that uses an LSTM to determine topic assignments in a topic modeling framework. It expresses the ELBO as terms related to the variational posterior distributions over topics and topics proportions, the generative process of words given topics, and the LSTM's prediction of topic assignments. It also describes how to optimize the ELBO with respect to the variational and LSTM parameters through gradient ascent.
TopicRNN is a generative model for documents that:
1. Draws a topic vector from a standard normal distribution and uses it to generate words in a document.
2. Computes a lower bound on the log marginal likelihood of words and stop word indicators.
3. Approximates the expected values in the lower bound using samples from an inference network that models the approximate posterior distribution over topics.
Topic modeling with Poisson factorization (2) - Tomonari Masada
A modified version of the manuscript published on Feb 3, 2017.
1. Use a gamma prior for $r_k$.
2. Use the same shape parameter $s$ for all gamma distributions.
Topic modeling with Poisson factorization is introduced. The generative model assumes words in documents are generated from topics modeled with Poisson distributions. Variational Bayesian inference is used to approximate the posterior. Update equations are derived for the variational parameters ω, representing topic assignments, α, the Dirichlet prior, and γ, the gamma prior over topic distributions. ω is updated proportionally to functions of α and γ. α is updated based on sums of ω. γ is updated based on sums of ω and the prior shape parameter.
A Simple Stochastic Gradient Variational Bayes for the Correlated Topic Model - Tomonari Masada
This document presents a new method for estimating the posterior distribution of the Correlated Topic Model (CTM) using Stochastic Gradient Variational Bayes (SGVB). The CTM is an extension of LDA that models correlations between topics. The proposed method approximates the true posterior of the CTM with a factorial variational distribution and uses SGVB to maximize the evidence lower bound. This allows incorporating randomness into posterior inference for the CTM without requiring explicit inversion of the covariance matrix. Perplexity results on several datasets were comparable to LDA. Future work could explore online learning for topic models using neural networks.
A Simple Stochastic Gradient Variational Bayes for Latent Dirichlet Allocation - Tomonari Masada
This document proposes applying stochastic gradient variational Bayes (SGVB) to latent Dirichlet allocation (LDA) topic modeling to obtain an efficient posterior estimation. SGVB introduces randomness into variational inference for LDA by estimating expectations with Monte Carlo integration and using reparameterization to sample from approximate posterior distributions. Evaluation on several text corpora shows perplexities comparable to existing LDA inference methods, with the potential for faster parallelization using techniques like GPU processing. Future work will explore applying SGVB to other probabilistic document models like correlated topic models.
This document summarizes a hierarchical topic modeling approach for analyzing traffic speed data. The model treats each day's data as a mixture of different speed distributions. It extends latent Dirichlet allocation to model speeds as continuous gamma distributions rather than discrete words. The model further incorporates metadata on time of day and sensor location to make topic probabilities dependent on context. Model parameters are estimated using variational Bayesian inference, and the model achieves better performance than alternatives by capturing similarity between observations based on timing and location.
A derivation of the sampling formulas for An Entity-Topic Model for Entity Li... - Tomonari Masada
A derivation of the sampling formulas for An Entity-Topic Model for Entity Linking [Han+ EMNLP-CoNLL12] and A Context-Aware Topic Model for Statistical Machine Translation [Su+ ACL15].
This document provides an overview of backpropagation through time (BPTT) for long short-term memory (LSTM) language models. It describes the forward and backward passes for LSTM, including equations for calculating the input, forget, output and cell gates, as well as the cell state and hidden state. In the backward pass, it derives the equations for calculating the gradients with respect to the weights and biases at each time step to update the model parameters during training.
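One forward step matching the gate equations described here can be written compactly as below. The stacked-weight layout and concrete dimensions are assumptions; the sigmoid/tanh gating follows the standard LSTM formulation.

```python
# One LSTM forward step (standard gate equations; layout is an assumption).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """W, U, b hold the stacked parameters for the i, f, o, g gates."""
    z = W @ x_t + U @ h_prev + b          # all four pre-activations at once
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])                   # input gate
    f = sigmoid(z[H:2 * H])               # forget gate
    o = sigmoid(z[2 * H:3 * H])           # output gate
    g = np.tanh(z[3 * H:4 * H])           # candidate cell state
    c_t = f * c_prev + i * g              # new cell state
    h_t = o * np.tanh(c_t)                # new hidden state
    return h_t, c_t
```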
The detailed derivation of the derivatives in Table 2 of Marginalized Denoisi... - Tomonari Masada
The detailed derivation of the derivatives in Table 2 of Marginalized Denoising Auto-encoders for Nonlinear Representations by M. Chen, K. Weinberger, F. Sha, and Y. Bengio
http://www.cse.wustl.edu/~mchen/papers/deepmsda.pdf
Context-dependent Token-wise Variational Autoencoder for Topic Modeling
Context-dependent Token-wise Variational Autoencoder for Topic Modeling
Tomonari Masada [0000-0002-8358-3699]
Nagasaki University, 1-14 Bunkyo-machi, Nagasaki-shi, Nagasaki, Japan
masada@nagasaki-u.ac.jp
Abstract. This paper proposes a new variational autoencoder (VAE) for topic models. Variational inference (VI) for Bayesian models approximates the true posterior distribution by maximizing a lower bound of the log marginal likelihood. We can implement VI as a VAE by using a neural network, called an encoder, and running it over observations to produce the approximate posterior parameters. However, VAE often suffers from latent variable collapse, where the approximate posterior degenerates to a local optimum that merely mimics the prior due to the over-minimization of the KL-divergence between the approximate posterior and the prior. To address this problem for topic modeling, we propose a new VAE. Since we marginalize out topic probabilities by following the method of Mimno et al., our VAE minimizes a KL-divergence that has not been considered in the existing VAEs. Further, we draw samples from the variational posterior for each word token separately. This sampling for Monte Carlo integration is performed with the Gumbel-softmax trick by using document-specific context information. We empirically investigated whether our new VAE could mitigate the difficulty arising from latent variable collapse. The experimental results showed that our VAE improved on the existing VAE for about half of the data sets in terms of perplexity or of normalized pointwise mutual information.
Keywords: Topic modeling · Deep learning · Variational autoencoder.
1 Introduction
Topic modeling is a well-known text analysis method adopted not only in Web data mining but also in a wide variety of research disciplines, including the social sciences (see https://www.structuraltopicmodel.com/) and digital humanities (see https://www.gale.com/intl/primary-sources/digital-scholar-lab). Topic modeling extracts topics from a large document set. Topics in topic modeling are defined as categorical distributions over vocabulary words. Each topic is expected to put high probabilities on the words corresponding to some clear-cut semantic content delivered by the document set. For example, news articles talking about science may use words that are rarely used by articles talking about politics, and vice versa.
Topic modeling can explicate such a difference by providing two different word categorical distributions, one putting high probabilities on words like experiment, hypothesis, and laboratory, and another on words like government, vote, and parliament. Topic modeling achieves two things at the same time. First, it provides, as topics, categorical distributions defined over words as stated above. Second, topic modeling provides topic proportions for each document. Topic proportions are expected to be effective as a lower-dimensional representation of documents that reflects the differences in their semantic content.
This paper discusses inference for Bayesian topic models. Bayesian inference aims to infer the posterior distribution. Sampling-based approaches like MCMC infer the posterior by drawing samples from it; while the samples are drawn from the true posterior, we can only access the posterior through the drawn samples. This paper focuses on variational inference (VI), which approximates the posterior with a tractable surrogate distribution. Our task in VI is to estimate the parameters of the surrogate distribution. VI maximizes a lower bound of the log marginal likelihood log p(xd) for every document xd. A lower bound of log p(xd) is obtained by applying Jensen's inequality as follows:
\log p(x_d) = \log \int p(x_d \mid \Phi)\, p(\Phi)\, d\Phi = \log \int q(\Phi)\, \frac{p(x_d \mid \Phi)\, p(\Phi)}{q(\Phi)}\, d\Phi
\geq \int q(\Phi) \log \frac{p(x_d \mid \Phi)\, p(\Phi)}{q(\Phi)}\, d\Phi = \mathbb{E}_{q(\Phi)}[\log p(x_d \mid \Phi)] - \mathrm{KL}(q(\Phi) \,\|\, p(\Phi)) \equiv L_d    (1)
where Φ are the model parameters and q(Φ) is a surrogate distribution approximating the true posterior. This lower bound is often called the ELBO (Evidence Lower BOund), which we denote by Ld.
In this paper, we consider the variational autoencoder (VAE), a variational inference method that uses neural networks to approximate the true posterior. To estimate the parameters of the approximate posterior q(Φ) in Eq. (1), we can use a neural network called an encoder as follows. By feeding each observed document xd as an input to the encoder, we obtain the parameters of the document-dependent approximate posterior q(Φ|xd) as the corresponding output. The weights and biases of the encoder network are shared by all documents. Further, we approximate the ELBO in Eq. (1) by using Monte Carlo integration:
L_d \approx \frac{1}{S} \sum_{s=1}^{S} \log p(x_d \mid \hat{\Phi}^{(s)}) - \frac{1}{S} \sum_{s=1}^{S} \log \frac{q(\hat{\Phi}^{(s)} \mid x_d)}{p(\hat{\Phi}^{(s)})}    (2)
where Φ̂(s) are samples drawn from q(Φ|xd). The sampling Φ̂ ∼ q(Φ|xd) can be performed by using the reparameterization trick [10, 18, 20, 24]. For example, if we choose the approximate posterior from among the multivariate diagonal Gaussian distributions N(µ(xd), σ(xd)) of dimension K, we use an encoder with 2K outputs, one half of which are used as the mean parameters µ(·) and the other half as the log standard deviation parameters log σ(·). We then obtain a sample Φ̂ ∼ N(µ(xd), σ(xd)) by drawing a sample ε̂ from the K-dimensional standard Gaussian distribution and setting Φ̂ = ε̂ ⊙ σ(xd) + µ(xd), where ⊙ denotes element-wise multiplication. The reparameterization trick enables us to backpropagate through the samples to the weights and biases defining the encoder network. While Ld in Eq. (2) is formulated for a particular document xd, we maximize the ELBO for many documents simultaneously by using mini-batch optimization.
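To make this concrete, the following is a minimal PyTorch sketch of a diagonal Gaussian encoder with the reparameterization trick; the class name, hidden size, and layer layout are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Maps a bag-of-words vector to the 2K Gaussian parameters (mu, log sigma)."""
    def __init__(self, vocab_size, num_topics, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(vocab_size, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, num_topics)
        self.log_sigma = nn.Linear(hidden_dim, num_topics)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.log_sigma(h)

def reparameterize(mu, log_sigma):
    # eps ~ N(0, I); the resulting sample stays differentiable w.r.t. mu and log_sigma.
    eps = torch.randn_like(mu)
    return eps * log_sigma.exp() + mu
```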
However, VAE often suffers from latent variable collapse, where the approximate posterior q(Φ|xd) degenerates to a local optimum that merely mimics the prior p(Φ). Latent variable collapse occurs when the maximization of Ld in Eq. (1) makes the KL-divergence term KL(q(Φ) ‖ p(Φ)) excessively close to zero. The main contribution of this paper is to provide a new VAE for topic modeling, in which we minimize a KL-divergence that has not been considered in the previous VAEs. While we only consider latent Dirichlet allocation (LDA) [1], a similar proposal can be given for other topic models. The evaluation experiment, conducted over four large document sets, showed that the proposed VAE improved on the previous VAE for some data sets in terms of perplexity or of normalized pointwise mutual information (NPMI).
The rest of the paper is organized as follows. Section 2 introduces LDA
and describes the details of our proposal. Section 3 gives the results of the
experiment where we compare our new VAE with other variational inference
methods. Section 4 provides relevant previous work. Section 5 concludes the
paper by suggesting future research directions.
2 Method
2.1 Latent Dirichlet Allocation (LDA)
This subsection introduces LDA [1], for which we propose a new VAE. Assume
that there are D documents X = {x1, . . . , xD}. We denote the vocabulary size
by V and the vocabulary set by {w1, . . . , wV }. Each document xd is represented
as bag-of-words, i.e., a multiset of vocabulary words. The total number of the
word tokens in xd is referred to by nd. Let xd,i be the observable random variable
giving the word appearing as the i-th token in xd. That is, xd,i = wv means that
a token of the word wv appears as the i-th token in xd. Topic modeling assigns
each word token to one among the fixed number of topics. This can be regarded
as a token-level clustering. We denote the number of topics by K and the topic
set by {t1, . . . , tK}. Let zd,i be the latent random variable representing the topic
to which xd,i is assigned. That is, zd,i = tk means that the i-th word token in
xd is assigned to the topic tk.
LDA is a generative model, which generates the data set X as below. First
of all, we assume that a categorical distribution Categorical(βk) is prepared
for each topic tk, where βk,v is the probability of the word wv in the topic tk.
βk = (βk,1, . . . , βk,V ) satisfies Σv βk,v = 1. LDA generates documents as follows:
– For d = 1, . . . , D:
• A parameter vector θd = (θd,1, . . . , θd,K) of the categorical distribution
Categorical(θd) is drawn from the symmetric Dirichlet prior Dirichlet(α).
• For i = 1, . . . , nd:
∗ A latent topic zd,i is drawn from Categorical(θd).
∗ A word xd,i is drawn from Categorical(βzd,i).
where θd,k is the probability of tk in xd. In this paper, we apply no prior distri-
bution to the categorical distributions Categorical(βk) for k = 1, . . . , K.
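As an illustration only (the paper gives no code), a minimal NumPy sketch of this generative story for a single document, assuming fixed topic-word distributions beta and a symmetric Dirichlet parameter alpha, is:

```python
import numpy as np

def generate_document(beta, alpha, n_tokens, seed=0):
    """Simulate one document under LDA's generative story.

    beta: array of shape (K, V); each row is a topic's distribution over words.
    alpha: scalar parameter of the symmetric Dirichlet prior.
    n_tokens: number of word tokens n_d to generate.
    """
    rng = np.random.default_rng(seed)
    K, V = beta.shape
    theta = rng.dirichlet(alpha * np.ones(K))             # theta_d ~ Dirichlet(alpha)
    z = rng.choice(K, size=n_tokens, p=theta)             # z_{d,i} ~ Categorical(theta_d)
    x = np.array([rng.choice(V, p=beta[k]) for k in z])   # x_{d,i} ~ Categorical(beta_{z_{d,i}})
    return theta, z, x
```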
Based on the above generative story, the marginal likelihood (also termed
the evidence) of each document xd for d = 1, . . . , D can be given as follows:
p(x_d; \alpha, \beta) = \int \Big( \sum_{z_d} p(x_d \mid z_d; \beta)\, p(z_d \mid \theta_d) \Big)\, p(\theta_d; \alpha)\, d\theta_d    (3)
where p(θd; α) represents the prior Dirichlet(α). The variational inference (VI)
given in [1] maximizes the following lower bound of the log evidence:
\log p(x_d; \alpha, \beta) \geq \mathbb{E}_{q(z_d)}[\log p(x_d \mid z_d; \beta)] + \mathbb{E}_{q(\theta_d) q(z_d)}[\log p(z_d \mid \theta_d)] - \mathrm{KL}(q(\theta_d) \,\|\, p(\theta_d; \alpha)) - \mathbb{E}_{q(z_d)}[\log q(z_d)]    (4)
This lower bound is obtained by applying Jensen’s inequality as in Eq. (1) and
is often called ELBO (Evidence Lower BOund).
In the VI given by [1], the variational distribution q(θd) is a Dirichlet distribu-
tion parameterized separately for each document. That is, there is no connection
between the variational parameters of different documents. By replacing this per-
document variational posterior with an amortized variational posterior q(θd|xd),
we obtain a variational autoencoder (VAE) for LDA [14, 22]. However, VAE often suffers from a problem called latent variable collapse, where the KL-divergence term in Eq. (4) collapses toward zero. Latent variable collapse makes the approximate posterior q(θd|xd) almost equivalent to the prior p(θd; α). Consequently, we cannot obtain informative topic probabilities θd for any document.
2.2 Context-dependent and Token-wise VAE for LDA
The VAE for LDA proposed in [22] avoids latent variable collapse only by tweak-
ing the optimization, i.e., by tuning optimizer parameters, clipping the gradient,
adopting batch normalization, and so on. There are also other proposals using
KL-cost annealing [8, 21] to avoid latent variable collapse. In contrast, we address the issue by removing the problematic KL-divergence term KL(q(θd) ‖ p(θd; α)) from the ELBO. We follow Mimno et al. [15], marginalize out the per-document topic proportions θd in Eq. (3), and obtain the following evidence:
p(x_d; \alpha, \beta) = \sum_{z_d} p(x_d \mid z_d; \beta)\, p(z_d; \alpha)    (5)
By marginalizing out θd, we can avoid the approximation of the posterior dis-
tribution over θd. Consequently, we don’t need to estimate the problematic KL-
divergence. However, we now need to approximate the posterior over the topic
assignments zd. The ELBO to be maximized in our case is
\log p(x_d \mid \alpha, \beta) = \log \sum_{z_d} p(x_d \mid \beta, z_d)\, p(z_d \mid \alpha) \geq \mathbb{E}_{q(z_d \mid x_d)}[\log p(x_d \mid \beta, z_d)] - \mathrm{KL}(q(z_d \mid x_d) \,\|\, p(z_d \mid \alpha))    (6)
The KL-divergence KL(q(zd|xd) ‖ p(zd|α)) that our VAE minimizes has not been considered in the previous VAEs for topic modeling. We found in our evaluation experiment that this KL-divergence required no special treatment for achieving a good evaluation result. This finding is the main contribution.
The first term on the right-hand side of Eq. (6) can be rewritten as
\mathbb{E}_{q(z_d \mid x_d)}[\log p(x_d \mid \beta, z_d)] = \sum_{i=1}^{n_d} \mathbb{E}_{q(z_{d,i} \mid x_d)}[\log \beta_{z_{d,i}, x_{d,i}}]    (7)
where we assume that q(zd|xd) is factorized as ∏_{i=1}^{nd} q(zd,i|xd) and regard each component q(zd,i|xd) as a categorical distribution. In this paper, we parameterize βk,v as βk,v ∝ exp(ηk,v + η0,v). The variable η0,v is shared by all topics and roughly represents the background word probabilities. The per-topic parameters ηk,v and the topic-indifferent parameters η0,v are estimated directly by maximizing the likelihood in Eq. (7).
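As a minimal sketch (the module name and zero initialization are assumptions, not the paper's implementation), this parameterization of βk,v with a shared background term can be written in PyTorch as:

```python
import torch
import torch.nn as nn

class TopicWordDistribution(nn.Module):
    """beta_{k,v} proportional to exp(eta_{k,v} + eta_{0,v})."""
    def __init__(self, num_topics, vocab_size):
        super().__init__()
        self.eta = nn.Parameter(torch.zeros(num_topics, vocab_size))   # per-topic logits
        self.eta0 = nn.Parameter(torch.zeros(vocab_size))              # shared background logits

    def forward(self):
        # Softmax over the vocabulary gives each topic's word distribution, shape (K, V).
        return torch.softmax(self.eta + self.eta0, dim=-1)
```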
The negative KL-divergence in Eq. (6) is given as −KL(q(zd|xd) ‖ p(zd|α)) = E_{q(zd|xd)}[log p(zd|α)] − E_{q(zd|xd)}[log q(zd|xd)]. The first half is given as follows by noticing that p(zd|α) is a Pólya (Dirichlet-multinomial) distribution:
\mathbb{E}_{q(z_d \mid x_d)}[\ln p(z_d \mid \alpha)] = \mathbb{E}_{q(z_d \mid x_d)}\!\left[\ln \frac{n_d!\, \Gamma(K\alpha)}{\Gamma(n_d + K\alpha)} \prod_{k=1}^{K} \frac{\Gamma(n_{d,k} + \alpha)}{n_{d,k}!\, \Gamma(\alpha)}\right]    (8)
where nd,k is the number of the word tokens assigned to the topic tk in xd. It should be noted that nd,k depends on zd. The second half of the KL-divergence is an entropy term:
-\mathbb{E}_{q(z_d \mid x_d)}[\ln q(z_d \mid x_d)] = -\sum_{i=1}^{n_d} \mathbb{E}_{q(z_{d,i} \mid x_d)}[\ln q(z_{d,i} \mid x_d)]    (9)
The main task in our proposed VAE is to estimate the above three expectations
by drawing samples from the approximate posterior q(zd|xd).
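For illustration, a small PyTorch sketch of how the expectations in Eqs. (8) and (9) can be evaluated at a single Monte Carlo sample of soft topic counts; the tensor shapes and function names are assumptions made for this sketch, and the relaxed samples that yield the soft counts are described below:

```python
import torch

def polya_log_prob(soft_counts, alpha):
    """Eq. (8) integrand: log p(z_d | alpha) under the Polya distribution,
    evaluated at soft counts n_{d,k}; soft_counts has shape (batch, K)."""
    K = soft_counts.shape[-1]
    n_d = soft_counts.sum(-1)
    alpha = torch.as_tensor(alpha, dtype=soft_counts.dtype)
    return (torch.lgamma(n_d + 1.0) + torch.lgamma(K * alpha) - torch.lgamma(n_d + K * alpha)
            + (torch.lgamma(soft_counts + alpha)
               - torch.lgamma(soft_counts + 1.0)
               - torch.lgamma(alpha)).sum(-1))

def token_entropy(gamma, eps=1e-10):
    """Eq. (9): entropy of the token-wise categorical posteriors q(z_{d,i}|x_d);
    gamma has shape (batch, n_d, K)."""
    return -(gamma * (gamma + eps).log()).sum(-1).sum(-1)
```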
Our proposed method models the approximate posterior q(zd|xd) in an amortized manner by using a multilayer perceptron (MLP). As already stated, we assume that q(zd|xd) is factorized as ∏_{i=1}^{nd} q(zd,i|xd) and regard each component q(zd,i|xd) as a categorical distribution. The parameters of each q(zd,i|xd) for i = 1, . . . , nd are obtained as follows by using an encoder MLP. Let ev denote the embedding vector of the vocabulary word wv. Let γd,i = (γd,i,1, . . . , γd,i,K) denote the parameter vector of the categorical distribution q(zd,i|xd), where γd,i,k is the probability that the i-th token in the document xd is assigned to the topic tk.
Table 1. Specifications of data sets
Data set    Dtrain        Dvalid       Dtest        V
DBLP        2,779,431     346,792      347,291      155,187
PUBMED      5,738,672     1,230,981    1,230,347    141,043
DIET        10,549,675    1,318,663    1,318,805    152,200
NYTimes     239,785       29,978       29,968       102,660
We then obtain γd,i as the softmaxed output of the encoder MLP, whose input is the concatenation of the embedding vector of the i-th word token, i.e., e_{xd,i}, and the mean of the embedding vectors of all word tokens in xd, i.e., ēd ≡ (1/nd) Σ_{i=1}^{nd} e_{xd,i}. That is, γd,i = Softmax(MLP(e_{xd,i}, ēd)). To make the posterior inference memory efficient, we reuse ηk,v, which defines βk,v, as ev,k, i.e., the k-th element of the embedding vector of the word wv. Such weight tying is also used in [16, 17].
We then draw samples from q(zd,i|xd) ≡ Categorical(γd,i) with the Gumbel-softmax trick [7]. That is, a discrete sample zd,i ≡ argmax_k (gd,i,k + log γd,i,k) is approximated by the continuous sample vector yd,i = (yd,i,1, . . . , yd,i,K), i.e., yd,i,k ∝ exp((gd,i,k + log γd,i,k)/τ) for k = 1, . . . , K, where gd,i,1, . . . , gd,i,K are samples from the standard Gumbel distribution Gumbel(0, 1) and τ is the temperature. We can obtain nd,k in Eq. (8) as nd,k = Σ_{i=1}^{nd} yd,i,k. It should be noted that we obtain a different sample for each token. That is, the sampling from the approximate posterior is performed in a token-wise manner. Further, recall that our encoder MLP accepts context information, i.e., ēd. Therefore, we call our method the context-dependent token-wise variational autoencoder (CTVAE). While we here propose CTVAE only for LDA, a similar proposal can be made for other Bayesian topic models. This would be interesting as future work.
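Putting these pieces together, the following is a minimal PyTorch sketch of the context-dependent token-wise sampling described above; the MLP architecture, the helper names, and the single-document handling are illustrative assumptions rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CTVAEEncoder(nn.Module):
    """Produces token-wise topic posteriors gamma_{d,i} from (token embedding, context)."""
    def __init__(self, num_topics, hidden_dim=256):
        super().__init__()
        # With the weight tying above, each word embedding has dimension K.
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_topics, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_topics),
        )

    def forward(self, token_emb):
        # token_emb: (n_d, K) embeddings of the document's tokens.
        context = token_emb.mean(dim=0, keepdim=True).expand_as(token_emb)  # \bar{e}_d
        logits = self.mlp(torch.cat([token_emb, context], dim=-1))
        return F.softmax(logits, dim=-1)          # gamma_{d,i}, shape (n_d, K)

def gumbel_softmax_counts(gamma, tau=0.5):
    """Token-wise relaxed samples y_{d,i} and the resulting soft topic counts n_{d,k}."""
    logits = gamma.clamp_min(1e-10).log()
    y = F.gumbel_softmax(logits, tau=tau, hard=False)   # one relaxed sample per token
    return y, y.sum(dim=0)                              # soft counts, shape (K,)
```

Under these assumptions, the relaxed samples y and the soft counts would feed the expectations in Eqs. (7) through (9).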
3 Evaluation Experiment
3.1 Data Sets and Compared Methods
The evaluation used four large document sets to demonstrate both effectiveness and scalability. The specifications are given in Table 1, where Dtrain, Dvalid, and Dtest are the sizes of the training set, validation set, and test set, respectively, and V is the vocabulary size. DBLP is a subset of the records downloaded from the DBLP Web site (https://dblp.uni-trier.de/xml/dblp.xml.gz) on July 24, 2018, where each paper title is regarded as a document. PUBMED is a part of the bag-of-words data set provided by the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/bag+of+words). The DIET data set is a subset of the Diet Record of both houses of the National Diet of Japan (http://kokkai.ndl.go.jp/) from 1993 to 2012, where each statement by a speaker is regarded as a separate document. Since the DIET data set is given in Japanese, we used the MeCab morphological analyzer (http://taku910.github.io/mecab/). We also used the NYTimes part of the above-mentioned bag-of-words data set.
Our proposal, CTVAE, and all compared methods were implemented in PyTorch (https://pytorch.org/). The number of topics was fixed to K = 50. All free parameters, including the number of layers and the layer sizes of the encoder MLP, were tuned based on the perplexity computed over the validation set. The perplexity is defined as the exponential of the negative log likelihood of each document, where the log likelihood is normalized by the number of word tokens. The mean of the perplexity over all validation documents was used as the evaluation measure for tuning the free parameters. We adopted the Adam optimizer [9], and the learning rate scheduling was also tuned based on the validation perplexity.
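For clarity, the validation perplexity used for tuning can be computed as in the following sketch (function names are illustrative; the paper gives no code):

```python
import numpy as np

def document_perplexity(log_likelihood, n_tokens):
    """Exponential of the negative per-token log likelihood of one document."""
    return np.exp(-log_likelihood / n_tokens)

def mean_validation_perplexity(doc_log_likelihoods, doc_lengths):
    """Mean perplexity over all validation documents, as used for model selection."""
    return float(np.mean([document_perplexity(ll, n)
                          for ll, n in zip(doc_log_likelihoods, doc_lengths)]))
```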
We compared our proposal to the following three variational inference meth-
ods: a plain variational autoencoder (VAE), mini-batch variational Bayes (VB),
and adversarial variational Bayes (AVB). Each is explained below.
VAE. We proposed CTVAE in order to avoid latent variable collapse in the
existing VAE. Therefore, we compared our proposal to a plain implementation
of VAE. While a version of VAE for topic modeling has already been proposed
[22], it involves additional elaborations that are not relevant to our compari-
son. Therefore, we implemented VAE in a more elementary manner by following
the neural topic model described in [13]. We adopted the standard multivariate Gaussian distribution as the prior distribution. The topic proportions θd were obtained as θd ≡ Softmax(ε̂ ⊙ σ(xd) + µ(xd)), where ε̂ ∼ N(0, I). When maximizing the ELBO, we minimized the KL-divergence from N(µ(xd), σ(xd)) to N(0, I). The mini-batch optimization with Adam was conducted for training the encoder MLP, whose output was the concatenation of µ(xd) and log σ(xd), and also for updating the other parameters. While the VAE proposed by [22] avoids latent variable collapse only by tweaking the optimization, the VAE in our comparison experiment used KL-cost annealing [21, 8] or the path derivative ELBO gradient [19], along with the learning rate scheduling tuned based on the validation perplexity.
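For reference, the KL-divergence between the diagonal Gaussian N(µ(xd), σ(xd)) and the standard Gaussian N(0, I) minimized by this baseline has the standard closed form sketched below (not taken from the paper's code):

```python
import torch

def gaussian_kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over the K dimensions."""
    return 0.5 * torch.sum(mu.pow(2) + (2 * log_sigma).exp() - 2 * log_sigma - 1.0, dim=-1)
```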
VB. We also compared our proposal to the variational Bayes (VB) for LDA
[1], where no neural network was used. We applied no prior distribution to
the per-topic word categorical distributions and directly estimated the per-topic
word probabilities by maximizing the data likelihood. We ran a mini-batch ver-
sion of VB for estimating the per-topic word probabilities. The per-document
variational parameters were updated for each document separately in the same
manner as [1, 6]. While our implementation was based on [6], the optimization
was performed in a ‘deep learning’ manner. That is, we used Adam optimizer,
whose learning rate scheduling was tuned based on the validation perplexity.
AVB. We additionally compared CTVAE to adversarial variational Bayes (AVB) [12]. AVB is a variational inference method that uses an implicit distribution to approximate the true posterior. Since the approximate posterior is implicit,
we cannot obtain an explicit formula of its density function and thus cannot
estimate the KL-divergence term in Eq. (1) by the Monte Carlo integration. The
special feature of AVB is that a GAN-like discriminative density ratio estimation
[4] is adopted for the estimation of the density ratio between the prior and the
variational posterior, which is required for the estimation of the KL-divergence.
Table 2. Evaluation results in terms of test set perplexity
Data set    CTVAE      VAE        AVB        VB
DBLP        1583.88    1749.64    1326.36    1249.92
PUBMED      3669.79    1072.63    991.43     2958.56
DIET        309.04     364.94     336.33     219.97
NYTimes     4077.60    2893.23    2375.12    3155.75
Since the AVB described in [12] is not specialized to topic modeling, the AVB
proposed for topic modeling [11] was used in our comparison.
3.2 Results
Table 2 includes the perplexity computed over the test set for each compared
method and each data set. A low perplexity indicates that the approximate pos-
terior is good at predicting the word occurrences in the test documents. CTVAE
improved the test set perplexity of VAE both for DBLP and DIET data sets.
Especially, for DIET data set, the perplexity of CTVAE was the second best
among the four compared methods. It can be said that when VAE does not give a good result, we can consider using CTVAE as a possible alternative.
AVB gave the best perplexity for PUBMED and NYTimes data sets and was
always better than VAE. However, AVB has two MLPs, i.e., the encoder and
the discriminator, and we need to tune the free parameters of these two MLPs.
Therefore, it is more difficult to train AVB than VAE and CTVAE, because
both VAE and CTVAE have only one MLP, i.e., the encoder. VB gave the best
perplexity for DBLP and DIET data sets. However, VB gave the second worst
perplexity for the other two data sets. Further, VB is not efficient in predicting topic probabilities for unseen documents, because this prediction requires tens of iterations for updating the approximate posterior parameters until convergence. In contrast, VAE, CTVAE, and AVB use the encoder MLP for predicting topic probabilities for unseen documents. Therefore, these methods only require a single forward computation over the encoder for this prediction. As for computational efficiency in time, in the case of the DIET data set, for example, it took 14 hours for CTVAE, 10 hours for AVB, 7 hours for VAE, and 3 hours for VB to achieve the minimum validation perplexity. While VAE and AVB use only a single sample for the Monte Carlo estimation of the expectations with respect to the approximate posterior, CTVAE uses 10 samples. Consequently, CTVAE took more hours to train the encoder MLP than VAE and AVB. VB was the most efficient in time because it used no neural networks.
We next present the results of the evaluation of the high probability words in the extracted topics in Table 3. We measured the quality of the high probability words in each topic with the normalized pointwise mutual information (NPMI) [2]. NPMI is defined for a pair of words (w, w′) as NPMI(w, w′) = log[p(w, w′)/(p(w)p(w′))] / (−log p(w, w′)). We calculated the probability p(w) as the number of the test documents containing the word w divided by the total number of the test documents, and p(w, w′) as the number of the test documents containing both w and w′ divided by the total number of the test documents.
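A small sketch of this document-level co-occurrence NPMI computation (variable names and the handling of never co-occurring pairs are assumptions for illustration):

```python
import numpy as np

def npmi(word_a, word_b, doc_sets, eps=1e-12):
    """NPMI of a word pair from document-level co-occurrence counts.

    doc_sets: list of sets, each holding the distinct words of one test document.
    """
    n_docs = len(doc_sets)
    p_a = sum(word_a in d for d in doc_sets) / n_docs
    p_b = sum(word_b in d for d in doc_sets) / n_docs
    p_ab = sum(word_a in d and word_b in d for d in doc_sets) / n_docs
    if p_ab == 0.0:
        return -1.0  # convention: pairs that never co-occur get the minimum score
    return np.log(p_ab / (p_a * p_b + eps)) / -np.log(p_ab)
```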
Table 3. Evaluation results in terms of test set NPMI
Data set    CTVAE    VAE      AVB      VB
DBLP        0.003    0.036    0.062    0.018
PUBMED      0.127    0.123    0.190    0.210
DIET        0.069    0.096    0.079    0.116
NYTimes     0.141    0.105    0.075    0.218
Table 3 includes NPMI computed over the test set by using the top 20 highest
probability words from each of the 50 topics. We obtain NPMI for each of the
20(20 − 1)/2 = 190 word pairs made by pairing two from the top 20 words and
then calculate the average over all word pairs from all 50 topics. NPMI evaluates
how well the co-occurrence of high probability words in each extracted topic re-
flects the actual word co-occurrence in the test documents. While CTVAE was
worse than VAE in terms of test perplexity for PUBMED and NYTimes data
sets, CTVAE was better in terms of NPMI for these two data sets. Especially,
for NYTimes data set, CTVAE gave the second best NPMI. Table 2 showed
that AVB was always better than VAE in terms of perplexity. However, AVB
improved the NPMI of VAE only for DBLP and PUBMED data sets. Table 3
shows that VB may be the best choice if we are only interested in the quality of
the high probability words of extracted topics and are not interested in the pre-
diction capacity. Finally, Table 4 provides an example of high probability words
obtained by the compared methods for DBLP data set. We selected five topics
from among the 50 topics and presented the top 10 highest probability words
for each of the five topics.
4 Previous Work
While VAE is widely used in the deep learning field, it is still difficult to apply it
to topic modeling, because topic models have discrete latent variables zd,i repre-
senting topic assignments. The reparameterization trick [10, 18, 20, 24] cannot be
straightforwardly applied to discrete variables. Therefore, low-variance gradient
estimation for discrete variables is realized by using some elaborated technique
[25] or the Gumbel-softmax trick [7]. The existing proposals of variational infer-
ence for topic modeling in the deep learning field [3, 13, 22, 26] marginalize out
the discrete variables zd,i and only consider continuous variables. In contrast,
CTVAE marginalizes out the topic probabilities. This marginalizing out is also
performed in the Gibbs sampling [5], the collapsed variational Bayes [23], and
the variational inference proposed by [15]. This paper follows [15], marginalizes out the topic probabilities, and thereby proposes a new VAE for topic modeling. Since CTVAE marginalizes out the topic probabilities, we do not need to estimate KL(q(θd) ‖ p(θd; α)) in Eq. (4) and thus can avoid the latent variable collapse that the existing VAEs [14, 22] have suffered from.
Table 4. Top 10 highest-probability words in randomly chosen five topics (DBLP)
CTVAE
structure probabilistic game hoc partial bounds project weighted dependent structural
modeling neural energy images object driven processes group code perspective
frequency noise collaborative joint hierarchical open behavior review transmission surface
robots inter convolutional policies symbolic grids stabilization sliding transient play
hybrid graph mining rate features type equations functional some challenges
VAE
graphs equations polynomials equation sets matrices finite numbers number polynomial
real time planning software tool carlo models scheduling monte making
encryption secure privacy key authentication authenticated xml queries query efficient
optimization process bayesian prediction evolutionary making time processes using portfolio
codes channels coding fading decoding modulation mimo video channel ofdm
VB
network based function radial multi using systems dynamic management neural
order delay automata cellular load finite stability quantum higher first
web random services secure privacy games e key de public
controller fpga design generation based next connectivity driven network intelligent
robot artificial time optimal joint real systematic synthesis filter humanoid
AVB
management ad distributed routing editorial wireless hoc network sensor agent
equations nonlinear finite order linear graphs differential equation estimation method
theory finite linear trees minimum reviews codes structure simulation methods
image algorithm fast segmentation detection images coding feature motion video
hoc ad wireless networks radio sensor channel routing energy access
The VAE proposed for topic modeling by [22] tackles latent variable collapse only by tweaking the optimization. In this paper, we proposed a possible way to avoid estimating the problematic KL-divergence. It is also an important contribution of this paper to show empirically that the Gumbel-softmax trick works for the discrete latent variables zd,i in topic modeling.
5 Conclusions
We proposed a new variational autoencoder for topic modeling, called context-
dependent token-wise variational autoencoder (CTVAE), by marginalizing out
the per-document topic probabilities. The main feature of CTVAE was that the
sampling from the variational posterior was performed in a token-wise manner
with respect to the context specified by each document. The context was rep-
resented as the mean of the embedding vectors of the words each document
contains. This feature led to better test perplexity than the existing VAE for some data sets. However, the improvement was not remarkable. It is an important future research direction to improve the performance of CTVAE. The experimental results may indicate that we have not yet found an effective way to train the encoder MLP of CTVAE.
References
1. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn.
Res. 3, 993–1022 (2003)
2. Bouma, G.: Normalized (Pointwise) Mutual Information in Collocation Extraction.
In: From Form to Meaning: Processing Texts Automatically, Proceedings of the
Biennial GSCL Conference 2009. pp. 31–40 (2009)
3. Dieng, A.B., Wang, C., Gao, J., Paisley, J.W.: TopicRNN: A recurrent neural
network with long-range semantic dependency. CoRR abs/1611.01702 (2016),
http://arxiv.org/abs/1611.01702
4. Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair,
S., Courville, A.C., Bengio, Y.: Generative adversarial nets. In: Advances in Neural
Information Processing Systems 27 (NIPS). pp. 2672–2680 (2014)
5. Griffiths, T.L., Steyvers, M.: Finding scientific topics. Proceedings of the National
Academy of Sciences 101(suppl 1), 5228–5235 (2004)
6. Hoffman, M.D., Blei, D.M., Bach, F.R.: Online learning for latent Dirichlet al-
location. In: Advances in Neural Information Processing Systems 23 (NIPS). pp.
856–864 (2010)
7. Jang, E., Gu, S., Poole, B.: Categorical reparameterization with Gumbel-softmax.
CoRR abs/1611.01144 (2016), http://arxiv.org/abs/1611.01144
8. Kim, Y., Wiseman, S., Miller, A.C., Sontag, D., Rush, A.M.: Semi-amortized vari-
ational autoencoders. In: Proceedings of the 35th International Conference on Ma-
chine Learning (ICML). pp. 2683–2692 (2018)
9. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. CoRR
abs/1412.6980 (2014), http://arxiv.org/abs/1412.6980
10. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. CoRR
abs/1312.6114 (2013), http://arxiv.org/abs/1312.6114
11. Masada, T., Takasu, A.: Adversarial learning for topic models. In: Proceedings
of the 14th International Conference on Advanced Data Mining and Applications
(ADMA). pp. 292–302 (2018)
12. Mescheder, L.M., Nowozin, S., Geiger, A.: Adversarial variational Bayes: Unifying
variational autoencoders and generative adversarial networks. In: Proceedings of
the 34th International Conference on Machine Learning (ICML). pp. 2391–2400
(2017)
13. Miao, Y., Grefenstette, E., Blunsom, P.: Discovering discrete latent topics with
neural variational inference. In: Proceedings of the 34th International Conference
on Machine Learning (ICML). pp. 2410–2419 (2017)
14. Miao, Y., Yu, L., Blunsom, P.: Neural variational inference for text processing. In:
Proceedings of the 33rd International Conference on Machine Learning (ICML).
pp. 1727–1736 (2016)
15. Mimno, D., Hoffman, M.D., Blei, D.M.: Sparse stochastic inference for latent
Dirichlet allocation. In: Proceedings of the 29th International Conference on Ma-
chine Learning (ICML). pp. 1515–1522 (2012)
16. Mnih, A., Hinton, G.E.: A scalable hierarchical distributed language model. In:
Advances in Neural Information Processing Systems 21 (NIPS). pp. 1081–1088
(2008)
17. Press, O., Wolf, L.: Using the output embedding to improve language models.
CoRR abs/1608.05859 (2016), http://arxiv.org/abs/1608.05859
18. Rezende, D.J., Mohamed, S., Wierstra, D.: Stochastic backpropagation and ap-
proximate inference in deep generative models. In: Proceedings of the 31st Inter-
national Conference on Machine Learning (ICML). pp. II–1278–II–1286 (2014)
19. Roeder, G., Wu, Y., Duvenaud, D.K.: Sticking the landing: Simple, lower-variance
gradient estimators for variational inference. In: Advances in Neural Information
Processing Systems 30 (NIPS). pp. 6928–6937 (2017)
20. Salimans, T., Knowles, D.A.: Fixed-form variational posterior approximation
through stochastic linear regression. CoRR abs/1206.6679 (2012), http://
arxiv.org/abs/1206.6679
21. Sønderby, C.K., Raiko, T., Maaløe, L., Sønderby, S.K., Winther, O.: Ladder vari-
ational autoencoders. In: Advances in Neural Information Processing Systems 29
(NIPS). pp. 3738–3746 (2016)
22. Srivastava, A., Sutton, C.: Autoencoding variational inference for topic models.
CoRR abs/1703.01488 (2017), http://arxiv.org/abs/1703.01488
23. Teh, Y.W., Newman, D., Welling, M.: A collapsed variational Bayesian inference
algorithm for latent Dirichlet allocation. In: Advances in Neural Information Pro-
cessing Systems 19 (NIPS). pp. 1353–1360 (2006)
24. Titsias, M.K., Lázaro-Gredilla, M.: Doubly stochastic variational Bayes for non-
conjugate inference. In: Proceedings of the 31st International Conference on Ma-
chine Learning (ICML). pp. II–1971–II–1980 (2014)
25. Tokui, S., Sato, I.: Reparameterization trick for discrete variables. CoRR
abs/1611.01239 (2016), http://arxiv.org/abs/1611.01239
26. Wang, W., Gan, Z., Wang, W., Shen, D., Huang, J., Ping, W., Satheesh, S., Carin,
L.: Topic compositional neural language model. In: Proceedings of the 21st Interna-
tional Conference on Artificial Intelligence and Statistics (AISTATS). pp. 356–365
(2018)