2. Introduction
Yelp is a corporation that develops, hosts, and markets the Yelp.com website and
mobile application. Its purpose is to connect people with great local businesses by
publishing crowd-sourced reviews. Yelp users had written about 83 million reviews by
the end of the second quarter of 2015. The goal of this research is to explore the Yelp
dataset in order to get a sense of what the data looks like and what its characteristics
are. This is accomplished using techniques such as text representation, TF-IDF weighting,
topic modeling, frequent pattern discovery, and visualization.
Topic Modeling
A topic model is a kind of statistical model used to describe the abstract topics that
occur in a collection of documents. Particular words are expected to appear more or less
frequently in a document depending on its subject, so word frequencies give a strong
indication of what the document is about.
MeTA is a data science toolkit written in C++. It is a collection of tools and algorithms for
text mining and natural language processing.
In this task, Latent Dirichlet Allocation (LDA) was chosen as the topic model.
LDA is a generative model that allows sets of observations to be explained by
unobserved groups that explain why some parts of the data are similar. For text, the
observations are the words of each review and the unobserved groups are the topics.
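The generative view of LDA can be illustrated with a minimal sketch, which is not the MeTA implementation. It assumes a toy vocabulary, two topics, and symmetric Dirichlet priors of 0.5; all names (`sample_dirichlet`, `generate_document`, `topic_word`) are hypothetical. Each document first draws a topic mixture, then every word is produced by picking a topic from that mixture and a word from that topic.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, length, vocab, rng):
    """Generate one document following the LDA generative story."""
    # Per-document topic mixture (the "unobserved groups" and their weights).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        topic = rng.choices(range(len(topic_word)), weights=theta)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return theta, words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
vocab = ["pizza", "burger", "friendly", "rude", "tea", "juice"]  # toy vocabulary (assumption)
num_topics = 2  # assumption; the report uses twelve topics

# One word distribution per topic, drawn from a symmetric Dirichlet prior.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]

theta, doc = generate_document(topic_word, [0.5] * num_topics, 8, vocab, rng)
print("topic mixture:", theta)
print("document:", " ".join(doc))
```

Fitting LDA runs this story in reverse: given only the words, it infers the topic mixtures and the per-topic word distributions that most plausibly produced them.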
The Yelp dataset, provided as JSON files, was processed using Node.js and
MongoDB. NPM (Node Package Manager), the Mongoose mapper, the Winston logger, and
plain JavaScript were used to run the queries and generate the text files to be
processed by the MeTA toolkit.
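The export step can be sketched as follows. This is an illustration in Python using only the standard library, not the Node.js and MongoDB pipeline used in this work. It assumes each input line holds one JSON record with a `text` field and that the output should contain one review per line; the function name `reviews_to_lines` and the sample records are hypothetical.

```python
import io
import json


def reviews_to_lines(json_lines):
    """Yield one whitespace-normalized review text per input JSON record."""
    for raw in json_lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        record = json.loads(raw)
        # The "text" field name is an assumption about the record layout.
        text = record.get("text", "")
        # Collapse newlines and repeated spaces so each review fits on one line.
        flat = " ".join(text.split())
        if flat:
            yield flat


# Two made-up records standing in for a review file.
sample = io.StringIO(
    '{"stars": 5, "text": "Great tea.\\nFriendly staff."}\n'
    '{"stars": 2, "text": "Slow   service."}\n'
)

lines = list(reviews_to_lines(sample))
print("\n".join(lines))
```

For a real file, the `io.StringIO` object would be replaced by an open file handle, and the yielded lines written to the text file that the toolkit reads.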
3. Task 1.1
LDA was run against the whole dataset, that is, the text of all reviews, to extract
twelve topics of fifteen words each.
Some topics can be clearly identified, such as the quality of the service (Topics 4
and 5), kinds of food (Topics 0, 2, 8 and 11), condiments (Topic 9), and places.
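Listing a fixed number of words per topic amounts to ranking each topic's word probabilities and keeping the top entries. The sketch below shows that step on made-up numbers, assuming the fitted model is available as one probability row per topic; `top_words`, the vocabulary, and the probabilities are all hypothetical and do not come from the dataset.

```python
def top_words(topic_word, vocab, k):
    """Return the k highest-probability words for each topic, best first."""
    result = []
    for weights in topic_word:
        # Rank vocabulary indices by this topic's probability, descending.
        ranked = sorted(range(len(vocab)), key=lambda i: weights[i], reverse=True)
        result.append([vocab[i] for i in ranked[:k]])
    return result


# Toy values for illustration only (assumption, not results from the report).
vocab = ["pizza", "service", "friendly", "sauce", "tea"]
topic_word = [
    [0.50, 0.05, 0.05, 0.30, 0.10],  # a food-like topic
    [0.05, 0.45, 0.35, 0.05, 0.10],  # a service-like topic
]

topics = top_words(topic_word, vocab, 3)
for topic_id, words in enumerate(topics):
    print(topic_id, " ".join(words))
```

With twelve rows and `k=15`, the same function would produce a listing of the shape described above.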
4. Task 1.2
In this task, LDA was run against two subsets of the data.
Beverages
This comparison analyses the topics generated from reviews for two kinds of beverages:
juices and teas.
6. Ratings
This comparison analyses the topics generated from reviews with low ratings (three or
fewer stars) and high ratings (above four stars). The topics show a clear difference
between high and low ratings. The highly rated reviews contain words such as enjoyed,
delicious, fresh, and polite.
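The split into low-rated and high-rated subsets can be sketched as below. The thresholds follow the description above (three or fewer stars versus above four stars), so a four-star review falls into neither subset. The record layout with `stars` and `text` fields, the function name `split_by_rating`, and the sample reviews are assumptions for illustration.

```python
def split_by_rating(reviews, low_max=3, high_above=4):
    """Split review texts into low-rated and high-rated lists."""
    low, high = [], []
    for review in reviews:
        stars = review["stars"]
        if stars <= low_max:
            low.append(review["text"])
        elif stars > high_above:
            high.append(review["text"])
        # Reviews between the two thresholds are left out of both subsets.
    return low, high


# Made-up reviews for illustration only.
reviews = [
    {"stars": 1, "text": "Rude staff and cold food."},
    {"stars": 3, "text": "Average at best."},
    {"stars": 4, "text": "Pretty good overall."},
    {"stars": 5, "text": "Delicious, fresh, and polite service."},
]

low, high = split_by_rating(reviews)
print(len(low), "low-rated,", len(high), "high-rated")
```

Each list can then be written to its own text file and passed through the topic model separately, which makes the two sets of topics directly comparable.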