Using the Wisdom of the Crowd to Help Improve Catering Service
Fabio Amaral
17 November 2015
Introduction
Checking the Yelp star rating and user reviews of a business is a useful way to gauge the general
user perception of a service's quality, but it remains a subjective task, and learning why a business is
considered a success or a failure requires a thorough reading of the available written reviews.
The aims of this project were therefore to use the text of available Yelp reviews to predict the star
rating given by each reviewer and, in the process, to learn the topics that customers care about most.
Topic modelling techniques were used for this purpose, and the information learned should help
users make better-informed choices more easily and help business owners and managers identify
potential service improvement opportunities.
Methods and Data
The data set used in this project was downloaded in JSON format from the link provided and comprises
information about local businesses in 10 cities across 4 countries. Detailed information about the data,
and about the ongoing competition of which this data set is part, can be found on the Yelp Dataset
Challenge webpage.
The R packages jsonlite, plyr, dplyr, stringr, ff and ffbase were used for parsing the data set into a standard
data.frame. The data contain a total of 1,569,264 reviews of 61,184 unique businesses covering a
wide variety of service types, such as bookstores, building contractors and drugstores. Restaurants account
for around 36% of the businesses in this data set, so we focused on this category in an attempt to extract
more specific and actionable information from the reviews.
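As a minimal sketch of this filtering step (in Python for illustration; the original analysis was done in R, and the field names below are assumptions about the parsed JSON):

```python
# Hypothetical sketch: keep only parsed business records whose category
# list contains "Restaurants". Field names are illustrative assumptions.
def filter_restaurants(businesses):
    """Return the subset of businesses categorized as restaurants."""
    return [b for b in businesses if "Restaurants" in b.get("categories", [])]

sample = [
    {"business_id": "b1", "categories": ["Restaurants", "Mexican"]},
    {"business_id": "b2", "categories": ["Bookstores"]},
]
restaurants = filter_restaurants(sample)
```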
The variables text (review), review count, star (business) and star (review) were extracted as presented
in the original data set, and a few others were created to assist in rating prediction and in extracting
relevant information from the reviews. The variable delta star represents the deviation of the review
rating from the provided business rating (review star - business star). A related
categorical variable, review effect, with levels positive, negative and neutral, indicates
the effect of the review on the business rating (e.g. if review star - business star > 0 then review
effect = positive). The variable sentiment was created by sentiment analysis of the review text using
the polarity function from the qdap package; it ranges from -0.77720 to 0.99740 and indicates how
negative or positive the words in each review are (Figure 1d).
Given the multinational nature of the data set, a number of reviews were written in languages other than
English, the most evident being German (435), French (334) and Spanish (9). These reviews were
automatically translated via the Microsoft Translator API using the R package translateR. The packages tm,
slam and SnowballC were used for manipulating the text and creating a sparse matrix of word counts for
bag-of-words modelling. The package tau was used for word and n-gram frequency analyses.
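A minimal bag-of-words construction along these lines might look like the following (a Python sketch, not the tm/slam code actually used; stemming via SnowballC is omitted):

```python
# Bag-of-words sketch: lower-case and tokenize each review, then count
# term occurrences per document; the vocabulary is the union of all terms.
from collections import Counter
import re

def bag_of_words(documents):
    """Return (sorted vocabulary, per-document term counts)."""
    counts = [Counter(re.findall(r"[a-z']+", doc.lower())) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    return vocabulary, counts

vocab, counts = bag_of_words(["Great food, great service", "Slow service"])
```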
The package lda was used for rating prediction via topic modelling with supervised latent
Dirichlet allocation (sLDA), as introduced by Blei and McAuliffe (2010), and interactive visualizations were
created with the LDAvis package. Structural topic models, which incorporate relevant metadata such as the
review-related variables described above, as introduced by Roberts et al. (2015), were further estimated
with the stm package, and an interactive visualization was created with the stmBrowser package.
The topic modelling techniques used in this project are modified versions of the more general latent Dirichlet
allocation, which views each document as a mixture of topics, each topic being a latent probability
distribution over terms. The topics are said to form a mixed membership of terms: a term can appear in more
than one topic with different probabilities. In the supervised version used in this work, linear regression
is used to predict the labelled review rating using the inferred per-document topic proportions as predictors.
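The response step can be illustrated as follows (a Python sketch with made-up coefficients; in sLDA the coefficients eta are estimated jointly with the topics rather than supplied by hand):

```python
# Illustrative sketch of the sLDA response model: the predicted rating is
# a linear combination of the document's topic proportions. The topic
# proportions and eta values below are invented for illustration.
def predict_rating(topic_proportions, eta, intercept=0.0):
    """Linear regression prediction from per-document topic proportions."""
    return intercept + sum(p * e for p, e in zip(topic_proportions, eta))

# A review that is 70% a "friendly service" topic and 30% a "long waits" topic:
rating = predict_rating([0.7, 0.3], eta=[4.8, 1.9])
```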
Results
Inspection of the review star rating distribution (figure 1a) shows an imbalance dominated by
positive reviews: ratings of 3 to 5 stars are far more frequent than 1 and 2 stars, with a mean of 3.7 and
a median of 4. This skew is not observed for the cumulative business ratings (figure 1b), which
are approximately normally distributed with mean 3.48 and median 3.5. The distribution of the deviation of
the review rating from the cumulative business rating (figure 1c) is also approximately normal (mean = 0.22,
median = 0.5), with slightly more reviews below the cumulative business rating than above it. However,
sentiment analysis of the review texts indicates that the overall sentiment balance of the reviews is
markedly positive: 83.56% of reviews are considered positive, 11.33% negative and 5.11% neutral.
The discrepancy between the rating-variation and sentiment-score distributions highlights the
complex challenge of integrating the objective star-rating metric with the subjectivity of the written review
texts for the purpose of making predictions.
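The three-way sentiment classification behind these percentages can be sketched as follows (thresholding the polarity score at zero is an assumption of this sketch; the original analysis used qdap's polarity function in R):

```python
# Bucket a continuous polarity score into the three sentiment classes
# reported above. The zero threshold is an illustrative assumption.
def sentiment_class(score):
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```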
[Figure 1 appears here with four panels: a) Review Ratings (density vs. star rating), b) Business Ratings (density vs. star rating), c) Rating Variation (density vs. review star − business star), and d) Sentiment Score Distribution (relative frequency vs. sentiment score).]
Figure 1. Exploratory analysis of business/review ratings and sentiment score distributions.
As with the original LDA algorithm, the structure of the topics modelled by sLDA is mostly influenced by
the number of topics (K), which is chosen arbitrarily, the prior parameter of the Dirichlet distribution over
per-document topic distributions (alpha), and the response parameter (eta). A large grid of
values for these parameters was therefore used to produce multiple models, of which the combination (K=40,
alpha=1, eta=0.1) that yielded the lowest root mean squared error (RMSE = 0.8764) was selected for
further analysis (figure 2).
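The tuning procedure amounts to a grid search over (K, alpha, eta), which can be sketched as follows (Python, with a hypothetical fit_and_score stand-in for training and evaluating an sLDA model):

```python
# Grid-search sketch: fit a model for every (K, alpha, eta) combination
# and keep the one with the lowest RMSE. `fit_and_score` is a hypothetical
# stand-in for training sLDA and evaluating it on held-out reviews.
from itertools import product

def tune(ks, alphas, etas, fit_and_score):
    best = None
    for k, alpha, eta in product(ks, alphas, etas):
        rmse = fit_and_score(k, alpha, eta)
        if best is None or rmse < best[0]:
            best = (rmse, k, alpha, eta)
    return best

# Toy scoring function whose minimum happens to sit at K=40:
best = tune([10, 40, 65], [1], [0.1], lambda k, a, e: abs(k - 40) / 100 + 0.87)
```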
[Figure 2 appears here with two panels of RMSE per trial: best tuning parameters for K = 5−15 (K=15, alpha=1, eta=0.1, RMSE=0.8927) and for K = 10−65 (K=40, alpha=1, eta=0.1, RMSE=0.8764).]
Figure 2. Tuning the parameters of sLDA for improved rating prediction measured by RMSE.
Figure 3 summarizes the correlation between the most frequent words of the 40 modelled topics
and the estimated review rating. Since sLDA uses regression, the estimated ratings for the combinations of
frequent words from each topic span a wider range than the 1 to 5 stars. As shown in
the legends, the size of each spot represents the absolute t-value of the regression coefficient. This
analysis shows that topics involving friendliness, location, a varied choice of desserts and fresh salads, and
restaurant grand openings are most strongly associated with very high review ratings. On the other hand,
poor food taste or quality, extended waiting times and waiters' mistakes are the major causes of customer
dissatisfaction.
[Figure 3 appears here: the 40 modelled topics, each labelled by its five most frequent words (e.g. "location staff friendly lot great", "order minutes time service waiting"), plotted against the estimated rating (Estimate, −10 to 5), with spot sizes proportional to the absolute t-value of the regression coefficient.]
Figure 3. Review rating estimation associated with the 40 modeled topics.
Figure 4 shows how the estimated ratings are distributed relative to the review ratings actually given
by the Yelp users. Estimation is evidently better for higher ratings, as would be expected
from a data set imbalanced with respect to the number of reviews at each star rating. The out-of-bag
estimation RMSE was 0.8579872 on the training set and 0.8842336 on the test set, with an R-squared value
of 0.4428283. This is a much better estimate than that of a baseline model classifying all reviews
as the most frequent rating (4 stars), which would yield an RMSE of 1.2225789 and an R-squared value of
-0.0651453.
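The RMSE and R-squared metrics used for this comparison can be computed as in the following sketch (Python; the toy ratings below are illustrative, not the actual data):

```python
# RMSE and R-squared for a constant-prediction baseline. A baseline that
# always predicts the most frequent rating explains none of the variance,
# so its R-squared is at most zero.
import math
from statistics import mean

def rmse(actual, predicted):
    return math.sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))

def r_squared(actual, predicted):
    ybar = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - ybar) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

stars = [5, 4, 4, 3, 1]          # toy review ratings
baseline = [4] * len(stars)      # predict 4 stars for everything
```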
Attempts were made to address the review rating imbalance, but with little success. A balanced
training set created by randomly selecting 1,000 reviews of each rating had little to no
effect on the prediction of lower ratings. Another attempted solution, which did seem to reduce the
overlap between the estimated ratings, was to balance the training set, after the train/test split, by
randomly sampling reviews with 1, 2, 3 and 5 star ratings with replacement until each matched the number
of 4-star reviews (nearly 5,000 reviews each). This helped separate the ratings better, especially the
lower ones, but introduced a strong artificial bias towards negative words in the topic
modelling and was therefore not considered for further analysis.
[Figure 4 appears here: density curves of predicted ratings (0 to 6), one curve per actual review rating (1 to 5 stars).]
Figure 4. Predicted rating distributions.
The LDAvis package was used to assist in the exploration, and a representative image of a topic (desserts),
strongly related to high-rating reviews, can be viewed in figure 5. Feel free to explore the topics and their
related terms by clicking on the figure itself or on this link. The principal component analysis on the left
illustrates how close the topics are to each other; clicking on a numbered topic spot adjusts the term
frequency bar plot on the right accordingly. Hovering over the term labels of the bar plot shows the topics
in which each term appears, since terms may belong to multiple topics. The relevance of the terms relative
to each topic can be further tuned by changing the value of lambda with the slider in the top right-hand
corner.
Some of the top predictor topics with respect to the estimated ratings (as per figure 3) are numbers 30 (good
impressions), 36 (Las Vegas), 3 (desserts), 15 (grand openings), 33 (locations), 34 (hot dogs), 39 (Montreal)
and 20 (fresh salads). Some of the topics related to the worst rating estimates are 7 (waiting times), 10
(issues with the food) and 35 (fast-food drive-throughs).
Figure 5. Visualization of the supervised LDA analysis created with the LDAvis package. Click on this link or
on the figure above to open the analysis in a web browser for interactive visualization.
Other metadata, such as sentiment scores, the deviation of the review rating from the cumulative
business rating (delta.star), and complete selected review documents, were further explored by structural
topic modelling and visualized with the help of the stmBrowser package; a representative result of a positive
topic correlation with increased star rating can be viewed in figure 6. To access the interactive plot, please
click on the figure itself or on this link.
To reproduce figure 6, please select the following options from the respective drop-down menus: X-axis =
delta-star, Y-axis = Topic 7, Radius = review.star, Color = sentiment. Each plotted spot represents one
sampled review document, which can be clicked to read its text.
By making use of the delta.star metric, one can more easily isolate individual reviews associated with each
topic and focus on what most influenced the user experience at the time the review was written. It is
particularly informative to focus on the reviews whose rating deviates most from the cumulative business
rating. We can thereby verify and confirm the relevance of the topics within the review texts, i.e. the
themes most associated with the reduction or improvement of a business rating. Figure 6 shows a very
representative selected review commenting on the quality of the food and the friendly and efficient service,
all of which had been identified by sLDA topic modelling as strong predictors of higher review ratings.
Figure 6. Visualization of the structural topic model created with the stmBrowser package. Click on this
link or on the figure above to open the analysis in a web browser for interactive visualization.
Discussion
The difference between the success and failure of a business, or between a good and a bad customer
experience, can come down to details that simple star ratings cannot convey. The availability of user
reviews provides the means of identifying the main reasons for customer satisfaction. With the ever-growing
number of reviews, an automated system to analyse them and isolate actionable information becomes very
important. We have seen that topic modelling techniques can be very effective at this task, and implementing
such analysis on review sites such as Yelp could further increase the usefulness of their rich databases,
both for business owners and managers and for their customers.
We have learned that in the catering business an attentive and friendly service is a key driver of
customer satisfaction, as is a good selection of tasty food, fresh salads and vegetables, and desserts,
especially ice cream. A similar approach could be applied to other business classes to offer equivalent
insights.
One limitation of the analysis presented in this study was the imbalance between review ratings, with far
fewer 1- and 2-star reviews than higher ones. It might be possible to further improve the analysis by using
a more balanced data set or by using a term-weighting scheme to better calibrate the rating predictions.

Contenu connexe

En vedette

Morgan Resume 2016 Revised
Morgan Resume 2016 RevisedMorgan Resume 2016 Revised
Morgan Resume 2016 RevisedMorgan Robinson
 
SULI HYDE J Report
SULI HYDE J ReportSULI HYDE J Report
SULI HYDE J ReportJeremy Hyde
 
Richard Russell resume 2014[1][1]
Richard Russell resume 2014[1][1]Richard Russell resume 2014[1][1]
Richard Russell resume 2014[1][1]Richard Russell
 
India Health Insurance Scenario - September 2012
India Health Insurance Scenario - September 2012India Health Insurance Scenario - September 2012
India Health Insurance Scenario - September 2012Sudip Mukhopadhyay
 
Linminjung ux resume
Linminjung ux resumeLinminjung ux resume
Linminjung ux resumeLin Min-Jung
 
FINAL MICRO PAPER
FINAL MICRO PAPERFINAL MICRO PAPER
FINAL MICRO PAPERGreg Poapst
 
Stephen Kazman - School Trustee Ward 5
Stephen Kazman - School Trustee Ward 5Stephen Kazman - School Trustee Ward 5
Stephen Kazman - School Trustee Ward 5Stephen Kazman
 

En vedette (8)

Morgan Resume 2016 Revised
Morgan Resume 2016 RevisedMorgan Resume 2016 Revised
Morgan Resume 2016 Revised
 
SULI HYDE J Report
SULI HYDE J ReportSULI HYDE J Report
SULI HYDE J Report
 
Richard Russell resume 2014[1][1]
Richard Russell resume 2014[1][1]Richard Russell resume 2014[1][1]
Richard Russell resume 2014[1][1]
 
India Health Insurance Scenario - September 2012
India Health Insurance Scenario - September 2012India Health Insurance Scenario - September 2012
India Health Insurance Scenario - September 2012
 
Yoerges UX Resume
Yoerges UX ResumeYoerges UX Resume
Yoerges UX Resume
 
Linminjung ux resume
Linminjung ux resumeLinminjung ux resume
Linminjung ux resume
 
FINAL MICRO PAPER
FINAL MICRO PAPERFINAL MICRO PAPER
FINAL MICRO PAPER
 
Stephen Kazman - School Trustee Ward 5
Stephen Kazman - School Trustee Ward 5Stephen Kazman - School Trustee Ward 5
Stephen Kazman - School Trustee Ward 5
 

Similaire à Final.Version

Predicting Yelp Review Star Ratings with Language
Predicting Yelp Review Star Ratings with LanguagePredicting Yelp Review Star Ratings with Language
Predicting Yelp Review Star Ratings with LanguageSebastian W. Cheah
 
PredictingYelpReviews
PredictingYelpReviewsPredictingYelpReviews
PredictingYelpReviewsGary Giust
 
Yelp Rating Prediction
Yelp Rating PredictionYelp Rating Prediction
Yelp Rating PredictionKartik Lunkad
 
Scaling in research
Scaling  in researchScaling  in research
Scaling in researchankitsengar
 
Exploratory data analysis and data mining on yelp restaurant review
Exploratory data analysis and data mining on yelp restaurant review Exploratory data analysis and data mining on yelp restaurant review
Exploratory data analysis and data mining on yelp restaurant review PoojaPrasannan4
 
Rating Prediction for Restaurant
Rating Prediction for Restaurant Rating Prediction for Restaurant
Rating Prediction for Restaurant Yaqing Wang
 
Measurement and scaling techniques
Measurement and scaling techniquesMeasurement and scaling techniques
Measurement and scaling techniquesSarfaraz Ahmad
 
Chotu scaling techniques
Chotu scaling techniquesChotu scaling techniques
Chotu scaling techniquesPruseth Abhisek
 
Measurement and scaling techniques
Measurement and scaling techniquesMeasurement and scaling techniques
Measurement and scaling techniquesKritika Jain
 
Customer Satisfaction Data - Multiple Linear Regression Model.pdf
Customer Satisfaction Data -  Multiple Linear Regression Model.pdfCustomer Satisfaction Data -  Multiple Linear Regression Model.pdf
Customer Satisfaction Data - Multiple Linear Regression Model.pdfruwanp2000
 
A Supervised Modeling Approach to Determine Elite Status of Yelp Members
A Supervised Modeling Approach to Determine Elite Status of Yelp MembersA Supervised Modeling Approach to Determine Elite Status of Yelp Members
A Supervised Modeling Approach to Determine Elite Status of Yelp MembersJennifer (Hui) Li
 
Text Data Mining and Predictive Modeling of Online Reviews
Text Data Mining and Predictive Modeling of Online ReviewsText Data Mining and Predictive Modeling of Online Reviews
Text Data Mining and Predictive Modeling of Online ReviewsMark Chesney
 
Measurement and scaling techniques
Measurement  and  scaling  techniquesMeasurement  and  scaling  techniques
Measurement and scaling techniquesUjjwal 'Shanu'
 
T4 measurement and scaling
T4 measurement and scalingT4 measurement and scaling
T4 measurement and scalingkompellark
 
Rating Evaluation Methods through Correlation MTE 2014 Workshop May 2014
Rating Evaluation Methods through Correlation MTE 2014 Workshop May 2014Rating Evaluation Methods through Correlation MTE 2014 Workshop May 2014
Rating Evaluation Methods through Correlation MTE 2014 Workshop May 2014Welocalize
 
Research Method for Business chapter 7
Research Method for Business chapter  7Research Method for Business chapter  7
Research Method for Business chapter 7Mazhar Poohlah
 
Comparative analysis of Retail chains based on SERVQUAL Model
Comparative analysis of Retail chains based on  SERVQUAL ModelComparative analysis of Retail chains based on  SERVQUAL Model
Comparative analysis of Retail chains based on SERVQUAL ModelSuresh Singh
 
IRJET- Survey of Classification of Business Reviews using Sentiment Analysis
IRJET- Survey of Classification of Business Reviews using Sentiment AnalysisIRJET- Survey of Classification of Business Reviews using Sentiment Analysis
IRJET- Survey of Classification of Business Reviews using Sentiment AnalysisIRJET Journal
 

Similaire à Final.Version (20)

Predicting Yelp Review Star Ratings with Language
Predicting Yelp Review Star Ratings with LanguagePredicting Yelp Review Star Ratings with Language
Predicting Yelp Review Star Ratings with Language
 
PredictingYelpReviews
PredictingYelpReviewsPredictingYelpReviews
PredictingYelpReviews
 
Yelp Rating Prediction
Yelp Rating PredictionYelp Rating Prediction
Yelp Rating Prediction
 
Scaling in research
Scaling  in researchScaling  in research
Scaling in research
 
Exploratory data analysis and data mining on yelp restaurant review
Exploratory data analysis and data mining on yelp restaurant review Exploratory data analysis and data mining on yelp restaurant review
Exploratory data analysis and data mining on yelp restaurant review
 
Rating Prediction for Restaurant
Rating Prediction for Restaurant Rating Prediction for Restaurant
Rating Prediction for Restaurant
 
Measurement and scaling techniques
Measurement and scaling techniquesMeasurement and scaling techniques
Measurement and scaling techniques
 
Chotu scaling techniques
Chotu scaling techniquesChotu scaling techniques
Chotu scaling techniques
 
Measurement and scaling techniques
Measurement and scaling techniquesMeasurement and scaling techniques
Measurement and scaling techniques
 
Customer Satisfaction Data - Multiple Linear Regression Model.pdf
Customer Satisfaction Data -  Multiple Linear Regression Model.pdfCustomer Satisfaction Data -  Multiple Linear Regression Model.pdf
Customer Satisfaction Data - Multiple Linear Regression Model.pdf
 
ch 13.pptx
ch 13.pptxch 13.pptx
ch 13.pptx
 
A Supervised Modeling Approach to Determine Elite Status of Yelp Members
A Supervised Modeling Approach to Determine Elite Status of Yelp MembersA Supervised Modeling Approach to Determine Elite Status of Yelp Members
A Supervised Modeling Approach to Determine Elite Status of Yelp Members
 
Text Data Mining and Predictive Modeling of Online Reviews
Text Data Mining and Predictive Modeling of Online ReviewsText Data Mining and Predictive Modeling of Online Reviews
Text Data Mining and Predictive Modeling of Online Reviews
 
Measurement and scaling techniques
Measurement  and  scaling  techniquesMeasurement  and  scaling  techniques
Measurement and scaling techniques
 
T4 measurement and scaling
T4 measurement and scalingT4 measurement and scaling
T4 measurement and scaling
 
Rating Evaluation Methods through Correlation MTE 2014 Workshop May 2014
Rating Evaluation Methods through Correlation MTE 2014 Workshop May 2014Rating Evaluation Methods through Correlation MTE 2014 Workshop May 2014
Rating Evaluation Methods through Correlation MTE 2014 Workshop May 2014
 
Presentation - SERVQUAL
Presentation - SERVQUALPresentation - SERVQUAL
Presentation - SERVQUAL
 
Research Method for Business chapter 7
Research Method for Business chapter  7Research Method for Business chapter  7
Research Method for Business chapter 7
 
Comparative analysis of Retail chains based on SERVQUAL Model
Comparative analysis of Retail chains based on  SERVQUAL ModelComparative analysis of Retail chains based on  SERVQUAL Model
Comparative analysis of Retail chains based on SERVQUAL Model
 
IRJET- Survey of Classification of Business Reviews using Sentiment Analysis
IRJET- Survey of Classification of Business Reviews using Sentiment AnalysisIRJET- Survey of Classification of Business Reviews using Sentiment Analysis
IRJET- Survey of Classification of Business Reviews using Sentiment Analysis
 

Final.Version

  • 1. Using the Wisdom of the Crowd to Help Improve Catering Service Fabio Amaral 17 November 2015 Introduction While checking the Yelp star rating and user reviews of a business is a great way for assessing the general user perception of a service’s quality, it is still a quite subjective task and requires a thorough evaluation of the written reviews available if one would like to learn the reasons for a business being considered a success or a failure. Therefore the aims of this project was to utilize the text of available Yelp reviews to try to predict the star ratings given by the reviewer and to try to learn in the process the topics that the customers care the most. Techniques of topic modelling were used for this purpose and the information learned should be useful for helping users to make better informed choices more easily and for business owners and managers to identify potential service improvement opportunities. Methods and Data The data set used in this project was downloaded in Json format from the link provided, which comprises of information about local businesses in 10 cities across 4 countries. Detailed information about the data and an ongoing competition of which this data set is part of can be found in Yelp Dataset Challenge webpage. The R packages jsonlite, plyr, dplyr, stringr, ff, ffbase were used for parsing the data set into a standard data.frame. The available data contains a total of 1,569,264 reviews from 61,184 unique businesses from a wide variety of service types such as bookstores, building contractors and drugstores. Restaurants comprise around 36 % of the businesses in this data set, therefore we focused on this category in an attempt to extract more specific and actionable information from reviews. 
The variables text (reviews), review counts, star (business), star (review) were extracted as presented in the original data set and a few others were created to assist in the rating prediction and extraction of relevant information from the reviews. The variable delta star was created to represent the rating variation of the review rating in relation to the provided business rating (review star - business star). A similar categorical variable review effect with the levels positive, negative and neutral was created to indicate the effect of the review on the business rating (e.g. if review star - business star > 0 then review effect = positive). The variable sentiment was created by sentiment analysis of the review text using the function polarity from the package qdap and has a value range from -0.77720 to 0.99740 indicating how negative or positive the words in each review are (Figure 1d). Given the multinational nature of the data set, a number of reviews were written in languages other than English with the most evident ones being German (435), French (334) and Spanish (9). These reviews were automatically translated via the Microsoft Translator API using the R package translateR. The packages tm, slam, SnowballC were used for manipulating text and creating a sparce matrix of word counts for bag-of-words modelling. The package tau was used for generating word and n-grams frequency analysis. The package lda was utilized for performing rating prediction via topic modelling with supervised latent Dirichlet allocation sLDA as introduced in Blei and McAuliffe 2010 and interactive visualizations were created with the package LDAvis. Structural topic model estimation was further performed using the package stm by including relevant metadata information such as the review related variables aforementioned as introduced by Roberts et al 2015 and interactive visualization was created with the package stmBrowser. 1
  • 2. The topic modeling techniques used in this project are modified versions of the more general latent Dirichlet allocation which views each document as mixture of topics formed by latent probability of some terms being present. The topics are said to form a mixed membership of term, that is a term can appear in more than one topic with different probabilities. In the supervised algorithm version used in this work linear regression is used to predict the labeled review rating using the infered topic/terms coefficients as predictors. Results Upon inspection of the review star ratings distribution (figure 1a) we can observe an imbalance dominated by positive reviews ratings between 5 and 3 stars with much less frequent 1 and 2 stars with a mean 3.7 and median 4. This skewed distribution bias is not observed for the cumulative business ratings (figure 2b) which is approximately normal with mean 3.48 and median 3.5. The distribution of the variation of review rating in relation of the cumulative business rating is also approximately normal (mean = 0.22 and median = 0.5) with a bit more reviews bellow the cumulative business rating than above. However a sentiment analysis of the review texts indicate that the overall sentiment balance of the reviews is markedly positive. 83.56 % reviews are considered positive, 11.33 % reviews are considered negative and 5.11 % reviews are considered neutral. The discrepancy between the rating variations and sentiment score distributions highlights the complex challenge of integrating the objective metrics of star rating with the subjectivity of written review texts for the purpose of making predictions. 
1 2 3 4 5 a) Review Ratings Density StarRating 0.00 0.10 0.20 0.30 1 1.5 2 2.5 3 3.5 4 4.5 5 b) Business Ratings Density StarRating 0.00 0.05 0.10 0.15 0.20 0.25 c) Rating Variation Review Star − Business Star Density −3 −2 −1 0 1 2 3 0.00.10.20.30.4 −0.5 0.0 0.5 1.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 d) Sentiment Score Distribution Sentiment Score Probability(RelativeFrequency) Figure 1. Exploratory analysis of busines/review ratings and sentiments score distributions Similarly to the original LDA algorithm, the structure of topics modeled by sLDA is mostly influenced by the number of arbitrarily selected topics (K) and the initialization prior parameter of the Dirichlet on the per-document topic distributions (alpha) and the response parameter (eta). Therefore a large combination of values for these parameters was used for producing multiple models of which the combination (K=40, alpha=1 and eta=0.1) that resulted in the lowest root mean squared error (RMSE = 0.8764) was selected for further analysis (figure 2). 1.0 1.2 1.4 1.6 1.8 0 10 20 30 40 50 trial RMSE variable K=5 K=6 K=7 K=8 K=9 K=10 K=11 K=12 K=13 K=14 K=15 Best Tuning Parameters (K = 5−15) K=15 alpha=1 eta=0.1 RMSE=0.8927 0.88 0.90 0.92 0.94 0.96 1 2 3 4 trial RMSE variable K=10 K=15 K=20 K=25 K=30 K=35 K=40 K=45 K=50 K=55 K=60 K=65 Best Tuning Parameters (K = 10−65) K=40 alpha=1 eta=0.1 RMSE=0.8764 Figure 2. Tuning the parameter of sLDA for improved rating prediction measured by RMSE. Figure 3 shows a summary result of the analysis of correlation between the most frequent words from the modeled 40 topics and the estimated review rating. Since regression is used in sLDA we get a wider range of estimated ratings than 1 to 5 stars for the combination of frequent words from each topic. As shown in 2
  • 3. the legends, the sizes of each spot represent the t-value of the regression coefficients. We can see from this analysis that topics involving friendliness, location, option variety of desert and fresh salads and restaurant grand openings are mostly associated with the very high review ratings. On the other hand the food taste quality, extended service waiting times and waiters mistakes are the major causes of customer dissatisfaction. bad didnt food tasted bland order minutes time service waiting good pretty place decent bad table ordered waitress asked didnt bit make dont meat places drive fast order mcdonalds window counter kids order dont line area place eat part small im people youre dont theyre food service stars fast quality sauce salad red fresh flavor nice inside tables side small chicken fried rice wings sauce nice good menu tea back review star place time im bbq meat pork potato sweet breakfast coffee eggs cafe morning bar beer great hour drinks night music bar late fun chinese soup beef rice pork sandwich sandwiches bread cheese subway pizza italian crust cheese pasta steak dinner meal restaurant appetizer fries burger cheese burgers bacon back wasnt didnt give good mexican tacos taco salsa burrito lunch great quick prices food wine dining room list menu sushi fish roll shrimp seafood food buffet thai dishes restaurant montreal st french la restaurant fresh options salads menu delicious hot dog make style dogs location staff friendly lot great opening staff free open grand cream ice dessert chocolate delicious vegas restaurant great strip recommend great place food family love im time day place friend ive love place awesome amazing −10 −5 0 5 Estimate Topics abs(t.value) 20 40 60 80 −10 −5 0 5 Estimate Figure 3. Review rating estimation associated with the 40 modeled topics. Figure 4 shows how the estimated ratings are distributed in relation to the review ratings actually given by the Yelp users. 
We can see a clearly better estimation for the higher ratings, as would be expected from a data set imbalanced with respect to the number of reviews at each star rating. The out-of-bag RMSE on the training set was 0.8580, and on the test set 0.8842 with an R-squared of 0.4428. This is a much better estimate than that of a baseline model classifying every review as the most frequent rating (4 stars), which would result in an RMSE of 1.2226 and an R-squared of −0.0651.

Attempts were made to address the review rating imbalance, but with little success. A balanced training set was created by randomly selecting 1,000 reviews of each rating, but this had little to no effect on the prediction of the lower ratings. Another attempt, which did seem to reduce the overlap between the estimated ratings, was to balance the training set by randomly sampling, with replacement, reviews rated 1, 2, 3 and 5 stars until each matched the number of 4-star reviews (nearly 5,000 each), after the train and test sets had already been split. This helped separate the lower ratings in particular, but caused a strong artificial bias towards negative words in the topic modeling and was therefore not considered for further analysis.

Figure 4. Predicted rating distributions (density of predicted ratings, grouped by actual review rating from 1 to 5 stars).

The package LDAvis was used to assist in the exploration, and a representative image of a topic (desserts), highly related to high-rating reviews, can be viewed in Figure 5. Feel free to explore the topics and their related terms by clicking on the figure itself or on this link. The principal component analysis on the left illustrates how close the topics are to each other, and clicking on a numbered topic spot adjusts the term frequency bar plot on the right accordingly. By hovering over the term labels of the bar plot it is possible to see the topics in which each term appears, since a term may be part of multiple topics. The relevance of the terms relative to each topic can be further tuned by changing the value of lambda with the slider in the top right-hand corner. Some of the top predictor topics for the estimated ratings (as per Figure 3) are 30 (good impressions), 36 (Las Vegas), 3 (desserts), 15 (grand openings), 33 (locations), 34 (hot dogs), 39 (Montreal) and 20 (fresh salads). Some of the topics related to the worst rating estimates are 7 (waiting times), 10 (issues with the food) and 35 (fast-food drive-throughs).

Figure 5. Visualization of the supervised LDA analysis created with the LDAvis package. Click on this link or on the figure above to open the analysis in a web browser for interactive visualization.

Other metadata, such as the sentiment scores, the rating variation between a review and the cumulative business rating (delta.star), and the complete selected review documents, were further explored by structural topic modeling and visualized with the help of the stmBrowser package; a representative result of a topic positively correlated with increased star ratings can be viewed in Figure 6. To access the interactive plot, please click on the figure itself or on this link. To reproduce Figure 6, select the following options from the respective drop-down menus: X-axis = delta.star, Y-axis = Topic 7, Radius = review.star, Color = sentiment. Each plotted spot represents one sampled review document, which can be clicked to read its text.
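As a concrete illustration of the delta.star metric, the sketch below (in Python; the field names are assumptions for illustration, not the original data schema) computes it for a few made-up reviews and ranks them by how far they deviate from the cumulative business rating:

```python
# delta.star: the review's star rating minus the business's cumulative
# star rating. Reviews with a large |delta.star| deviate most from the
# general perception of the business. All values below are made up.
reviews = [
    {"business_id": "b1", "review_star": 5, "business_star": 2.5},
    {"business_id": "b1", "review_star": 2, "business_star": 2.5},
    {"business_id": "b2", "review_star": 1, "business_star": 4.5},
]

for r in reviews:
    r["delta_star"] = r["review_star"] - r["business_star"]

# Focus on the reviews that deviate most from the cumulative rating.
outliers = sorted(reviews, key=lambda r: abs(r["delta_star"]), reverse=True)
print([r["delta_star"] for r in outliers])  # [-3.5, 2.5, -0.5]
```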
By making use of the delta.star metric, one can more easily isolate the individual reviews associated with each topic and focus on what most influenced the user experience at the time the review was written. It is particularly informative to focus on the reviews whose ratings deviate most from the cumulative business rating, as these let us verify, within the review texts themselves, the relevance of the modeled topics, that is, the themes most associated with the reduction or improvement of a business's rating. Figure 6 shows a very representative selected review commenting on the quality of the food and the friendly, efficient service, all of which had been identified by the sLDA topic modeling as strong predictors of higher review ratings.

Figure 6. Visualization of the structural topic model created with the stmBrowser package. Click on this link or on the figure above to open the analysis in a web browser for interactive visualization.

Discussion

The difference between the success and the failure of a business, or between a good and a bad customer experience, can come down to details that simple star ratings cannot convey. The availability of user reviews provides the means to identify the main reasons for customer satisfaction, and with the ever-growing number of reviews, an automated system to analyse them and isolate actionable information becomes very important. We have seen that topic modeling techniques can be very effective at this task, and implementing such an analysis on review sites such as Yelp could further increase the usefulness of their rich databases, both for business owners and managers and for their customers. We have learned that in the catering business an attentive and friendly service is a key indicator of customer satisfaction, as is a good selection of tasty food, fresh salads and vegetables, and desserts, especially ice cream. A similar approach could be applied to other business classes to offer equivalent insights.

One possible limitation of the analysis presented in this study was the imbalance between review ratings, with far fewer 1- and 2-star reviews than higher-rated ones. It might be possible to further improve the analysis by using a more balanced data set or a term-weighting scheme to better calibrate the rating predictions.
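The oversampling scheme discussed earlier, sampling the minority ratings with replacement until every class matches the most frequent one, can be sketched as follows (in Python; the rating counts here are simulated, not the actual Yelp counts):

```python
import random
from collections import Counter

random.seed(42)
# Imbalanced ratings, roughly mimicking the skew toward 4-5 stars.
ratings = [1] * 50 + [2] * 80 + [3] * 150 + [4] * 500 + [5] * 400

counts = Counter(ratings)
target = max(counts.values())  # size of the largest class (4 stars here)

balanced = []
for star in sorted(counts):
    pool = [r for r in ratings if r == star]
    # Sample with replacement up to the target class size.
    balanced.extend(random.choices(pool, k=target))

print(Counter(balanced))  # every class now has `target` members
```

As noted above, this kind of resampling can separate the lower ratings better while biasing the learned topics toward negative vocabulary, so it should be validated against a held-out, untouched test set.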