The Yelp Dataset consists of 1.6M customer reviews of 61K businesses. Three tasks are accomplished using this dataset:
1. Assign categories to businesses based on customer reviews
2. Recommend food items and services of a restaurant based on reviews
3. Determine Influential factors in a city affecting restaurants
2. Project Tasks
Task 1
Assign Categories to Business in the Yelp Data Set
Task 2
Recommend Food Items and/or services in a Restaurant
Task 3
Determine Influential Factors in a City affecting Restaurants
4. Task 1 : Methodology
[Diagram: Task 1 pipeline]
Inputs: Business-to-Review Map, Business-to-Category Map
Mapping Phase: Category-to-Review Mapping
Data split: Training Set, Testing Set
Index: Lucene Index (shown twice in the diagram)
Tf-Idf similarity: 1. Default 2. BM25 3. Dirichlet
Output: Predicted Categories
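The idea of ranking categories by term weight can be illustrated with a small sketch. This is not the Lucene setup from the slide: it is a pure-Python stand-in that assumes all training reviews of a category are merged into one document and scored with a plain tf-idf sum. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a real pipeline would use a Lucene analyzer)."""
    return text.lower().split()

def build_category_index(category_to_reviews):
    """Merge each category's training reviews into one bag of words and compute idf."""
    docs = {cat: Counter(tokenize(" ".join(reviews)))
            for cat, reviews in category_to_reviews.items()}
    n_docs = len(docs)
    doc_freq = Counter(term for counts in docs.values() for term in counts)
    idf = {term: math.log(1 + n_docs / df) for term, df in doc_freq.items()}
    return docs, idf

def predict_categories(review_text, docs, idf, top_k=1):
    """Rank categories by the summed tf-idf weight of the review's terms."""
    query = tokenize(review_text)
    scores = {cat: sum(counts[t] * idf.get(t, 0.0) for t in query)
              for cat, counts in docs.items()}
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return ranked[:top_k]

# Hypothetical training data: category -> reviews of businesses in that category
training = {
    "Pizza": ["great pizza and crispy crust", "pizza slice with cheese"],
    "Coffee & Tea": ["smooth espresso and latte", "cozy coffee shop"],
}
docs, idf = build_category_index(training)
print(predict_categories("best pizza crust in town", docs, idf))
```

Swapping the scoring line for BM25 or a Dirichlet-smoothed language model would correspond to the other two similarity options named on the slide.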
10. Feature Extraction
Every token has an associated POS tag
Tokens tagged "NN" are nouns and tokens tagged "JJ" are adjectives
Nouns are considered as features and adjectives as sentiments
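A minimal sketch of this split, assuming the tokens have already been tagged by some POS tagger. The function name is hypothetical, and matching on the tag prefix (so that NNS or JJR also count) is an assumption beyond the two tags named above.

```python
def extract_features_and_sentiments(tagged_tokens):
    """Split (token, POS tag) pairs into noun features and adjective sentiments."""
    features, sentiments = [], []
    for token, tag in tagged_tokens:
        if tag.startswith("NN"):      # assumption: NN, NNS, NNP, NNPS all count as nouns
            features.append(token.lower())
        elif tag.startswith("JJ"):    # assumption: JJ, JJR, JJS all count as adjectives
            sentiments.append(token.lower())
    return features, sentiments

# Hypothetical pre-tagged sentence
tagged = [("The", "DT"), ("bread", "NN"), ("was", "VBD"), ("fresh", "JJ"),
          ("and", "CC"), ("the", "DT"), ("burger", "NN"), ("was", "VBD"), ("bland", "JJ")]
features, sentiments = extract_features_and_sentiments(tagged)
print(features, sentiments)
```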
11. Feature Filtering
Features obtained from the Feature Extraction phase contain noise
The Task 1 solution is used to determine the categories of the input features
Only features whose categories are related to restaurants are kept for further processing
Before Feature Filtering: cheese, burger, ones, menu, combinations, idea, commission
After Feature Filtering: cheese, burger, menu
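The filtering step can be sketched as below. The category classifier is passed in as a function; `toy_predict`, its lookup table, and the set of restaurant-related category names are hypothetical stand-ins for the Task 1 solution.

```python
# Assumption: which predicted categories count as "related to restaurants"
RESTAURANT_CATEGORIES = {"Restaurants", "Food"}

def filter_features(features, predict_categories):
    """Keep only features whose predicted categories are restaurant related."""
    kept = []
    for feature in features:
        categories = set(predict_categories(feature))
        if categories & RESTAURANT_CATEGORIES:
            kept.append(feature)
    return kept

# Hypothetical stand-in for the Task 1 classifier
toy_lookup = {
    "cheese": ["Food"],
    "burger": ["Restaurants"],
    "menu": ["Restaurants"],
    "commission": ["Financial Services"],
}

def toy_predict(feature):
    return toy_lookup.get(feature, [])

before = ["cheese", "burger", "ones", "menu", "combinations", "idea", "commission"]
after = filter_features(before, toy_predict)
print(after)
```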
13. New Review
[Flow diagram] New Review -> Adjective -> Positive or Negative? -> Negative word within 4-word distance? -> Decision (Recommended or Not Recommended)
Classification of reviews
1. For each sentence, the noun is extracted through feature extraction
2. The corresponding adjective is identified as positive or negative
3. A negation is searched for within a 4-word distance of the adjective
4. A feature is classified as Recommended if the number of positive sentiments associated with it is greater than the number of negative sentiments
All the above steps are repeated for each review
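The steps above can be sketched as follows. The word lists are tiny hypothetical lexicons, the negation window is assumed to extend 4 words on both sides of the adjective, and flipping the polarity when a negation is found is an assumption about what the negation check is used for.

```python
POSITIVE = {"good", "great", "fresh", "nice", "flavorful"}   # hypothetical lexicon
NEGATIVE = {"bland", "chewy", "bad", "stale"}                # hypothetical lexicon
NEGATIONS = {"not", "no", "never", "n't"}                    # hypothetical lexicon

def adjective_polarity(tokens, index, window=4):
    """Return +1, -1 or 0 for the word at index, flipping on a nearby negation."""
    word = tokens[index]
    if word in POSITIVE:
        polarity = 1
    elif word in NEGATIVE:
        polarity = -1
    else:
        return 0
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    context = tokens[lo:index] + tokens[index + 1:hi]
    if any(t in NEGATIONS for t in context):
        polarity = -polarity
    return polarity

def classify_feature(sentences, feature):
    """Count sentiments in sentences mentioning the feature and compare the totals."""
    positive = negative = 0
    for tokens in sentences:
        if feature not in tokens:
            continue
        for i in range(len(tokens)):
            p = adjective_polarity(tokens, i)
            if p > 0:
                positive += 1
            elif p < 0:
                negative += 1
    return "Recommended" if positive > negative else "Not Recommended"

# Hypothetical tokenized review sentences
sentences = [
    ["the", "bread", "was", "fresh"],
    ["the", "bread", "was", "not", "very", "good"],
    ["great", "bread"],
    ["the", "soup", "was", "bland"],
]
print(classify_feature(sentences, "bread"), classify_feature(sentences, "soup"))
```

Here every sentiment word in a sentence is attributed to every feature in that sentence; pairing each adjective with its nearest noun would be a finer-grained variant.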
14. Sample Result
Predicted Feature | Predicted Feature Sentiments | Predicted as Recommended Feature? | Actual Recommended Feature
sub | next, decent | Y | Y
bread | flavorful, bland, fresh, great, nice | Y | Y
peppercorn | nice | Y | Y
stuff-it | chewy | Y | N
sandwich | mayo/mustard/vinegar, east, good, unknown | Y | Y
menu | decent | Y | Y
bacon | real | Y | Y
bite | huge | Y | N
veggies | sorry | N | Y
15. Evaluation
Set 1: Recommended features are obtained from 60% of the reviews of a particular restaurant.
Set 2: The remaining 40% of the reviews are used for testing.
If a recommended feature from Set 1 is also a recommended feature in Set 2, it is a True Positive.
Evaluation Metrics (bar chart): Precision = 0.53, Recall = 0.67
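A minimal sketch of how the two metrics could be computed from the two sets. Only the True Positive is defined above; treating Set 1 as the prediction and Set 2 as the reference (so that unmatched Set 1 features are false positives and unmatched Set 2 features are false negatives) is an assumption, and the example features are hypothetical.

```python
def precision_recall(set1_recommended, set2_recommended):
    """Precision and recall of Set 1 recommendations against Set 2."""
    predicted = set(set1_recommended)
    reference = set(set2_recommended)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical recommended features from the 60% and 40% splits
set1 = ["bread", "sub", "menu", "bite"]
set2 = ["bread", "sub", "veggies"]
precision, recall = precision_recall(set1, set2)
print(precision, recall)
```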
16. Identifying Influential topics
“Identify features from reviews which are relevant city wide
and influence the user’s choice and restaurant’s popularity”
Phases
I. Business classification by city
II. Popular item word-count
III. NLP feature extraction
IV. Feature re-ranking model
V. Model fitness evaluation
17. Business Classification Phase I
Issue: Reviews specify the neighborhood, not the city (~150).
Solution:
1. Identify city based on geo-code through mapping service.
2. K-means clustering
   a. Data point features (Business Id, Latitude, Longitude)
   b. Dissimilarity metric (Euclidean distance)
   c. Cluster count: k (10)
   d. Centroid labeling
3. Data persistence and indexing
   a. Split reviews based on clustered business ids
   b. Save and index for the next phase.
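The clustering and centroid-labeling steps can be sketched in plain Python. This is a minimal Lloyd's-algorithm sketch, not the original implementation: it assumes the business id is carried along only as a label while latitude and longitude are clustered, it seeds the centroids with the first k points, and it uses k = 3 on hypothetical data with approximate city coordinates instead of the k = 10 named above.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm over (latitude, longitude) pairs with Euclidean distance."""
    centroids = list(points[:k])  # assumption: seed with the first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        for i, p in enumerate(points):
            assignment[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return assignment, centroids

def label_centroid(centroid, city_centers):
    """Name a centroid after the closest known city center."""
    return min(city_centers, key=lambda city: math.dist(centroid, city_centers[city]))

# Hypothetical businesses: (business_id, latitude, longitude)
businesses = [
    ("b1", 43.07, -89.40), ("b2", 33.45, -112.07), ("b3", 36.17, -115.14),
    ("b4", 43.10, -89.35), ("b5", 33.50, -112.00), ("b6", 36.10, -115.20),
]
# Approximate city centers, used only to label the centroids
city_centers = {
    "Madison": (43.07, -89.40),
    "Phoenix": (33.45, -112.07),
    "Las Vegas": (36.17, -115.14),
}

points = [(lat, lon) for _, lat, lon in businesses]
assignment, centroids = kmeans(points, k=3)
business_city = {
    bid: label_centroid(centroids[a], city_centers)
    for (bid, _, _), a in zip(businesses, assignment)
}
print(business_city)
```

The resulting business-to-city map is what the persistence step would use to split reviews by city.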
19. Word-count Phase II
Issue: How do we get the influential factors of a city?
Solution: Word count as first pass
Observation: Noise (adjectives, verbs, expressions)
Proposal: Include features derived through NLP
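A first-pass word count of this kind can be sketched with the standard library. The tokenization rule and the sample reviews are hypothetical; the output shows how generic words such as "good" surface alongside real topics, which is the noise noted above.

```python
import re
from collections import Counter

def top_words(reviews, n=5):
    """Count lowercase word occurrences across all reviews and return the top n."""
    counts = Counter()
    for text in reviews:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)

# Hypothetical reviews for one city
reviews = ["Good food, good place", "The food was great"]
top = top_words(reviews, n=2)
print(top)
```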
20. NLP Features Phase III
Issue: Noise reduction and contextual awareness
Solution: Use NLP to identify features in the reviews
Observation: Subtle change in ordering of words
Proposal: Re-ranking the words using metrics from user and review.
25. Mathematical Formula
Features from NLP take word count and context into account but do NOT consider user weight and review weight
[Diagram components] Word list from NLP; Solr Index; Top 1K Relevant Reviews; Program with Mathematical Formula; Scored word
26. User
Fields: Review Count = Urc, Average Stars, Votes = Uv, Friends, Elite = Ue, Yelping Since, Compliments, Fans = Uf
Mathematical Formula
Normalization of votes: $U_{vnorm} = \frac{U_{TotalVotes}}{U_{ReviewCount}} = \frac{U_v}{U_{rc}}$
User weight: $0.25\,U_e + 0.55\,\frac{U_v}{U_{rc}} + 0.20\,U_f$
User | Review Count | Votes
U1 | 10 | 1000
U2 | 1000 | 1000
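The effect of vote normalization on the two example users can be checked with a short sketch. The function names are hypothetical, and the scales of the Elite and Fans values (a 0/1 flag and a raw count here) are assumptions, since the slide does not state them.

```python
def normalized_votes(total_votes, review_count):
    """Votes per review; users with no reviews get 0."""
    if review_count == 0:
        return 0.0
    return total_votes / review_count

def user_weight(elite, total_votes, review_count, fans):
    """0.25 * Ue + 0.55 * (Uv / Urc) + 0.20 * Uf, with Ue and Uf scales assumed."""
    return (0.25 * elite
            + 0.55 * normalized_votes(total_votes, review_count)
            + 0.20 * fans)

# U1 and U2 from the table: same vote total, very different review counts
u1 = normalized_votes(1000, 10)
u2 = normalized_votes(1000, 1000)
print(u1, u2)
```

U1 earns 100 votes per review while U2 earns 1, so the raw vote total alone would hide the difference between them.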
27. User
Review fields: User, Stars = Rs, Text, Date, Votes = Rv
Mathematical Formula
$0.15\,R_v + 0.15\,R_s + 0.7\left(0.25\,U_e + 0.55\,\frac{U_v}{U_{rc}} + 0.20\,U_f\right)$
Stars | Sentiment
1 | Very Strong
2 | Inclined -ve
3 | Ambivalent
4 | Inclined +ve
5 | Very Strong
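A sketch of the combined score follows. The weights come from the formula above; everything else is an assumption. In particular, the slide does not say how the star rating becomes the numeric Rs term, so `star_strength` is a purely hypothetical mapping that mirrors the table (1 and 5 strongest, 3 ambivalent), and the user weight is passed in as an already computed number.

```python
def star_strength(stars):
    """Hypothetical mapping of a 1-5 star rating to sentiment strength in [0, 1]."""
    return abs(stars - 3) / 2

def review_score(review_votes, review_stars_term, user_weight):
    """0.15 * Rv + 0.15 * Rs + 0.7 * (user weight)."""
    return 0.15 * review_votes + 0.15 * review_stars_term + 0.7 * user_weight

# Hypothetical review: 2 votes, 5 stars, written by a user with weight 1.0
score = review_score(review_votes=2, review_stars_term=star_strength(5), user_weight=1.0)
print(score)
```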
30. Madison
Rank | Wordcount List | NLP list - Unformatted | NLP list - Model
1 | food | food | pizza
2 | place | beer | cheese
3 | like | cheese | coffee
4 | from | menu | breakfast
5 | service | curds | burger
6 | go | atmosphere | taco
7 | time | burger | sushi
8 | madison | dane | chocolate
9 | been | drinks | beer
10 | cheese | beers | sandwich
11 | menu | restaurant | curds
12 | bar | table | ice
13 | restaurant | coffee | wine
14 | ordered | pizza | store
15 | love | something | cream
16 | order | sandwich | lunch
17 | chicken | dinner | rolls
18 | beer | lunch | atmosphere
19 | pizza | meal | tea
20 | sauce | sauce | curries
21 | night | burgers | steak
22 | people | drink | noodle
23 | make | bread | spot
24 | staff | server | soup
25 | made | chicken | egg
Phoenix
Rank | Wordcount List | NLP list - Unformatted | NLP list - Model
1 | food | food | donut
2 | good | pizza | bagel
3 | place | burger | cupcake
4 | great | menu | gelato
5 | like | restaurant | gyro
6 | service | fries | yogurt
7 | time | atmosphere | buffet
8 | go | chicken | boba
9 | back | patio | pizza
10 | from | breakfast | sushi
11 | been | table | coffee
12 | love | lunch | sub
13 | ordered | dinner | wing
14 | chicken | meal | crepe
15 | nice | salad | burger
16 | order | cheese | burrito
17 | restaurant | potato | taco
18 | little | server | cookie
19 | menu | something | gluten
20 | pizza | sauce | breakfast
21 | bar | drinks | coffee-shop
22 | delicious | rice | hash-brown
23 | friendly | burgers | cake
24 | first | beer | vegan
25 | pretty | spot | teas
Las Vegas
Rank | Wordcount List | NLP list - Unformatted | NLP list - Model
1 | food | food | donuts
2 | good | beer | bagel
3 | place | sushi | crepe
4 | like | restaurant | pizza
5 | great | meal | oyster
6 | service | menu | yogurt
7 | from | atmosphere | shrimp
8 | time | table | burger
9 | vegas | steak | gelato
10 | go | dinner | sushi
11 | back | server | wings
12 | ordered | salad | sandwich
13 | restaurant | tables | pancake
14 | nice | rib | coffee
15 | been | buffet | burrito
16 | order | dining | curry
17 | chicken | breakfast | buffet
18 | little | waitress | waffle
19 | pretty | shrimp | chocolate
20 | love | something | cake
21 | menu | beers | breakfast
22 | eat | dishes | tea
23 | delicious | dish | cookies
24 | first | restaurants | gluten
25 | people | sauce | pastrami
31. Evaluation Metric: NDCG
Predicted topics for Phoenix under the categories Bakery, Breakfast and Brunch
To capture the strongest sentiments about these topics, we analyzed the top 1000 features for businesses predicted under Bakery, Breakfast and Brunch for the specific city, in this case Phoenix.
Using these features as input for the relevance score, we analyzed the top 30 topics predicted by the model:
$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}} = \frac{18.80190835}{21.8978282} \approx 0.8586$
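For reference, a minimal NDCG sketch is given below. It assumes the common formulation with a log2 rank discount and linear gain; the slide does not state which DCG variant was used, and the relevance values in the example are hypothetical.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with a log2(rank + 1) discount (assumed variant)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical relevance scores of predicted topics, in predicted rank order
predicted_relevances = [1, 2, 3]
print(ndcg(predicted_relevances))

# The ratio reported on the slide
reported = 18.80190835 / 21.8978282
print(round(reported, 4))
```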