Data Mining Specialization Capstone Project - Task 2
University of Illinois at Urbana-Champaign
Data Mining Specialization: Capstone Project
Marco Antonio Gonzalez Junior
September, 2015
Task 2 Report
Cuisine Clustering and Map Construction
1. Visualization of the Cuisine Map
The purpose of this section is to compute and visualise the similarity between cuisines.
The computation is based on their review texts. The output is a similarity matrix where each
cell corresponds to the similarity between a pair of cuisines. The opacity of each cell indicates
the level of similarity: the higher the opacity, the higher the similarity.
A subset of the data was used. The selection criterion was to process only the reviews
about country-specific cuisines. The whole dataset provided contains over one hundred
subjects, and it is not feasible to compare all of them in a single matrix, so only files named
after country-specific subjects were processed. A few examples are American, Argentine,
Brazilian, Greek, Chinese, French, German, Italian, Mexican and Japanese.
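The file selection step can be sketched as follows. This is only an illustration: the report does not give the exact file names or extension, so the `.txt` names, the `COUNTRY_CUISINES` set (limited to the examples listed above) and the `select_country_files` helper are all assumptions.

```python
from pathlib import PurePath

# Assumption: only the cuisines named as examples in the report are listed here;
# the full set used by the author is not given.
COUNTRY_CUISINES = {
    "American", "Argentine", "Brazilian", "Greek", "Chinese",
    "French", "German", "Italian", "Mexican", "Japanese",
}


def select_country_files(filenames):
    """Keep only files whose base name (without extension) is a country cuisine."""
    return sorted(f for f in filenames if PurePath(f).stem in COUNTRY_CUISINES)


# Hypothetical file names, for illustration only.
files = ["Italian.txt", "Pizza.txt", "Bars.txt", "Mexican.txt", "Greek.txt"]
print(select_country_files(files))
```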
The approach to obtaining the similarity was to use Python for topic modelling,
extracting the 10 most important topics of each cuisine through LDA. Each file was processed
to generate a new file with the same name (the name of the cuisine) in another folder. This new
file contains the topic model for that country's cuisine. These files were then compared against
each other, pair by pair, to compute the similarity between them. The technique used
was cosine similarity.
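The pairwise comparison can be sketched as below. The report does not say how each topic-model file was turned into a vector, so this sketch assumes every cuisine is summarised as a dictionary of term weights. The `profiles` values are made-up toy numbers, not results from the project, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(profiles):
    """Return the sorted cuisine names and the matrix of pairwise similarities."""
    names = sorted(profiles)
    matrix = [
        [cosine_similarity(profiles[row], profiles[col]) for col in names]
        for row in names
    ]
    return names, matrix


# Toy term weights (assumed, for illustration only).
profiles = {
    "Italian": {"pizza": 0.5, "pasta": 0.4, "sauce": 0.1},
    "Mexican": {"taco": 0.6, "salsa": 0.3, "sauce": 0.1},
    "Greek": {"gyro": 0.5, "salad": 0.3, "sauce": 0.2},
}
names, matrix = similarity_matrix(profiles)
for name, row in zip(names, matrix):
    print(name, [round(value, 3) for value in row])
```

Each cell of `matrix` lies between 0 and 1 for non-negative weights, so it could be used directly as the opacity of the corresponding cell in the map.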
The results are shown in Figure 1, where higher opacity indicates higher similarity between
the cuisines.
2. Improving the Cuisine Map
The similarity function was varied: the similarity was first computed between individual
reviews, and the resulting values were then aggregated. This improved the accuracy of the
similarity, as shown in Figure 2.
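A minimal sketch of this review-level variant follows. The report does not state how reviews were vectorised or how the values were aggregated, so the bag-of-words vectors and the arithmetic mean used here are assumptions, and all function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-cased, whitespace-tokenised word counts for one review."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def aggregated_similarity(reviews_a, reviews_b):
    """Mean of the cosine similarities over every cross-cuisine pair of reviews."""
    vectors_a = [bag_of_words(r) for r in reviews_a]
    vectors_b = [bag_of_words(r) for r in reviews_b]
    scores = [cosine(a, b) for a in vectors_a for b in vectors_b]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up reviews, for illustration only.
italian = ["great pasta and pizza", "the tomato sauce was fresh"]
mexican = ["great tacos and salsa", "the salsa was fresh"]
print(round(aggregated_similarity(italian, mexican), 3))
```

Comparing every pair of reviews costs time proportional to the product of the two review counts, so a larger dataset might need sampling or another aggregate such as the median.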
Figure 2: Improved visualisation of sample cuisines