Data Mining Specialization - Capstone Project - Task 1


Marco Antonio Gonzalez Junior

University of Illinois at Urbana-Champaign - Data Mining Specialization - Capstone Project - Task 1


DATA MINING CAPSTONE
Data Set Exploration (Task 1)
Marco Antonio Gonzalez Junior - September 2015

Introduction
Yelp is a corporation that develops, hosts, and markets the Yelp.com website and
mobile application. Its purpose is to connect people with great local businesses by
publishing crowd-sourced reviews. Yelp users had written about 83 million reviews by
the end of the second quarter of 2015. The goal of this research is to explore the Yelp
dataset to get a sense of what the data looks like and what its characteristics are. This is
accomplished using techniques such as text representation, TF-IDF weighting, topic
modeling, frequent pattern discovery, and visualization.
Topic Modeling
A topic model is a kind of statistical model used to describe the abstract topics that
occur in a collection of documents. Particular words are expected to appear in a
document more or less frequently depending on its subject, so word frequencies are a
strong indication of what the document is about.
MeTA is a data sciences toolkit written in C++. It is a collection of tools and algorithms for
text mining and natural language processing.
In this task, Latent Dirichlet Allocation (LDA) was chosen for topic modeling.
LDA is a generative model that allows sets of observations to be explained by
unobserved groups that explain why some parts of the data are similar. Here the
observations are the words of each review and the unobserved groups are the topics.
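The generative story behind LDA can be sketched in a few lines of Python. This is only an illustration, not the MeTA implementation: the vocabulary, the two hand-made topics, and the names sample_dirichlet and generate_document are all assumptions made for the example. Each document first draws its own topic proportions, then every word is produced by picking a topic and then a word from that topic.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # Per-document topic proportions (the mixture over unobserved groups).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        w = rng.choices(vocab, weights=topic_word[z])[0]
        words.append(w)
    return theta, words

# Toy vocabulary and topics: illustrative values, not taken from the Yelp data.
vocab = ["pizza", "pasta", "waiter", "friendly"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # a "food" topic
    [0.0, 0.0, 0.5, 0.5],  # a "service" topic
]
rng = random.Random(0)
theta, doc = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=8, rng=rng)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topics and the per-document proportions that most plausibly produced them.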
The Yelp dataset, provided as JSON files, was processed using Node.js and
MongoDB. NPM (Node Package Manager), the Mongoose mapper, the Winston logger, and
plain JavaScript were used to run queries and generate the text files to be processed
by the MeTA toolkit.
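The export step can be sketched as follows. The original pipeline used Node.js and MongoDB; this Python version is a stand-in, and it assumes that each input line holds one JSON review object with a "text" field and that the downstream toolkit expects one document per line. The function name reviews_to_line_corpus and the sample records are hypothetical.

```python
import io
import json

def reviews_to_line_corpus(json_lines, out_file):
    """Write one review per output line; return the number of reviews written."""
    count = 0
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        review = json.loads(line)
        # Collapse newlines and repeated whitespace so each review stays on one line.
        text = " ".join(review["text"].split())
        if text:
            out_file.write(text + "\n")
            count += 1
    return count

# In-memory example; real use would pass open file handles instead.
sample = [
    '{"stars": 5, "text": "Great tea.\\nFriendly staff."}',
    '{"stars": 2, "text": "Slow service."}',
]
out = io.StringIO()
written = reviews_to_line_corpus(sample, out)
```

Streaming line by line keeps memory use flat, which matters for a review file with millions of records.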

Task 1.1
LDA was run against the whole dataset (the text of all reviews) to extract twelve topics of
fifteen words each.
Some topics are clearly identifiable, such as quality of service (Topics 4
and 5), kinds of food (Topics 0, 2, 8 and 11), condiments (Topic 9), and places.
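For readers who want to see how topics and their top words can be obtained, here is a minimal sketch using collapsed Gibbs sampling in plain Python. It is an assumption that this resembles the run above: the source does not say which inference method or hyperparameters MeTA was configured with, and the function names, the alpha and beta defaults, and the toy corpus are all made up for illustration.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-topic word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # tokens assigned to topic k
    assignments = []
    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic in proportion to the collapsed conditional.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word

def top_words(topic_word, n):
    """Return up to n of the most frequent words of each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]

# Toy corpus of tokenized reviews (illustrative only).
docs = [
    "pizza pasta pizza sauce".split(),
    "waiter friendly service waiter".split(),
    "pasta sauce pizza".split(),
    "service friendly waiter".split(),
]
topics = lda_gibbs(docs, num_topics=2, iterations=50)
summary = top_words(topics, 3)
```

Calling lda_gibbs with num_topics=12 and top_words with n=15 would mirror the twelve topics of fifteen words reported here, although a corpus of this size calls for an optimized toolkit rather than this sketch.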

Task 1.2
LDA was applied to two subsets of the data.
Beverages
Topics were generated from reviews of two kinds of beverages: juices
and teas.
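One way to build such subsets is to select reviews by the category of the reviewed business. The sketch below assumes that business records carry a "business_id" and a list of "categories" and that review records carry a "business_id" and a "text"; the source does not say how the subsets were actually selected, and the category labels, sample records, and the function name reviews_for_category are hypothetical.

```python
def reviews_for_category(businesses, reviews, category):
    """Return the text of every review whose business lists the given category."""
    matching_ids = {
        b["business_id"]
        for b in businesses
        if category in (b.get("categories") or [])
    }
    return [r["text"] for r in reviews if r["business_id"] in matching_ids]

# Illustrative records only; the labels are assumed category names.
businesses = [
    {"business_id": "b1", "categories": ["Juice Bars & Smoothies", "Food"]},
    {"business_id": "b2", "categories": ["Tea Rooms", "Food"]},
    {"business_id": "b3", "categories": None},
]
reviews = [
    {"business_id": "b1", "text": "Fresh orange juice."},
    {"business_id": "b2", "text": "Lovely jasmine tea."},
    {"business_id": "b3", "text": "Nothing special."},
]
juice_reviews = reviews_for_category(businesses, reviews, "Juice Bars & Smoothies")
tea_reviews = reviews_for_category(businesses, reviews, "Tea Rooms")
```

Each resulting list can then be written out as its own corpus and modeled separately, so the two sets of topics can be compared side by side.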

Cuisines
Topics were generated from reviews of two kinds of cuisines:
Thai and vegan.

Ratings
Topics were generated from reviews with low ratings (three or fewer stars) and high
ratings (above four stars). The topics show a clear difference between high and low
ratings. The highly rated reviews contain words like enjoyed, delicious, fresh, and polite.
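The rating split can be sketched as below, assuming each review record has a numeric "stars" field and a "text" field. The thresholds follow the wording above literally (low is three or fewer, high is strictly above four), so four-star reviews land in neither group; if "above four" was meant to include four stars, high_above would be 3 instead. The function name split_by_rating and the sample records are hypothetical.

```python
def split_by_rating(reviews, low_max=3, high_above=4):
    """Partition review texts into low-rated and high-rated groups."""
    low = [r["text"] for r in reviews if r["stars"] <= low_max]
    high = [r["text"] for r in reviews if r["stars"] > high_above]
    return low, high

# Illustrative records only.
reviews = [
    {"stars": 1, "text": "Rude staff and cold food."},
    {"stars": 3, "text": "Average at best."},
    {"stars": 4, "text": "Pretty good overall."},
    {"stars": 5, "text": "Delicious, fresh, and the staff were polite."},
]
low_reviews, high_reviews = split_by_rating(reviews)
```

Keeping both thresholds as parameters makes it easy to check how sensitive the resulting topics are to where the line between low and high is drawn.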
