Online news popularity analysis
1. WEB ANALYTICS -
ONLINE NEWS POPULARITY
TEAM – 11
KRUTIKA DEDHIA
KINJAL GADA
ANKUR VORA
ADVANCES IN DATA SCIENCES AND ARCHITECTURE
- PROF. SRIKANTH KRISHNAMURTHY
2. INTRODUCTION
• The dataset summarizes a set of features about articles published by Mashable,
a well-known news website, over a period of two years.
• The objective is to predict, from these features, the number of shares an article
will receive and whether it would be popular on the internet.
3. GOALS
• Create and evaluate regression, classification and clustering models in Microsoft
Azure Machine Learning Studio.
• Deploy the models as a web service to generate a REST API.
• Build an interactive web interface to predict the results.
4. DATASET
• Data Source : UCI ML Repository
https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
• Number of attributes: 61
• Number of records: 39,645
• Dependent variable: Number of shares
5. DATA MODIFICATION
• Type of Data : 1 – business, 2 – lifestyle, 3 – entertainment, 4 – social media, 5 –
technology, 6 – world
• Extracted the date from the URL column.
• Day of week : 0 – Sunday, 1 – Monday, 2 – Tuesday, 3 – Wednesday, 4 –
Thursday, 5 – Friday, 6 – Saturday
• Web Scraping : Topics, Channel, Author
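The date extraction and day-of-week encoding above can be sketched as follows. This is a minimal sketch, assuming the article URLs carry the publication date as `/YYYY/MM/DD/` in the path; the function names are hypothetical and not taken from the project.

```python
import re
from datetime import date

# Assumed URL layout: http://mashable.com/YYYY/MM/DD/slug/
DATE_IN_URL = re.compile(r"/(\d{4})/(\d{2})/(\d{2})/")


def publication_date(url):
    """Return the date embedded in the URL path, or None if there is none."""
    match = DATE_IN_URL.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def day_of_week_code(d):
    """Map a date to the encoding used here: 0 = Sunday ... 6 = Saturday."""
    # date.weekday() uses 0 = Monday ... 6 = Sunday, so shift by one.
    return (d.weekday() + 1) % 7


example = "http://mashable.com/2013/01/07/amazon-instant-video-browser/"
published = publication_date(example)
print(published, day_of_week_code(published))
```

The shift by one is needed because Python numbers weekdays from Monday, while the encoding above starts at Sunday.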
6. PROCESS
• Created training experiments for regression, classification and clustering in Azure ML.
• Created predictive experiments for the trained models.
• Deployed the models as a web service and generated a REST API.
• Designed the UI using Java Spring MVC, HTML, Bootstrap and Ajax, with user input
validation.
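A client for the deployed web service might build its request along these lines. This is only a sketch: it assumes the request-response JSON layout commonly used by Azure ML Studio (classic) web services, and the URL, API key, input name `input1` and the two column names are placeholders chosen for illustration. The actual values should be taken from the API help page of the deployed service.

```python
import json
import urllib.request

# Hypothetical placeholders: the real values come from the deployed service.
SERVICE_URL = "https://example.invalid/execute?api-version=2.0"
API_KEY = "YOUR-API-KEY"


def build_request(column_names, row):
    """Build (without sending) a POST request that scores one article."""
    body = {
        "Inputs": {
            # "input1" is an assumed input name for the web service.
            "input1": {
                "ColumnNames": list(column_names),
                "Values": [list(row)],
            }
        },
        "GlobalParameters": {},
    }
    return urllib.request.Request(
        SERVICE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + API_KEY,
        },
        method="POST",
    )


# Illustrative feature subset; a real call would send every model input.
request = build_request(["n_tokens_title", "num_imgs"], ["12", "1"])
print(request.get_method(), json.loads(request.data)["Inputs"]["input1"])
```

Sending the request (for example with `urllib.request.urlopen`) is left out so that the sketch runs without network access or credentials.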
8. REGRESSION MODELS
• Used Azure ML regression modules:
• Decision Forest, Neural Network, Poisson Regression and Boosted Decision Tree
• Best Model: Decision Forest, based on the lowest RMSE value
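Choosing the model with the lowest RMSE can be illustrated with a short sketch. The share counts, predictions and model names below are made up for illustration; they are not results from the project.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up share counts and predictions from two hypothetical models.
actual = [1000, 2000, 3000]
scores = {
    "model_a": rmse(actual, [1100, 1900, 3200]),
    "model_b": rmse(actual, [1500, 2500, 2500]),
}

# The model with the smallest RMSE is preferred.
best = min(scores, key=scores.get)
print(best, round(scores[best], 2))
```

Because RMSE squares each error before averaging, a few large misses on viral articles can dominate the score.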
10. CLASSIFICATION MODELS
• Used Azure ML classification modules: Two-Class Decision Forest, Two-Class
Neural Network and Two-Class Boosted Decision Tree
• Added attribute isPopular :
• Shares > 1400 : more popular
• Shares <= 1400 : less popular
• Best Model : Two-Class Boosted Decision Tree, based on the highest accuracy and
AUC values
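The isPopular attribute and the accuracy metric can be sketched as below. The sketch assumes that articles with more than 1400 shares are the popular class (label 1); the share counts and predictions are made up for illustration.

```python
POPULARITY_THRESHOLD = 1400  # threshold on the number of shares


def is_popular(shares, threshold=POPULARITY_THRESHOLD):
    """Return 1 if the article exceeds the threshold, else 0 (assumed labeling)."""
    return 1 if shares > threshold else 0


def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Made-up share counts and classifier outputs.
shares = [593, 1400, 1401, 5200]
labels = [is_popular(s) for s in shares]
predictions = [0, 1, 1, 1]

print(labels, accuracy(labels, predictions))
```

Accuracy needs only hard labels, while AUC is computed from the classifier's scored probabilities, so the two metrics can rank models differently.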
12. CLUSTERING MODELS
• Used K-means Clustering
• Number of clusters: 3 (k = 3).
• Determines the distance of each article from the cluster centroids, based on a
few parameters.
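The distance-to-centroid step can be sketched as follows. This shows only the assignment step of K-means, not the full training loop, and it assumes Euclidean distance; the centroids and the two-feature article vector are made up for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_cluster(point, centroids):
    """Return (cluster index, distance) for the closest centroid."""
    distances = [euclidean(point, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]


# Three made-up centroids (k = 3) over two hypothetical features.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
article = (4.0, 6.0)

cluster, distance = nearest_cluster(article, centroids)
print(cluster, round(distance, 3))
```

In practice the features would usually be scaled first, so that one parameter with a large numeric range does not dominate the distance.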
18. CHALLENGES
• Formatting the data after web scraping.
• Understanding variables such as keywords and subjectivity.
• Finding relationships between variables and selecting features for modelling.