HW1 assignment Shivam.pdf

SHIVAM PAWAR 5492083
1
Data and Data Preprocessing
Problem 1: Types of attributes
Q 1) Classify the following attributes as nominal, ordinal, interval, ratio:
(a) Rating of an Amazon product by a person on a scale of 1 to 5 – Ordinal
Ordinal measurement holds importance of the position. So here rating a product will keep
that product in a place as per the rating provided by an individual. It holds importance to
the value. Hence, we should use Ordinal scale of measurement for this case.
(b) The Internet Speed – Interval
The reason for using Interval here is because it does not have a true zero point. Adding the
internet speed of 2 devices does not mean the speed has been increased.
(c) Number of customers in a store – Ratio
For this we have to use Ratio measurement as we are considering the number of people
inside a store where the count can be increased, decreased which further will create a
difference with the change.
(d) UCF Student ID – Nominal
Here the Nominal will count the students and does not hold any importance to the value in
terms of academic position.
(e) Distance – Ratio
For measuring distance, we have to use Ratio scale as the distance can be increased or
decreased. Here for distance, there will be true zero point as adding distance will create
difference.
(f) Letter grade (A, B, C, D) – Ordinal
Here as grading something holds importance of the value, we need to take Ordinal
measurement. A grade is considered higher than B in academic standards.
(g) The temperature at Orlando – Interval
For measuring temperature, we need to use Interval scale of measurement as 0 degrees
does not mean an absence of the property and doubling a degree will not make any
difference.

2
Problem 2: Exploring Data Pre-processing Techniques :
Q1) (Reproduce): Please read, understand, run the code and reproduce the model accuracies.
Please briefly explain whether you can reproduce the classification accuracies of 'Support Vector
Machines', 'KNN', 'Logistic Regression', 'Random Forest', 'Naive Bayes', 'Perceptron', 'Stochastic
Gradient Decent', 'Linear SVC', 'Decision Tree'.
In the given Kaggle Titanic Dataset the workflows they have followed are Classifying,
Correlating, Converting, Completing, Correcting, Creating and Charting in order to process the data
using algorithms. The main aim of this Dataset is to find the survival rate. Initially they have taken
‘PassengerId’, ‘Survived’, ‘Pclass’, ‘Name’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Ticket’, ‘Fare’, ‘Cabin’ and
‘Embarked’ as features to categorize the data for better idea. Later after running few scenarios, they
have removed few features like Fare, Ticket and Cabin as removing them will not create any
difference for finding the Survival rate.
I tried to reproduce the code with the same machine learning models and could be able to
see the same accuracies for all the algorithms for every run except for the Stochastic Gradient
Decent. Because the Stochastic Gradient Decent is an iterative algorithm which takes the data sets
randomly for each iteration. So as the datasets this algorithm will get varied differently each time
making the algorithm to display different score for each run.
Sample accuracies for the algorithms are as below:
Sample 1:
Random Forest - 86.76
Decision Tree - 86.76
KNN - 74.47
Support Vector Machines - 83.84
Logistic Regression - 80.36
Linear SVC - 79.12
Perceptron - 78.00
Naive Bayes - 72.28
Stochastic Gradient Decent - 51.63

3
Sample 2:
KNN - 74.47
Linear SVC - 79.12
Perceptron - 78.00
Naive Bayes - 72.28
Q2) (Improve): Is the data pre-processing process proposed in the Kaggle post the best pre-
processing solution? If yes, please explain why. If not, can you leverage what you learned in the
class and your previous experiences to improve data processing, to obtain better accuracies for all
these classification models? Describe what is your improved data pre-processing, and what are
your improved accuracies?
As stated above in the first question the algorithms and data processing techniques used are
very well written in the given Kaggle Titanic Dataset. As the workflow follows six steps. After
understanding or defining the problem we need to acquire the training and testing data. Then we
have to prepare and cleanse the data. Now we have to analyse the data and explore the data. Now
we need to predict the possible situations/scenarios to solve the problem which will further supplies
the result.
Here in this Dataset the work flow techniques have been started with some features as
mentioned in the above question (‘PassergerId’, ‘Survived’, ‘Pclass’, ‘Name’, ‘Sex’, ‘Age’, ‘SibSp’,
‘Parch’, ‘Ticket’, ‘Fare’, ‘Cabin’ and ‘Embarked’). Later after improvising the data few features like
Ticket, Fare, Cabin, Embarked, Parch have been dropped to increase the accuracy of the algorithms.
Also added other features like AgeBand, IsAlone which improvised the code.
The technique I have used is to change the values in AgeBand which increased the accuracy
from 86.68 to 90.46 for the algorithms. Previously they have given higher difference to the age
values given AgeBand later I have decreased them and added new values which will enable the code
to run faster with high accuracy.
Below is the sample of the accuracy of the algorithms post making the changes in the code.

4
Sample 1:
KNN - 87.09
Perceptron - 80.02
Linear SVC - 78.79
Naive Bayes - 77.89
Sample 2:
KNN - 87.21
Linear SVC - 78.45
Perceptron - 78.23
Naive Bayes - 77.67
In the Sample-2 I have made changes again in AgeBand which further displayed the above accuracy.
Below is the link for Sample-2:
https://www.kaggle.com/code/nikhithakonda/titanic-data-science-solutions/edit
https://www.kaggle.com/code/nikhithakonda/titanic-data-science-solutions

5
Problem 3: Distance/Similarity Measures
Given the four boxes shown in the following figure, answer the following questions. In the
diagram, numbers indicate the lengths and widths and you can consider each box to be a vector of
two real numbers, length and width. For example, the top left box would be (2,1), while the
bottom right box would be (3,3). Restrict your choices of similarity/distance measure to Euclidean
distance and correlation
Which proximity measure would you use to group the boxes based on their shapes (length-width
ratio)?
For measuring the boxes based on the shapes (length-width ratio) we need to use
Corelation. Below is the formula to measure the Corelation.
Where n = 2 as we are comparing two sets
For the values of x and y we need to take the values simultaneously as we compare 2 conditions.
Sigma x and Sigma y would be 3 and 2
Corelation for box 1 and box 2 comes as 0 after calculating the above values in the corelation
formula which is the smallest distance.
Corelation for box 1 and box 3 comes as 1
Corelation for box 1 and box 4 approximately equals to 1(0.9)
Similarly for box 2 and box 4 the corelation is equal to 1
And the boxes 2 and 3 will be the same as 0
Which proximity measure would you use to group the boxes based on their size?
Based on the size of the boxes we need to use Euclidean formula.
If we calculate the values of the boxes like below
Box 1(2,1) ; Box 2(1,1); Box 3(6,3) ; Box 4(3,3)
If we substitute the above values of all boxes in to the formula then we will get the answer.
We will get the smallest distance for box 1 and box 2 as 1. Also for the box 2 and box 4 we will get
the smallest as approximately equal to 3.

HW1 assignment Shivam.pdf

Recommended

Recommended

More Related Content

Similar to HW1 assignment Shivam.pdf

Similar to HW1 assignment Shivam.pdf (20)

Recently uploaded

Recently uploaded (20)

HW1 assignment Shivam.pdf