SHIVAM PAWAR 5492083
Data and Data Preprocessing
Problem 1: Types of attributes
Q 1) Classify the following attributes as nominal, ordinal, interval, ratio:
(a) Rating of an Amazon product by a person on a scale of 1 to 5 – Ordinal
The ratings have a meaningful order: a rating of 5 ranks a product above a rating of 4.
However, the gap between adjacent ratings is not guaranteed to be equal, so only the
position matters. Hence, the Ordinal scale of measurement applies in this case.
(b) The Internet Speed – Ratio
Internet speed has a true zero point: a speed of 0 Mbps means no data is transferred at all.
Ratios are also meaningful, since a 20 Mbps connection is twice as fast as a 10 Mbps one.
Hence, it is a Ratio attribute rather than an Interval one.
(c) Number of customers in a store – Ratio
The number of people inside a store has a true zero (an empty store), and both differences
and ratios between counts are meaningful: 20 customers is twice as many as 10. Hence, we
use the Ratio scale.
(d) UCF Student ID – Nominal
A student ID only identifies (labels) a student. Its value carries no order or magnitude,
for example in terms of academic position.
(e) Distance – Ratio
Distance has a true zero point (no distance at all), and ratios are meaningful: 10 km is
twice as far as 5 km. Hence, we use the Ratio scale.
(f) Letter grade (A, B, C, D) – Ordinal
Letter grades have a meaningful order: an A grade is considered higher than a B in academic
standards. The gaps between grades are not necessarily equal, so the Ordinal scale applies.
(g) The temperature at Orlando – Interval
Temperature in Celsius or Fahrenheit is on an Interval scale: 0 degrees does not mean an
absence of the property, so differences are meaningful but ratios are not. For example,
20 degrees is not twice as hot as 10 degrees.
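The difference between interval and ratio attributes can be checked numerically. The following is a minimal sketch, not part of the assignment: the helper names and the sample values (10 and 20) are assumptions chosen for illustration. It shows that the ratio of two temperatures changes when the unit changes, while the ratio of two distances does not.

```python
import math

def celsius_to_fahrenheit(c):
    """Convert a Celsius reading to Fahrenheit."""
    return c * 9.0 / 5.0 + 32.0

def km_to_miles(km):
    """Convert kilometres to miles (approximate conversion factor)."""
    return km * 0.621371

# Interval scale: the ratio of two temperatures depends on the unit.
ratio_celsius = 20.0 / 10.0
ratio_fahrenheit = celsius_to_fahrenheit(20.0) / celsius_to_fahrenheit(10.0)

# Ratio scale: the ratio of two distances is the same in any unit.
ratio_km = 20.0 / 10.0
ratio_miles = km_to_miles(20.0) / km_to_miles(10.0)

print("temperature ratios:", ratio_celsius, ratio_fahrenheit)
print("distance ratios:", ratio_km, ratio_miles)
```

Because the Celsius and Fahrenheit zero points are arbitrary, "twice as hot" has no unit-independent meaning, which is what places temperature on the interval scale.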
Problem 2: Exploring Data Pre-processing Techniques:
Q1) (Reproduce): Please read, understand, run the code and reproduce the model accuracies.
Please briefly explain whether you can reproduce the classification accuracies of 'Support Vector
Machines', 'KNN', 'Logistic Regression', 'Random Forest', 'Naive Bayes', 'Perceptron', 'Stochastic
Gradient Descent', 'Linear SVC', 'Decision Tree'.
The given Kaggle Titanic notebook follows a workflow of Classifying, Correlating, Converting,
Completing, Correcting, Creating and Charting to prepare the data for the algorithms. The aim
is to predict passenger survival. The notebook starts with ‘PassengerId’, ‘Survived’, ‘Pclass’,
‘Name’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Ticket’, ‘Fare’, ‘Cabin’ and ‘Embarked’ as the columns
of the dataset. After exploring a few scenarios, it removes features such as Fare, Ticket and
Cabin, because removing them makes no difference to predicting survival.
I reran the code with the same machine learning models and saw the same accuracies on every
run for all algorithms except Stochastic Gradient Descent. Stochastic Gradient Descent is an
iterative algorithm that visits the training samples in a random order on each iteration.
Because that order changes from run to run when no random seed is fixed, the algorithm
reports a different score each time.
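The run-to-run variation of Stochastic Gradient Descent can be illustrated with a small sketch. This is not the notebook's code: the toy data, the function names and the one-feature logistic model are all assumptions made for illustration. It shows that fixing the seed of the shuffle makes the result repeatable, while different seeds may lead to different weights and scores.

```python
import math
import random

# Toy one-feature data set (hypothetical, not the Titanic data).
XS = [-2.0, -1.5, -0.5, 0.2, 0.4, 1.0, 1.8, 2.5]
YS = [0, 0, 1, 0, 1, 1, 1, 1]

def train_sgd(xs, ys, seed, epochs=5, lr=0.1):
    """Fit a logistic model with SGD; the sample order depends on the seed."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # random visiting order, the source of variation
        for i in order:
            p = 1.0 / (1.0 + math.exp(-(w * xs[i] + b)))
            grad = p - ys[i]  # gradient of the log loss for one sample
            w -= lr * grad * xs[i]
            b -= lr * grad
    return w, b

def accuracy(xs, ys, w, b):
    """Percentage of samples classified correctly, rounded to 2 decimals."""
    correct = sum((w * x + b > 0) == (y == 1) for x, y in zip(xs, ys))
    return round(100.0 * correct / len(xs), 2)

for seed in (1, 2, 3):
    w, b = train_sgd(XS, YS, seed)
    print(seed, round(w, 4), round(b, 4), accuracy(XS, YS, w, b))
```

In scikit-learn the same effect is obtained by passing a fixed `random_state` to `SGDClassifier`, which should make the reported accuracy repeatable across runs.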
Sample accuracies for the algorithms are as below:
Sample 1:
Random Forest - 86.76
Decision Tree - 86.76
KNN - 74.47
Support Vector Machines - 83.84
Logistic Regression - 80.36
Linear SVC - 79.12
Perceptron - 78.00
Naive Bayes - 72.28
Stochastic Gradient Descent - 51.63
Sample 2:
Random Forest - 86.76
Decision Tree - 86.76
KNN - 74.47
Support Vector Machines - 83.84
Logistic Regression - 80.36
Linear SVC - 79.12
Stochastic Gradient Descent - 78.68
Perceptron - 78.00
Naive Bayes - 72.28
Q2) (Improve): Is the data pre-processing process proposed in the Kaggle post the best pre-
processing solution? If yes, please explain why. If not, can you leverage what you learned in the
class and your previous experiences to improve data processing, to obtain better accuracies for all
these classification models? Describe what is your improved data pre-processing, and what are
your improved accuracies?
As stated in the first question, the algorithms and data processing techniques in the given
Kaggle Titanic notebook are very well written. The workflow follows six steps. After
understanding or defining the problem, we acquire the training and testing data. We then
prepare and cleanse the data, and analyse and explore it. Finally, we model and predict to
solve the problem, which supplies the result.
The notebook starts with the features mentioned in the previous question (‘PassengerId’,
‘Survived’, ‘Pclass’, ‘Name’, ‘Sex’, ‘Age’, ‘SibSp’, ‘Parch’, ‘Ticket’, ‘Fare’, ‘Cabin’ and
‘Embarked’). After refining the data, features such as Ticket, Fare, Cabin, Embarked and
Parch are dropped to increase the accuracy of the algorithms. New features such as AgeBand
and IsAlone are also added, which improve the results.
The technique I used is to change the band boundaries in AgeBand, which increased the best
accuracy from 86.76 to 90.46. The original notebook used wide intervals between the age
values that define AgeBand. I narrowed them and added new bands, which allows the code to
run faster with higher accuracy.
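The idea of changing the AgeBand boundaries can be sketched as follows. The exact values used in the experiment are not stated here, so both edge lists are assumptions: `WIDE_EDGES` assumes 16-year-wide bands in the original notebook, and `NARROW_EDGES` is a purely hypothetical finer split. The helper `age_to_band` is likewise an illustrative name.

```python
from bisect import bisect_left

# Upper edges of the age bands (both lists are assumptions, see lead-in).
WIDE_EDGES = [16, 32, 48, 64]
NARROW_EDGES = [8, 16, 24, 32, 40, 48, 56, 64]

def age_to_band(age, edges):
    """Map an age to an ordinal band: ages <= edges[0] give 0, and so on.

    Ages above the last edge fall into the final band, len(edges).
    """
    return bisect_left(edges, age)

for age in (5, 16, 17, 30, 47, 80):
    print(age, age_to_band(age, WIDE_EDGES), age_to_band(age, NARROW_EDGES))
```

Narrower bands keep more of the age information in the ordinal feature. Tree-based models in particular can fit the training data more closely with them, so it is worth checking the gain on held-out data as well as on the training set.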
Below are sample accuracies of the algorithms after making the changes in the code.
Sample 1:
Random Forest - 90.46
Decision Tree - 90.46
KNN - 87.09
Support Vector Machines - 85.63
Perceptron - 80.02
Linear SVC - 78.79
Logistic Regression - 78.45
Naive Bayes - 77.89
Stochastic Gradient Descent - 74.97
Sample 2:
Random Forest - 89.56
Decision Tree - 89.56
KNN - 87.21
Support Vector Machines - 85.07
Linear SVC - 78.45
Logistic Regression - 78.23
Perceptron - 78.23
Stochastic Gradient Descent - 77.89
Naive Bayes - 77.67
In Sample 2 I changed AgeBand again, which produced the accuracies above.
Below is the link for Sample 2:
https://www.kaggle.com/code/nikhithakonda/titanic-data-science-solutions
Problem 3: Distance/Similarity Measures
Given the four boxes shown in the following figure, answer the following questions. In the
diagram, numbers indicate the lengths and widths and you can consider each box to be a vector of
two real numbers, length and width. For example, the top left box would be (2,1), while the
bottom right box would be (3,3). Restrict your choices of similarity/distance measure to Euclidean
distance and correlation.
Which proximity measure would you use to group the boxes based on their shapes (length-width
ratio)?
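To group the boxes by shape (length-width ratio) we need to use correlation, because it is
unaffected by scaling: boxes with the same length-width ratio are perfectly correlated
regardless of their size. The formula for the correlation is:

$$ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left[n \sum x^2 - (\sum x)^2\right]\left[n \sum y^2 - (\sum y)^2\right]}} $$

Here n = 2, because each box is a vector of two values (length and width), and x and y are
the two boxes being compared. For example, for box 1 (2,1) and box 2 (1,1), Sigma x and
Sigma y are 3 and 2.
The correlation for box 1 and box 3 is 1: Sigma xy = 15, Sigma x = 3, Sigma y = 9,
Sigma x^2 = 5 and Sigma y^2 = 45, so r = (30 - 27) / sqrt(1 * 9) = 1. These two boxes have
the same 2:1 shape.
For every pair that involves box 2 (1,1) or box 4 (3,3), both the numerator and the
denominator are 0, because a square has equal length and width and therefore zero variance.
The correlation for box 1 and box 2, box 1 and box 4, box 2 and box 3, box 2 and box 4, and
box 3 and box 4 is therefore 0/0, which is undefined rather than exactly 0 or 1.
Grouping by shape therefore places boxes 1 and 3 together, and the two squares, boxes 2 and
4, which share a 1:1 ratio, form the other group.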
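The correlation values can be checked with a short sketch. The function name `pearson` and the `BOXES` dictionary are illustrative assumptions, and the box coordinates are the ones used in this answer. Note that with only two components per box, any box with equal length and width has zero variance, so the formula yields 0/0. The sketch returns `None` in that case instead of a number.

```python
import math

# Box number -> (length, width), as used in this answer.
BOXES = {1: (2, 1), 2: (1, 1), 3: (6, 3), 4: (3, 3)}

def pearson(x, y):
    """Pearson correlation of two equal-length vectors, or None if undefined."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    denom_sq = (n * sxx - sx * sx) * (n * syy - sy * sy)
    if denom_sq == 0:
        return None  # at least one vector has zero variance
    return (n * sxy - sx * sy) / math.sqrt(denom_sq)

for i in sorted(BOXES):
    for j in sorted(BOXES):
        if i < j:
            print(f"box {i} vs box {j}:", pearson(BOXES[i], BOXES[j]))
```

Only the pair of boxes 1 and 3 gives a defined value here, which reflects how little information a two-component vector carries for correlation.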
Which proximity measure would you use to group the boxes based on their size?
To group the boxes by size we need to use the Euclidean distance,
d = sqrt((x1 - x2)^2 + (y1 - y2)^2), because it grows with the difference in length and width.
The boxes have the following values:
Box 1(2,1) ; Box 2(1,1); Box 3(6,3) ; Box 4(3,3)
Substituting these values into the formula gives the distance for each pair of boxes.
The smallest distance is 1, between box 1 and box 2. The distance between box 2 and box 4 is
approximately 3 (sqrt(8), about 2.83).
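The pairwise Euclidean distances can be computed with a few lines of Python. This is a minimal sketch using the box coordinates listed above. The `BOXES` dictionary and variable names are illustrative, and `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations

# Box number -> (length, width), as listed above.
BOXES = {1: (2, 1), 2: (1, 1), 3: (6, 3), 4: (3, 3)}

# Euclidean distance for every unordered pair of boxes.
distances = {
    (i, j): math.dist(BOXES[i], BOXES[j])
    for i, j in combinations(sorted(BOXES), 2)
}

# The pair with the smallest distance is the most similar in size.
closest_pair = min(distances, key=distances.get)

for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(d, 2))
print("closest pair:", closest_pair)
```

Sorting the pairs by distance makes it easy to see which boxes would be merged first by a size-based grouping.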
