The document provides an introduction to the Orange data mining and visualization tool. It discusses what data mining is and its major tasks, including classification, clustering, deviation detection, forecasting, and description. It also lists major industries that use data mining, such as retail, finance, education, and healthcare. The document then introduces Orange, describing it as open-source, component-based software that supports data mining through visual programming, so no programming is required, with Python scripting available as an option. It provides a link to download Orange and walks through loading a heart disease dataset and exploring it using various algorithms such as KNN, Naive Bayes, decision trees, random forests, logistic regression, and neural networks. Performance results are compared for the different algorithms.
3. What is Data Mining?
• process of analyzing data from different perspectives
• summarizing it into useful information
• information that can be used to increase revenue,
cut costs, or both.
• data mining helps analysts recognize significant
information, facts, relationships, trends, patterns,
exceptions, anomalies that might otherwise go
unnoticed.
4. Major Data Mining Tasks
1) Classification: predicting an item class
2) Clustering: descriptive, finding groups of items
3) Deviation detection: predictive, finding changes
4) Forecasting: predicting a parameter value
5) Description: describing a group
6) Link analysis: finding relationships and associations
5. Major Industries Using Data Mining
• retail
• finance
• education
• healthcare
• agriculture
• manufacturing
• transportation
• aerospace
6. Why Orange?
Open Source
Component based
No programming
Data visualization
Platform independent software
Allows clustering and classification
Data mining through visual programming
and Python scripting
Introduction
Orange is component-based visual
programming software for data mining,
machine learning and data analysis.
Supports communication between data
scientists and domain experts.
You can get the Orange software from this link:
https://orange.biolab.si/getting-started/
10. Dataset: Heart Disease
ATTRIBUTES
● Narrowing diameter
● Cholesterol
● Chest pain
● Rest ECG
● Fasting blood sugar
● Max HR
● Age, gender and more
● Has 303 instances
● 13 attributes
● Categorical class with 2
values (0,1)
● In .csv format
● Source: preloaded
datasets of Orange.
11. Dataset: How do the following factors cause heart disease?
● Age: the risk of heart disease increases with age greater than 65.
● Fatty deposits called plaques also collect along your artery walls,
● slowing the blood flow from the heart
● and causing coronary heart disease.
● Gender: heart disease is the leading cause of death for both men and women.
12. ● Angina: chest pain or discomfort caused when your heart muscle doesn't
get enough oxygen-rich blood.
● Cholesterol: when there is too much cholesterol in your blood,
● it builds up in the walls of your arteries,
● causing a process called atherosclerosis (a form of heart disease).
● Diameter narrowing:
● heart disease is caused by the narrowing or blockage of the coronary arteries.
● Target attribute (0,1)
21. KNN (k-nearest neighbors)
KNN is a non-parametric method used for classification and regression.
Requires three things:
The set of stored records.
A distance metric to compute the distance between records.
The value of k, the number of nearest neighbors to retrieve for an unknown record.
Math equation: $d(p, q) = \sqrt{\sum_{i} (p_i - q_i)^2}$
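The three ingredients above (stored records, a distance metric, and k) can be sketched in plain Python. This is a minimal illustration, not Orange's implementation: the function names `euclidean` and `knn_predict` and the toy records are assumptions made up for this example.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Euclidean distance d(p, q) between two equal-length numeric records."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(records, labels, unknown, k=3):
    """Classify `unknown` by majority vote among its k nearest stored records."""
    # Pair each stored record with its label and sort by distance to the unknown record.
    by_distance = sorted(zip(records, labels),
                         key=lambda pair: euclidean(pair[0], unknown))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy stored records (made up for illustration): two features, class 0 or 1.
records = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(records, labels, (1.2, 1.4), k=3))
```

An odd k is a common choice for two-class problems because it avoids tied votes.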
26. Decision tree
Used to visually and explicitly represent decisions and decision making.
One of the predictive modelling approaches used in
statistics, data mining and machine learning.
$\mathrm{Entropy}(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
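As a rough illustration of the entropy measure, the sketch below computes Entropy(D) from a list of class labels, where p_i is the fraction of records in class i. The function name `entropy` is an assumption for this example. Entropy-based trees choose the split that lowers this value the most.

```python
import math
from collections import Counter

def entropy(class_labels):
    """Entropy(D) = -sum(p_i * log2(p_i)) over the classes present in D."""
    total = len(class_labels)
    counts = Counter(class_labels)
    # Each class contributes p_i * log2(p_i), where p_i = count / total.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(entropy([0, 0, 1, 1]))  # evenly mixed two-class set
print(entropy([1, 1, 1, 1]))  # pure set
```

An evenly mixed two-class set has entropy 1 bit, and a set containing a single class has entropy 0.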
34. Naive Bayes
Also known as the Naive Bayes classifier.
Attributes are assumed to be statistically independent of one another.
In practice there will be some correlation between features for a given class,
but unlike other classifiers, Naive Bayes
explicitly models the features as conditionally independent given the class.
$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$
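A small worked sketch of Bayes' theorem together with the naive independence assumption follows. The function names and all probabilities are made up for illustration and do not come from the heart disease dataset.

```python
def naive_likelihood(feature_probs):
    """Naive assumption: P(X|H) is the product of the per-feature P(x_i|H)."""
    result = 1.0
    for p in feature_probs:
        result *= p
    return result

def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical numbers: H = "has disease", X = two observed feature values.
prior_h = 0.5
like_h = naive_likelihood([0.8, 0.5])      # P(X|H)
like_not_h = naive_likelihood([0.2, 0.5])  # P(X|not H)

# P(X) by the law of total probability over H and not H.
evidence = like_h * prior_h + like_not_h * (1 - prior_h)

print(posterior(prior_h, like_h, evidence))
```

With these numbers the posterior P(H|X) comes out at about 0.8, so the classifier would pick H over its alternative.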
39. Random Forest
It is a flexible and simple algorithm.
The Random Forest algorithm helps avoid the overfitting problem.
Used for identifying the most important features from the training dataset.
It can be used for both classification and regression tasks.
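The idea behind a random forest (many trees, each trained on a random resample of the data with a randomly chosen feature, combined by voting) can be sketched as below. This is a deliberately simplified assumption: each "tree" is a one-split decision stump rather than a full tree, the function names and toy data are made up, and it is not how Orange implements the algorithm.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature):
    """Fit a one-split tree on one feature; returns (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    best = (len(rows) + 1, None, majority, majority)  # (errors, threshold, left, right)
    values = sorted(set(row[feature] for row in rows))
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
        right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return feature, best[1], best[2], best[3]

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    if threshold is None:  # the sample had a single distinct value: no split possible
        return left_label
    return left_label if row[feature] <= threshold else right_label

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample using one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random feature per tree
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Classification by majority vote over all trees."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Toy data (made up): two features, both of which separate class 0 from class 1.
rows = [(1, 1), (2, 2), (3, 1), (4, 2), (10, 11), (11, 12), (12, 11), (13, 12)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = train_forest(rows, labels)
print(predict_forest(forest, (0, 0)), predict_forest(forest, (20, 20)))
```

Averaging many trees trained on different resamples is what reduces the variance of a single tree, which is why the method is less prone to overfitting.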
44. Logistic Regression
Used to assign observations to a discrete set of classes.
Logistic regression can be binomial, ordinal or multinomial.
Binary (Pass/Fail)
Multi (Cats, Dogs, Sheep)
Ordinal (Low, Medium, High)
Can view probability scores underlying the model’s classifications.
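The probability scores mentioned above come from passing a weighted sum of the inputs through the logistic (sigmoid) function. The sketch below shows the binary case. The weights, bias, and function names are hypothetical values chosen for illustration, not a fitted model.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, x):
    """Probability that observation x belongs to class 1."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

def predict_class(weights, bias, x, threshold=0.5):
    """Assign the discrete class by thresholding the probability score."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0

# Hypothetical model parameters, for illustration only.
weights, bias = [1.0, -1.0], 0.0
print(predict_proba(weights, bias, [3.0, 1.0]), predict_class(weights, bias, [3.0, 1.0]))
```

The threshold of 0.5 is a common default. Moving it trades false positives against false negatives.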
48. Neural Network
Neural networks are learning algorithms.
They interpret sensory data
through a kind of machine perception, labeling or clustering raw input.
They consist of different layers for analyzing and learning from data.
Math equation:
$f(X) = b + \sum_{i} w_i x_i$
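The weighted sum above is what a single neuron computes before its activation function is applied. The sketch below shows that sum and one layer built from it. The ReLU activation, the function names, and the weights are assumptions for illustration, since the slide does not specify them.

```python
def weighted_sum(weights, bias, x):
    """f(X) = b + sum_i w_i * x_i for one neuron."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def relu(z):
    """A common activation function (an assumption here): max(0, z)."""
    return max(0.0, z)

def layer(neurons, x):
    """Apply every neuron in a layer to x; `neurons` is a list of (weights, bias)."""
    return [relu(weighted_sum(weights, bias, x)) for weights, bias in neurons]

# Hypothetical weights and input, for illustration only.
x = [4.0, 1.0]
print(weighted_sum([0.5, -1.0], 2.0, x))
print(layer([([1.0, 1.0], -5.0), ([0.5, -1.0], 2.0)], x))
```

Stacking several such layers, each feeding its outputs to the next, gives the layered structure described above.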
58. Projects:
1. Traffic Communication Data Analysis
2. Job Scam Data Analysis
3. Email Communication Data Analysis
4. Social Media Data Analysis
5. Healthcare Data Analysis