Orange Tool Data Mining and Visualization

Dr Mithileysh Sathiyanarayanan

Orange Tool
Let’s Learn
Orange Data Mining and
Data Visualization Tool

What is Data Mining?
• The process of analyzing data from different perspectives and summarizing it into useful information.
• That information can be used to increase revenue, cut costs, or both.
• Data mining helps analysts recognize significant information, facts, relationships, trends, patterns, exceptions, and anomalies that might otherwise go unnoticed.

Major Data Mining Tasks
1) Classification: predicting an item class
2) Clustering: descriptive, finding groups of items
3) Deviation detection: predictive, finding changes
4) Forecasting: predicting a parameter value
5) Description: describing a group
6) Link analysis: finding relationships and associations

Major Industries Using Data Mining
• retail
• finance
• education
• healthcare
• agriculture
• manufacturing
• transportation
• aerospace

Why Orange?
• Open source
• Component based
• No programming required
• Data visualization
• Platform-independent software
• Allows clustering and classification
• Data mining through visual programming and Python scripting

Introduction
• Orange is component-based visual programming software for data mining, machine learning, and data analysis.
• Supports communication between data scientists and domain experts.
You can get the Orange software from this link:
https://orange.biolab.si/getting-started/

Getting Started With ORANGE!!

Dataset: Heart Disease
ATTRIBUTES
● Narrowing diameter
● Cholesterol
● Chest pain
● Rest ECG
● Fasting blood sugar
● Max HR
● Age, gender and more

● Has 303 instances
● 13 attributes
● Categorical class with 2 values (0, 1)
● In .csv format
● Source: preloaded datasets of Orange

Dataset: How do the following factors cause heart disease?
● Age: the risk of heart disease increases with age, particularly beyond 65.
● Fatty deposits called plaques collect along the artery walls and slow the blood flow from the heart, causing coronary heart disease.
● Gender: heart disease is a leading cause of death for both men and women.

● Angina: chest pain or discomfort caused when your heart muscle doesn't get enough oxygen-rich blood.
● Cholesterol: when there is too much cholesterol in your blood, it builds up in the walls of your arteries, causing a process called atherosclerosis (a form of heart disease).
● Diameter narrowing: heart disease is caused by the narrowing or blockage of the coronary arteries.
● Target attribute (0, 1)

Loading data file into data table:

EDA: Exploratory data analysis
● Distributions

Algorithms:
● KNN
● Naïve Bayes
● Decision Tree
Selected Algorithms:
● Neural Network
● Random Forest
● Logistic Regression

Experimental Setup
This is how we drag and drop the widgets and implement our algorithms.

KNN (k-nearest neighbors)
KNN is a non-parametric method used for classification and regression.
It requires three things:
• The set of stored records.
• A distance metric to compute the distance between records.
• The value of k, the number of nearest neighbors to retrieve for an unknown record.
Math equation: d(p, q) = \sqrt{\sum_{i} (p_i - q_i)^2}
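The three ingredients listed for KNN (stored records, a distance metric, and k) can be sketched in plain Python. This is a minimal illustration and not how Orange implements its kNN widget; the function names and the toy records are assumptions made up for the example.

```python
import math
from collections import Counter


def euclidean_distance(p, q):
    """d(p, q) = sqrt(sum_i (p_i - q_i)^2) for two equal-length records."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(records, labels, query, k=3):
    """Label an unknown record by majority vote among its k nearest stored records."""
    # Rank the stored records by their distance to the query record.
    ranked = sorted(
        zip(records, labels),
        key=lambda pair: euclidean_distance(pair[0], query),
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy, made-up records with two numeric features each (not the heart disease data).
records = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (8.0, 6.0)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(records, labels, query=(2.0, 2.0), k=3))
```

An odd k is a common choice for two-class problems because it avoids tied votes.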

Decision tree
• Used to visually and explicitly represent decisions and decision making.
• A predictive modelling approach used in statistics, data mining and machine learning.
Entropy(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
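The entropy measure used when growing a decision tree can be computed directly from class labels. The sketch below assumes p_i is the fraction of instances in D that belong to class i, out of m classes; the function name is chosen for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy(D) = -sum_{i=1..m} p_i * log2(p_i), with p_i the share of class i in D."""
    total = len(labels)
    counts = Counter(labels)
    # Each class contributes -p_i * log2(p_i); a pure set contributes 0 overall.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# A perfectly balanced two-class set has the maximum entropy of 1 bit.
print(entropy([0, 0, 1, 1]))
# A mostly pure set has lower entropy.
print(entropy([0, 1, 1, 1]))
```

A tree learner would typically pick the split that reduces this value the most across the resulting subsets.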

Naïve Bayes
• Also known as the Naive Bayes classifier.
• Assumes attributes are statistically independent of one another.
• In practice there will be some correlation between features for a given class, but unlike other classifiers, Naive Bayes explicitly models the features as conditionally independent given the class.
P(H|X) = \frac{P(X|H) P(H)}{P(X)}
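Bayes' theorem and the independence assumption can be worked through with a small numeric example. The probabilities below are made up purely for illustration, and the helper names are hypothetical; the sketch only shows how P(X|H) is formed as a product of per-feature likelihoods and then turned into a posterior.

```python
def naive_likelihood(feature_likelihoods):
    """Naive assumption: P(X|H) is the product of the per-feature P(x_j|H)."""
    result = 1.0
    for p in feature_likelihoods:
        result *= p
    return result


def posterior(likelihood, prior, evidence):
    """Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    if evidence <= 0:
        raise ValueError("P(X) must be positive")
    return likelihood * prior / evidence


# Made-up numbers: two classes with equal priors and two observed features.
prior_h1, prior_h0 = 0.5, 0.5
lik_h1 = naive_likelihood([0.8, 0.5])  # P(X|H=1)
lik_h0 = naive_likelihood([0.2, 0.5])  # P(X|H=0)

# P(X) via the law of total probability over both classes.
evidence = lik_h1 * prior_h1 + lik_h0 * prior_h0

print(posterior(lik_h1, prior_h1, evidence))
print(posterior(lik_h0, prior_h0, evidence))
```

The two posteriors sum to 1, and the classifier would predict the class with the larger one.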

Random Forest
• It is a flexible and simple algorithm.
• The random forest algorithm helps avoid the overfitting problem.
• Used for identifying the most important features from the training dataset.
• It can be used for both classification and regression tasks.
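Two of the ingredients behind a random forest, bootstrap sampling of the training data and majority voting across trees, can be sketched as follows. This is not a full random forest: no trees are trained here and the random feature selection at each split is omitted. The function names and the stand-in data are assumptions for illustration.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement; each tree would train on one such sample."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine the class predictions of the individual trees for one instance."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
rows = list(range(10))   # stand-in for ten training instances

sample = bootstrap_sample(rows, rng)
print(len(sample), set(sample) <= set(rows))

# Hypothetical predictions from five trees for a single instance.
print(majority_vote([1, 0, 1, 1, 0]))
```

Because each tree sees a different resample, their individual errors tend to differ, which is the usual explanation for why the combined vote overfits less than a single tree.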

Logistic Regression
• Used to assign observations to a discrete set of classes.
• Logistic regression can be binomial, ordinal or multinomial:
  • Binomial (Pass/Fail)
  • Multinomial (Cats, Dogs, Sheep)
  • Ordinal (Low, Medium, High)
• You can view the probability scores underlying the model's classifications.
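For the binomial case, the probability score behind a classification can be sketched with the logistic (sigmoid) function applied to a weighted sum of the inputs. The weights, bias, and 0.5 threshold below are hypothetical values chosen for illustration, not fitted to any dataset.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, x):
    """Probability of the positive class for one observation x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def predict_class(weights, bias, x, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold, else class 0."""
    return 1 if predict_proba(weights, bias, x) >= threshold else 0


# Hypothetical model with two features.
weights, bias = [1.0, -1.0], 0.0
print(predict_proba(weights, bias, [2.0, 1.0]), predict_class(weights, bias, [2.0, 1.0]))
print(predict_proba(weights, bias, [1.0, 2.0]), predict_class(weights, bias, [1.0, 2.0]))
```

Inspecting `predict_proba` rather than only the class shows how confident the model is in each assignment.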

Neural Network
• Neural networks are learning algorithms.
• They interpret sensory data through a kind of machine perception, labeling or clustering raw input.
• They consist of different layers for analyzing and learning from data.
Math equation: f(x) = b + \sum_{i} w_i x_i
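The weighted-sum formula for a single neuron, and how several neurons form a layer, can be sketched in plain Python. The ReLU activation is an assumption added for illustration (the slide gives only the weighted sum), and the weights are made-up values.

```python
def weighted_sum(weights, bias, x):
    """f(x) = b + sum_i w_i * x_i for a single neuron."""
    if len(weights) != len(x):
        raise ValueError("weights and inputs must have the same length")
    return bias + sum(w * xi for w, xi in zip(weights, x))


def relu(z):
    """Assumed activation: pass positive values through, clamp negatives to 0."""
    return max(0.0, z)


def layer(weight_rows, biases, x):
    """One layer: every neuron applies its own weights and bias to the same input."""
    return [relu(weighted_sum(w, b, x)) for w, b in zip(weight_rows, biases)]


# Made-up weights: a single neuron, then a layer of two neurons.
print(weighted_sum([0.5, -1.0], 0.25, [2.0, 1.0]))
print(layer([[1.0, 0.0], [0.0, -1.0]], [0.0, 0.0], [2.0, 3.0]))
```

Stacking several such layers, each feeding its outputs to the next, gives the multi-layer structure the slide refers to.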

Table to compare data

Algorithm            Recall  Precision  F-measure
Neural Network       0.813   0.814      0.814
Logistic Regression  0.848   0.848      0.848
Random Forest        0.807   0.807      0.807
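The three scores in the comparison are related: the F-measure is the harmonic mean of precision and recall. The sketch below shows the standard definitions from counts of true positives, false positives, and false negatives; the counts used are made up. Scores reported by a tool may be averaged over classes, so a reported F-measure need not equal the harmonic mean of the reported precision and recall exactly.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts for one class.
p, r = precision_recall(tp=8, fp=2, fn=2)
print(p, r, f_measure(p, r))

# When precision and recall are equal, the F-measure equals both,
# as in the Logistic Regression row of the table.
print(round(f_measure(0.848, 0.848), 3))
```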

Projects:
1. Traffic Communication Data Analysis
2. Job Scam Data Analysis
3. Email Communication Data Analysis
4. Social Media Data Analysis
5. Healthcare Data Analysis

EMAIL COMMUNICATION DATA ANALYSIS

