A Simulated Decision Trees Algorithm (SDT)
Mohamed H. Farrag (1), Maha M. Hana (2), Mona M. Nasr (2)
(1) Theodor Bilharz Research Institute, Ministry of Scientific Research, Cairo, Egypt
eng.m.hassan@gmail.com
(2) Faculty of Computers and Information, Helwan University, Cairo, Egypt
mahana_eg@yahoo.com, m.nasr@helwan.edu.eg
Abstract — Customer information stored in databases has increased dramatically in the last few years. Data mining is a good approach for dealing with this volume of information to enhance the customer-service process. One of the most important and powerful data mining techniques is the decision trees algorithm. It is appropriate for large and sophisticated business areas, but it is complicated, costly, and not easy for non-specialists to use. To overcome this problem, SDT is proposed: a simple, powerful, and low-cost methodology that simulates the decision trees algorithm for business scopes of different natures. The SDT methodology consists of three phases. The first phase is data preparation, which prepares the data for computation; the second phase is the SDT algorithm, which simulates the decision trees algorithm to find the most important rules that distinguish a specific type of customer; the third phase visualizes the results and rules for better understanding and clarity. In this paper the SDT methodology is tested on the German Credit Data dataset, which consists of 1000 instances belonging to the customers of a German bank. SDT successfully selects the most important rules and paths that reach the selected ratio for the tested cluster of customers, with interesting remarks and findings.
Keywords: Customer Relationship Management, Data Mining, Decision Trees, Customer Services
I. INTRODUCTION
The Simulated Decision Trees (SDT) methodology is a simple and powerful form of multiple-variable analysis and classification that aims to classify customer data and produce the most important characteristics distinguishing a specific type of customer.
This paper presents the SDT methodology, which retains all decision trees features while remaining suitable for small and medium organizations and different business fields, and easy to use even by non-specialists, for distinguishing customer clusters to enhance the customer-service process.
This paper is organized as follows: Section II reviews the related work, Section III discusses the SDT methodology, and Section IV explains the experiment. The results and the evaluation of the SDT methodology are presented in Section V; finally, conclusions and future work appear in the last section.
II. RELATED WORK
Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, potentially useful, and ultimately understandable patterns in data. The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, applying data mining tools, and displaying the discovered information.
Customer Relationship Management (CRM) is defined as "communication with the customer, developing products and services that customers need, and selling or supporting them in manners that customers desire" [1]. CRM is an enterprise approach to understanding and influencing customer behavior through meaningful communications in order to improve customer acquisition, retention, loyalty, and profitability, and to build profitable long-term relationships with customers in marketing, sales, customer service, and support.
The CRM framework is classified in two ways. The first is the operational classification, based on business automation. The second is the analytical classification, based on customer characteristics and behaviors, which helps an organization effectively allocate resources to the most profitable, or target, group of customers.
The CRM framework has four dimensions. The first is customer identification, concerned with target customer analysis and customer segmentation; the second is customer attraction, concerned with direct marketing; the third is customer retention, concerned with one-to-one marketing and loyalty programs; and the fourth is customer development, concerned with customer lifetime value analysis, up-selling, cross-selling, and market basket analysis.
Organizations need to discover the hidden knowledge in their stored data and use it to acquire and retain potential customers and maximize customer return value. This can be done with data mining tools, which help the organization better discriminate among customers and effectively allocate resources to the most profitable group. Data mining is also an important tool for transforming customer data into meaningful patterns that help predict and distinguish different customer clusters. Data mining is defined from different views and scopes. The first definition is "a branch of applied informatics which allows us to sift through large amounts of structured or unstructured data in an attempt to find hidden patterns and/or rules". The second defines data mining methods as "tools for searching databases with special algorithms to identify general patterns which can be used in the classification of individual observations and in making predictions". Another definition is "the search for valuable information in large volumes of data" [1].
Data mining plays several roles in business processes. The first role is association, which aims to establish relationships between items that exist together in a given record. The second is classification, used to build a model that predicts future customer behavior by classifying customer data into a number of predefined classes based on certain criteria. The third is clustering, which segments a heterogeneous population into a number of more homogeneous clusters. The fourth is forecasting, which estimates future values based on a record's patterns and deals with continuously valued outcomes. The fifth is regression, a statistical estimation technique that maps each data object to a real value to provide a predicted value. The sixth is sequence discovery, concerned with identifying associations or patterns over time. The seventh is visualization, which can be defined as the presentation of data so that users can view complex patterns.
The preceding data mining roles can be realized through the following algorithms. An association rule is defined as "a way to find interesting associations among large sets of data items", or as "if/then statements that help to uncover relationships between seemingly unrelated data in a relational database or other information warehouse". An association rule has two parts, an antecedent (if) and a consequent (then). The antecedent is an item found in the data; the consequent is an item found in combination with the antecedent [5].
A decision trees structure is a hierarchy of branches within branches that produces the characteristic inverted-tree form; the nested hierarchy of branches is called a decision tree. Decision trees are a simple yet powerful form of multiple-variable analysis, providing unique capabilities to supplement, complement, and substitute for a variety of data mining tools and techniques [4].
Genetic Algorithms (GAs) are adaptive heuristic search algorithms premised on the evolutionary ideas of natural selection and genetics. An Artificial Neural Network (ANN) is an information-processing paradigm whose key element is the novel structure of its information-processing system: a large number of highly interconnected processing elements working in harmony to solve specific problems.
K-Nearest Neighbor (K-NN) is a nonparametric method in which no parameters are estimated. The output variables can be interval variables, in which case K-NN is used for prediction, or categorical variables, either nominal or ordinal, in which case it is used for classification.
Linear discriminant analysis (LDA) and logistic regression (LR) are widely used multivariate statistical methods for analyzing data with categorical outcome variables. Both are appropriate for developing linear classification models. The difference between the two is that LR makes no assumptions about the distribution of the explanatory data, while LDA was developed to give better results when its normality assumptions are fulfilled.
Data mining has two families of methods, depending on the nature of use and the nature of the results. The first comprises predictive methods, which use some variables to predict unknown or future values of other variables. The second comprises descriptive methods, which find human-interpretable patterns that describe the data.
III. SIMULATED DECISION TREES
METHODOLOGY
The proposed methodology is a simulation of the decision trees structure that deals with multiple-variable datasets to identify the most important customer characteristics distinguishing a specific type of customer, with a visualization aid for presenting the results.
Figure (1): SDT Methodology Phases (Preparation Phase, Algorithm Phase, Visualize Results Phase)
A. Preparation Phase:
Figure (2): Preparation Phase Steps (Preprocessing Data, Enter Attributes and Its Sets, Enter Experimental Values)
Preparation phase consists of three steps as shown in
Figure (2).
1. Preprocessing Data
This step concerns data preparation, which aims to adapt the dataset values to be suitable for computer calculations. It enumerates the dataset attributes, revises the data for odd and missing values, and divides each attribute's values into 2-4 groups according to the range of values and the density distribution of each group.
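The grouping step can be sketched in a few lines of Python. This is only an illustrative sketch: it uses equal-frequency cut points, whereas the paper divides by value range and density distribution, and the function name is our own.

```python
def bin_attribute(values, n_groups=4):
    """Split a numeric attribute into n_groups ordered bins.

    A sketch of the 2-4 group split described above; equal-frequency
    cut points are an assumption, the paper uses range and density.
    """
    ordered = sorted(values)
    # pick cut points at the group boundaries of the sorted values
    cuts = [ordered[len(ordered) * i // n_groups] for i in range(1, n_groups)]

    def label(v):
        for i, cut in enumerate(cuts):
            if v < cut:
                return i
        return len(cuts)

    return [label(v) for v in values]

ages = [19, 22, 30, 34, 41, 47, 52, 65]
print(bin_attribute(ages, 4))  # → [0, 0, 1, 1, 2, 2, 3, 3]
```

In practice the bin labels would then be replaced by qualitative set codes such as A316-A319 for "Age in years".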
2. Enter Attributes and Its Sets
This step concerns entering the attribute names of the dataset, together with their short names for use by the algorithm, and then entering the ranges of each set in each attribute.
3. Enter Experimental Values
This step has the following sub-steps:
- Step 1: Select the attributes to which the algorithm will be applied.
- Step 2: Select the main attribute, one of whose sets will be chosen as the start node.
- Step 3: Select the start set node from the sets of the main attribute chosen in Step 2.
- Step 4: Enter the tested ratio "R", against which the algorithm compares and ranks the resulting paths and rules.
- Step 5: Enter the studied case value "C" in numeric format; the algorithm performs its calculations using the instances of this studied case in the dataset.
B. A Simulated Decision Trees Algorithm Phase:
The SDT algorithm is a proposed algorithm consisting of six procedures, as shown in Figure (3):
Figure (3): Algorithm Procedures (Generate Sets Paths Seeds, Generate Potential Paths, Generate Query Statements, Add Condition for Studied Case, Generate Ratio, Select Valid Paths)
- Procedure 1: Generate Sets Paths Seeds:
In this step the algorithm retrieves the sets of each selected attribute and then generates the seeds for every possible path starting from the selected node using these attributes.
- Procedure 2: Generate Potential Paths:
In this step the algorithm generates all possible paths and rules from the previous path seeds, then removes repetitive sets and eliminates redundant sets for the same attribute so that each path contains at most one set per attribute.
- Procedure 3: Generate Query Statements:
In this step the algorithm generates the selection query statement for each path using the attribute names and their sets.
- Procedure 4: Add Condition for Selected Studied Case:
In this step the algorithm takes the previously generated query statements, adds the condition for the studied case value on the sample dataset, and then calculates the number of instances for each path.
- Procedure 5: Generate Ratio:
In this step the algorithm generates the ratio for each path by dividing n, the number of instances of the studied case found in the dataset for the path, by N, the total number of instances in the dataset that belong to the path across all cases.
- Procedure 6: Select Valid Paths:
In this step the algorithm reports only the paths whose ratio is greater than or equal to the tested ratio "R".
C. Visualize Results Phase
This phase aims to visualize the results by displaying and drawing the selected paths from among all paths whose ratio is greater than or equal to the tested ratio "R".
IV. EXPERIMENT
Data Set Description
The algorithm uses the German Credit Data dataset; its source is Professor Dr. Hans Hofmann, Institute for Statistics, Hamburg University. The dataset contains 1000 instances divided into two clusters of customers, where clustering is based on measuring customer risk. The first cluster, "Low Risk", has 700 instances, while the second cluster, "High Risk", has 300 instances. The dataset has twenty attributes of two kinds: 7 numerical attributes and 13 qualitative attributes, differing in nature and scope. The attribute names and kinds are shown in Table (1).
Table (1): Attributes Names and Status
#   Attribute                                                   Status
1   Status of existing checking account                         qualitative
2   Duration in month                                           numerical
3   Credit history                                              qualitative
4   Purpose                                                     qualitative
5   Credit amount                                               numerical
6   Savings account/bonds                                       qualitative
7   Present employment since                                    qualitative
8   Installment rate in percentage of disposable income         numerical
9   Personal status and sex                                     qualitative
10  Other debtors / guarantors                                  qualitative
11  Present residence since                                     numerical
12  Property                                                    qualitative
13  Age in years                                                numerical
14  Other installment plans                                     qualitative
15  Housing                                                     qualitative
16  Number of existing credits at this bank                     numerical
17  Job                                                         qualitative
18  Number of people being liable to provide maintenance for    numerical
19  Telephone                                                   qualitative
20  Foreign worker                                              qualitative
A. Data preparation phase
This phase aims to detect and classify the dataset attributes, assign observation values from the dataset to each attribute, and clean the dataset of any missing values.
1. Preprocessing Data
This step consists of three main steps to ensure that each used attribute is a qualitative attribute with 2-4 sets according to its values and the density distribution of each group's instances. The numerical attributes that need to be converted to qualitative attributes are shown in Table (2).
- Step 1: Convert the numerical attributes to qualitative ones.
- Step 2: Divide the new attributes into 2-4 groups according to the range of their values and the density distribution of each group.
- Step 3: Replace the numeric values with the new qualitative values in the dataset.

Table (2): Numerical Attributes
Attribute                                                   Status
Duration in month                                           numerical
Credit amount                                               numerical
Installment rate in percentage of disposable income         numerical
Present residence since                                     numerical
Age in years                                                numerical
Number of existing credits at this bank                     numerical
Number of people being liable to provide maintenance for    numerical

Table (3): Example of Dataset Attributes and Their Sets after Preprocessing and Converting Numerical Attributes
#   Attribute                    Set    Set range / description
2   Duration in month            A301   ... <= 12
                                 A302   13 ... <= 24
                                 A303   25 ... <= 36
                                 A304   >= 37
3   Credit history               A30    no credits taken / all credits paid back duly
                                 A31    all credits at this bank paid back duly
                                 A32    existing credits paid back duly till now
                                 A33    delay in paying off in the past
                                 A34    critical account / other credits existing (not at this bank)
5   Credit amount                A305   <= 5000
                                 A306   5001 ... < 12000
                                 A307   > 12000
10  Other debtors / guarantors   A101   none
                                 A102   co-applicant
                                 A103   guarantor
13  Age in years                 A316   ... < 20 years
                                 A317   20 <= ... < 35 years
                                 A318   35 <= ... < 50 years
                                 A319   >= 50 years
17  Job                          A171   unemployed / unskilled, non-resident
                                 A172   unskilled, resident
                                 A173   skilled employee / official
                                 A174   management / self-employed / highly qualified employee / officer
2. Enter Attributes and Their Sets
This step aims, first, to enter the dataset attributes with their short names and codes for use by the algorithm. For example, the attribute "Age in Years" has the code "13" and the short name "AIY". Second, it enters the attribute set names, their short names, and their ranges, as shown in Table (4).
Table (4): Example of "Age in Years" Attribute Sets
Set     Range
A316    ... < 20 years
A317    20 <= ... < 35 years
A318    35 <= ... < 50 years
A319    >= 50 years
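The mapping of Table (4) from a numeric age to its qualitative set code is straightforward to express in code; the helper below is our own illustrative sketch of that lookup.

```python
def age_set(age):
    """Map a numeric age to its qualitative set code from Table (4)."""
    if age < 20:
        return "A316"   # ... < 20 years
    elif age < 35:
        return "A317"   # 20 <= ... < 35 years
    elif age < 50:
        return "A318"   # 35 <= ... < 50 years
    return "A319"       # >= 50 years

print([age_set(a) for a in (19, 34, 35, 50)])
# → ['A316', 'A317', 'A318', 'A319']
```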
3. Enter Experimental Values
This step has five sub-steps to ensure that all parameters and inputs to the algorithm are clearly defined:
- Step 1: Select the attributes, with their sets, from the attribute list on which the algorithm will run. The attributes selected for this experiment are shown in Table (5).
Table (5): Selected Attributes to Test by the Algorithm
Selected Attributes
Credit Amount
Credit History
Duration in Month
Housing
- Step 2: Select the main attribute, one of whose sets will serve as the start node. In this example, the selected attribute is "Job".
- Step 3: Select the start set node from the "Job" attribute sets to be the start node of the algorithm. In this example, the selected set is "A173", which refers to "skilled employee / official".
- Step 4: Enter the selected ratio "R"; every accepted path must achieve a ratio greater than or equal to it, where the ratio is the number of instances on the path that belong to the tested case divided by the total number of instances on the path. In this example, the selected ratio is 65%.
- Step 5: Enter the tested case "C" in numeric format (e.g., 1 or 2) to identify the cluster on whose instances the calculations are performed. In this example, the tested case is "1", which corresponds to "Low Risk" customers.
B. The Simulated Decision Trees Algorithm (SDT) Procedures:
- Procedure 1: Generate Sets Paths Seeds:
This procedure retrieves the sets of each selected attribute, as shown in Table (6), and sets the start node, as shown in Table (7).
Table (6): Selected Attributes Sets
Attribute            Set    Set range
Credit amount        A305   <= 5000
                     A306   5001 ... < 12000
                     A307   > 12000
Credit history       A30    no credits taken / all credits paid back duly
                     A31    all credits at this bank paid back duly
                     A32    existing credits paid back duly till now
                     A33    delay in paying off in the past
                     A34    critical account / other credits existing (not at this bank)
Duration in month    A301   ... <= 12
                     A302   13 ... <= 24
                     A303   25 ... <= 36
                     A304   >= 37
Housing              A151   rent
                     A152   own
                     A153   for free
Table (7): Start set node.
Attribute Set Set range
Job A173 skilled employee / official
- Procedure 2: Generate Potential Paths:
This procedure generates all possible paths from the previous path seeds, starting from the selected node and using these attributes; it then removes repetitive sets and eliminates redundant sets for the same attribute so that each path contains at most one set per attribute, as shown in Table (8). On completing this procedure the algorithm yields 272 candidate paths.
Table (8): Example of Generated Paths and Rules
Path ID   Attributes and Sets
1         Job = A173
2         Job = A173, Credit amount = A305
3         Job = A173, Credit amount = A306
4         Job = A173, Credit amount = A307
5         Job = A173, Credit history = A30
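The enumeration of Procedures 1-2 can be sketched with the standard library, using the attributes of Table (6) and the start node of Table (7). This is a minimal illustration, not the paper's implementation; the raw enumeration below is not pruned further, so its total count need not equal the 272 paths reported for the experiment.

```python
from itertools import combinations, product

# Selected attributes and their sets, taken from Table (6).
attributes = {
    "Credit amount": ["A305", "A306", "A307"],
    "Credit history": ["A30", "A31", "A32", "A33", "A34"],
    "Duration in month": ["A301", "A302", "A303", "A304"],
    "Housing": ["A151", "A152", "A153"],
}
start = ("Job", "A173")  # start set node from Table (7)

# Every path begins at the start node and uses at most one set per
# attribute, which removes repeated and redundant sets by construction.
paths = [(start,)]
for r in range(1, len(attributes) + 1):
    for combo in combinations(attributes, r):
        for sets in product(*(attributes[a] for a in combo)):
            paths.append((start,) + tuple(zip(combo, sets)))

for p in paths[:5]:  # the first five paths mirror Table (8)
    print(p)
```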
- Procedure 3: Generate Query Statements:
This procedure generates the selection query statement for each path, using the attribute names and their sets, to run on the dataset and calculate the number of instances belonging to each path, as shown in Table (9).
Table (9): Example of the Selection Query Statement for Each Path
Id Path
1 job='A173'
2 job='A173' and Credit Amount='A305'
3 job='A173' and Credit Amount='A306'
4 job='A173' and Credit Amount='A307'
5 job='A173' and Credit History='A30'
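Rendering a path as its query condition, as in Table (9), is a simple string join; the helper below is a hypothetical sketch of Procedure 3.

```python
def path_query(path):
    """Join a path's (attribute, set) pairs into the conjunctive
    condition format used in Table (9)."""
    return " and ".join(f"{attr}='{s}'" for attr, s in path)

print(path_query([("job", "A173"), ("Credit Amount", "A305")]))
# → job='A173' and Credit Amount='A305'
```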
- Procedure 4: Add Condition for Selected Studied Case:
This procedure takes the previously generated query statements, adds the tested case value to each of them, and then calculates the number of instances for each path, as shown in Table (10).
Table (10): Example of the Selection Query Statement for Each Path with the Tested Case Condition
Id   Selection query statement for the tested case
1    Select Count(*) from samples where job='A173' and customer=1
2    Select Count(*) from samples where job='A173' and Credit Amount='A305' and customer=1
3    Select Count(*) from samples where job='A173' and Credit Amount='A306' and customer=1
4    Select Count(*) from samples where job='A173' and Credit Amount='A307' and customer=1
5    Select Count(*) from samples where job='A173' and Credit History='A30' and customer=1
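Instead of SQL, the same counting can be sketched directly over in-memory rows; the function and the tiny sample rows below are our own illustration of Procedure 4.

```python
def count_matches(rows, conds, case=None):
    """Count dataset instances satisfying all path conditions; when a
    case is given, also require the customer cluster to match
    (the Procedure 4 / Table (10) query with the case condition)."""
    return sum(
        1
        for row in rows
        if all(row.get(attr) == s for attr, s in conds)
        and (case is None or row["customer"] == case)
    )

rows = [  # hypothetical sample rows, not the real dataset
    {"job": "A173", "ca": "A305", "customer": 1},
    {"job": "A173", "ca": "A305", "customer": 2},
    {"job": "A173", "ca": "A306", "customer": 1},
]
print(count_matches(rows, [("job", "A173"), ("ca", "A305")], case=1))  # → 1
```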
- Procedure 5: Generate Ratio:
This procedure generates the ratio for each path by dividing n, the number of instances in the dataset that belong to the path for the tested case, by N, the total number of instances belonging to the path across all cases, as shown in Table (11).
Table (11): Example of Paths and Their Ratios for the Tested Case
Id    Path conditions for the tested case       Ratio
15    job='A173' and housing='A152'             0.75663717
241   job='A173' and housing='A152'             0.75663717
2     job='A173' and ca='A305'                  0.72952381
18    job='A173' and ca='A305'                  0.72952381
33    job='A173' and ca='A305'                  0.72952381
- Procedure 6: Select Valid Paths:
This procedure reports only the candidate paths whose ratio is greater than or equal to the selected ratio "R" = 65%, as shown in Table (12).
Table (12): Example of the Accepted Paths and Rules with Ratio >= the Selected Ratio "R" = 65%
Path id   Valid path                                   Ratio
146       job='A173' and ch='A34' and ca='A305'        0.8516129
9         job='A173' and ch='A34'                      0.83783784
145       job='A173' and ch='A34'                      0.83783784
162       job='A173' and dm='A301' and ca='A305'       0.79207921
10        job='A173' and dm='A301'                     0.7902439
Of the 272 candidate paths, the algorithm selects only the 43 valid paths whose ratio is greater than or equal to the selected ratio "R" = 65%.
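Procedures 5-6 amount to a filter-and-sort over the per-path counts; the sketch below uses hypothetical (n, N) counts (the first entry is chosen so its ratio matches the magnitude reported for path 146, but the actual counts are not given in the paper).

```python
def select_valid_paths(path_counts, R=0.65):
    """Procedures 5-6: ratio = n / N for each path (n = instances of the
    tested case on the path, N = all instances on the path); keep paths
    whose ratio is at least the tested ratio R, best first."""
    return sorted(
        ((path, n / N) for path, (n, N) in path_counts.items() if N and n / N >= R),
        key=lambda item: item[1],
        reverse=True,
    )

counts = {  # hypothetical (n, N) counts per path
    "job='A173' and ch='A34' and ca='A305'": (132, 155),
    "job='A173' and dm='A304'": (40, 90),
}
for path, ratio in select_valid_paths(counts):
    print(path, round(ratio, 3))  # only the first path survives R = 0.65
```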
C. Visualize Results Phase
This phase displays and draws selected paths from among all paths whose ratio is greater than or equal to the selected ratio "R". For example, path number 162 is drawn in Figure (4).
Figure (4): Example of Drawing Path Number 162
V. RESULTS AND EVALUATION OF SDT
The Simulated Decision Trees (SDT) methodology is a simple and powerful form of multiple-variable analysis and classification that aims to classify customer data and produce the most important characteristics distinguishing a specific type of customer. The SDT methodology has two main characteristics: first, it has no special requirements; second, its implementation is flexible enough to adapt to completely different data domains and dataset structures.
Our initial findings show some interesting results: the test conducted on our dataset indicates that the algorithm is accurate and that target-oriented campaigns based on it can be effective. It helps turn raw data into meaningful patterns that increase business knowledge and awareness, and it enables users to deploy that knowledge as a simple and powerful set of human-readable rules. Many benefits can be gained by applying this work to small and medium organizations in different business areas such as banks, call centers, markets, and training centers.
Further advantages include the ability to produce results on screen or as printed reports, to make a backup at the end of each procedure, and to let the user load the last experimental values and change the selected ratio "R" or the tested case "C". The methodology satisfies the needs and means of decision makers, is easy to use, is easy to modify, and is low cost.
VI. CONCLUSION AND FUTURE WORK
The SDT methodology is a newly proposed methodology for producing the most important characteristics and rules, from different customer attributes, that distinguish a specific cluster of customers. It has several advantages. The first is simplicity: the user can select the needed attributes without restriction, find the relationships among them, and change the selected ratio according to business requirements. The second is efficiency, for two reasons: it expresses complex alternatives clearly, and it allows exploring and modifying new information from the output. The third is its visualization aid: representing and visualizing the decision alternatives and possible outcomes helps in comprehending sequential decisions and outcome dependencies. The fourth is harmony, achieved by merging SDT with other project management tools to enhance the customer relationship management process.
It is suggested to test the SDT methodology on different data domains and to study its accuracy and capacity across different scopes and natures.
REFERENCES
[1] Gramatikov, M., (2003), Data Mining Techniques and the Decision Making Process in the Bulgarian Public Administration, NISP Acee Conference, Bucharest, Romania.
[2] Srivastava, J., (2000 January), Data Mining for Customer Relationship Management, ACM SIGKDD Explorations Newsletter, Volume 1, No 2.
[3] Agrawal, D., (2007 November), Building Profitable Customer Relationships with Data Mining, CSI Research Journal of India.
[4] Jia-Lang Seng & T.C. Chen, (2010 December), An Analytic Approach to Select Data Mining for Business Decision, Expert Systems with Applications, Volume 37, Pages 8042-8057.
[5] Giha, E. F., Singh, P.Y., & Ewe, T. H., (2006 June), Mining Generalized Customer Profiles, AIML 06 International Conference, Sharm El Sheikh, Egypt, Volume 6, Pages 141-147.
[6] Hsieh, N. & Chu, K., (2009), Enhancing Consumer Behavior Analysis by Data Mining Techniques, International Journal of Information and Management Sciences, Volume 20, No 1, Pages 39-53.
[7] İkizler, N., & Güvenir, H., (2001), Mining Interesting Rules in Bank Loans Data, Proceedings of the Tenth Turkish Symposium on Artificial Intelligence and Neural Networks, Pages 238-246.
[8] Bartok, J., Habala, O., Bednar, P., Gazak, M. & Hluchý, L., (2010), Data Mining and Integration for Predicting Significant Meteorological Phenomena, ICCS 2010, Procedia Computer Science, Volume 1, No 1, Pages 37-46.
[9] Çiflikli, C., & Kahya-Özyirmidokuz, E., (2010 December), Implementing a Data Mining Solution for Enhancing Carpet Manufacturing Productivity, Knowledge-Based Systems, Volume 23, No 8, Pages 783-788.
[10] Pauray S.M., (2010), Mining Top-k Frequent Closed Itemsets over Data Streams Using the Sliding Window Model, Expert Systems with Applications, Volume 37, Pages 6968-6973.
[11] Kamrunnahar, M., & Urquidi-Macdonald, M., (2010 March), Prediction of Corrosion Behavior Using Neural Network as a Data Mining Tool, Corrosion Science, Volume 52, No 3, Pages 669-677.
[12] Thanuja, V., Venkateswarlu, B., & Anjaneyulu, G. S. G. N., (2011 June), Applications of Data Mining in Customer Relationship Management, Journal of Computer and Mathematical Sciences, Volume 2, No 3, Pages 399-580.