Data Modeling using Symbolic Regression

Patrick Nicolas
http://patricknicolas.blogspot.com
07/13/2013

Need for reliability
Copyright 2013 Patrick Nicolas
Existing algorithms used for recommendation,
prediction of consumer behavior or targeted
advertising do not have to be very accurate: the
negative impact of recommending the wrong book or
movie, or of failing to detect the interest of a
consumer, is very limited.
However, some problems require a far more reliable
solution: failing to preserve large amounts of data,
to detect a security intrusion or to predict the
progress of a disease has grave consequences.

Options
Traditional data mining approaches such as
clustering (unsupervised learning) and generative
or discriminative supervised learning algorithms
fail to capture the evolutionary nature of a
system with its states and underlying data.

Supervised learning
Supervised learning is effective for problems with a
training set that is large compared to the dimension of
the model. However, it suffers from the following limitations:
• Over-fitting: a supervised learning algorithm needs
a large training set to account for bias in the training set
• No descriptive (human-readable) knowledge representation
• The role of the domain expert is limited to providing
labeled data and validating the results
• The model has to be retrained in case of false positives
or false negatives

Unsupervised learning
Unsupervised learning methods such as spectral
clustering and kernel-based K-means are used for anomaly
detection or dimension reduction but have drawbacks:
• Poor classification in the case of mixed discrete and
continuous variables
• No descriptive knowledge representation
• Limited leverage of domain expertise: the role of the
domain expert is limited to validating the clusters
• Clusters have to be rebuilt if the number of outliers
increases

Symbolic Regression
Symbolic Regression addresses the key limitations
of unsupervised and supervised learning methods.
It combines evolutionary computation with
reinforcement learning to provide domain experts
with a tool to create, evaluate and modify rules,
policies or models.
The most commonly used algorithms in Symbolic
Regression are:
• Genetic programming
• Learning classifier systems

Symbolic Regression
Symbolic Regression is used in very different
applications such as:
• Optimization of data archiving
• Intelligent data and instrumentation
streaming
• Predicting the behavior of an e-commerce site
during “flash” or holiday sales
• Monitoring and predicting security
vulnerabilities in data centers
• Distribution of network traffic and flow in a
public cloud

Symbolic representation
The goal is to extract knowledge from data (numerical,
textual, events, …) as a symbolic, human-readable
representation using primitives or operators:
• Boolean operators: OR, AND, XOR, …
• Numerical functions: sin, exp, sigmoid, …
• Numerical operators: +, *, o (composition), …
• Differential operators: derivative, integral, …
• Logical operators: predicates, rules, …
[Figure: Domain Expert, Data Mining and Data, with sample primitives: sin, exp, _ * _, If _ then _, _ has a _]

Knowledge Extraction
Knowledge extraction is the process of selecting and
combining the appropriate symbolic primitives or
operators to describe and predict the states of a system.
[Figure: Expertise, Model and sample primitives (sin, exp, _ * _, f”, If _ then _, _ has a _) linking System, State/Data and Prediction]
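The idea of combining primitives into a model can be sketched in a few lines of Python. This is only an illustration: the primitive table, the nested-tuple representation and the helper names (PRIMITIVES, evaluate, mean_squared_error) are assumptions and do not come from the slides. The candidate model is 2*sin(x) - exp(x*x), scored against sampled data.

```python
import math

# Hypothetical primitive table; the names and the selection are illustrative.
PRIMITIVES = {
    "sin": math.sin,
    "exp": math.exp,
    "sqr": lambda v: v * v,
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

def evaluate(expr, x):
    """Evaluate a nested-tuple expression such as ("sin", "x") at the point x."""
    if isinstance(expr, str):
        if expr != "x":
            raise ValueError(f"unknown variable: {expr}")
        return x
    if isinstance(expr, (int, float)):
        return float(expr)
    op, *args = expr
    return PRIMITIVES[op](*(evaluate(arg, x) for arg in args))

def mean_squared_error(expr, samples):
    """Average squared gap between the model output and the observed values."""
    return sum((evaluate(expr, x) - y) ** 2 for x, y in samples) / len(samples)

# Candidate model: 2*sin(x) - exp(x*x), built only from primitives.
model = ("-", ("*", 2, ("sin", "x")), ("exp", ("sqr", "x")))

# Synthetic observations generated from the same function, for illustration.
samples = [(x / 10, 2 * math.sin(x / 10) - math.exp((x / 10) ** 2))
           for x in range(-10, 11)]

error = mean_squared_error(model, samples)
```

A lower error means the combination of primitives describes the data better; a search procedure only needs this score to compare candidate models.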

Knowledge Primitives
The generation of knowledge from a set of symbolic
primitives to represent the underlying state of a system is an
NP problem (combinatorial explosion). Moreover, computers
process data in binary format (information theory).
[Figure: Value, Binary, Encoding]
The solution is to represent knowledge as symbolic
primitives encoded in binary format.

Knowledge Encoding
The most common representation is to encode
symbolic primitives as sequences of 0s and 1s.
Function: f(x) = 2*sin(x) - exp(x*x)
Tree: - ( * (sin, 2), o (exp, sqr))
Flattened: - * o sin 2 exp sqr
Storage: long long long
Binary data:
0101001001110111011101110111011101111111000111111011101101000001001000101010
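A minimal sketch of such an encoding, assuming a fixed-width code per symbol. The 3-bit code table below is hypothetical; the slide does not give the actual table, and the bit string it shows is not reproduced here.

```python
# Hypothetical 3-bit codes for the seven symbols of the flattened example.
SYMBOLS = ["-", "*", "o", "sin", "2", "exp", "sqr"]
BITS = 3
CODE = {symbol: format(index, f"0{BITS}b") for index, symbol in enumerate(SYMBOLS)}
DECODE = {bits: symbol for symbol, bits in CODE.items()}

def encode(tokens):
    """Concatenate the fixed-width code of each symbol into one bit string."""
    return "".join(CODE[token] for token in tokens)

def decode(bits):
    """Split the bit string into fixed-width groups and map them back to symbols."""
    if len(bits) % BITS != 0:
        raise ValueError("bit string length must be a multiple of the code width")
    return [DECODE[bits[i:i + BITS]] for i in range(0, len(bits), BITS)]

tokens = "- * o sin 2 exp sqr".split()
chromosome = encode(tokens)
restored = decode(chromosome)
```

With seven symbols and 3-bit codes one code (111) is unused, so a randomly altered bit string can decode to nothing; a real encoding has to decide how to handle such cases.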

Data Modeling using Genetic Algorithm
For a given state of a system we need to find the
optimal model (combination of primitives) that describes
the current state, using a Genetic Algorithm. The (0,1)
encoding is treated as a chromosome that is subject to
selection, cross-over, transposition and mutation operators.
[Figure: bit-string chromosomes illustrating cross-over (parents and off-springs), mutation and transposition (segment markers s and e)]
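The three genetic operators can be sketched on bit strings as follows. The function names and the exact definitions are assumptions (single-point cross-over, single-bit mutation, and transposition as moving a segment); the slides do not specify which variants are used. Positions are passed explicitly to keep the example deterministic; in practice they would be drawn at random.

```python
def crossover(parent_a, parent_b, point):
    """Single-point cross-over: the two parents swap their tails after `point`."""
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chromosome, index):
    """Flip the bit at position `index`."""
    flipped = "1" if chromosome[index] == "0" else "0"
    return chromosome[:index] + flipped + chromosome[index + 1:]

def transpose(chromosome, start, end, dest):
    """Cut the segment [start, end) and reinsert it at `dest` in the remainder."""
    segment = chromosome[start:end]
    rest = chromosome[:start] + chromosome[end:]
    return rest[:dest] + segment + rest[dest:]

offspring_a, offspring_b = crossover("11110000", "00001111", 3)
mutated = mutate("1010", 1)
transposed = transpose("11000000", 0, 2, 4)
```

All three operators preserve the chromosome length, so the result can always be decoded with the same fixed-width scheme.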

Computation Flow of Genetic Algorithm
Once the initial set of chromosomes is randomly
generated, the algorithm iterates until the fittest
chromosome emerges.
[Figure: flow from Initial Pool of Models through Encoding to Initial Chromosomes; loop over New population with Fitness, Selection, Cross-over, Mutation and Transposition; then Fittest Chromosome, Decoding and Best Model]
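The iteration can be sketched as a short loop. This is a simplified illustration under several assumptions: truncation selection, elitism, single-point cross-over and per-bit mutation are choices made here, transposition is left out for brevity, and the fitness function (number of bits matching a hypothetical TARGET) stands in for a real model score.

```python
import random

def evolve(fitness, length, pop_size=30, generations=60, p_mut=0.02, seed=0):
    """Return the fittest chromosome found after a fixed number of generations."""
    rng = random.Random(seed)
    # Initial chromosomes are generated at random.
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]   # selection: keep the fitter half
        children = [ranked[0]]              # elitism: the best always survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            point = rng.randrange(1, length)            # cross-over point
            child = a[:point] + b[point:]
            child = "".join(                            # per-bit mutation
                ("1" if bit == "0" else "0") if rng.random() < p_mut else bit
                for bit in child)
            children.append(child)
        population = children
    return max(population, key=fitness)

# Hypothetical objective: match a fixed target bit pattern.
TARGET = "1011001110001111"

def matching_bits(chromosome):
    return sum(c == t for c, t in zip(chromosome, TARGET))

best = evolve(matching_bits, len(TARGET))
```

Because the best chromosome of each generation is carried over unchanged, the best fitness in the population never decreases from one generation to the next.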

Limitation of Genetic Algorithm
The selection of the best chromosome, representing
the best classifier (or model), relies on the
computation of a fitness value under the assumption
that the objective does not change over time.
As most systems evolve over time, so does the
objective. Reinforcement learning is used to adjust
the objective using a reward/credit assignment
mechanism.

Concept of Reinforcement Learning
As the state of the system evolves over time, the system
rewards or punishes the fittest classifier, whose action
has been executed. The reward or punishment is
used to adjust the objective and the fitness function.
[Figure: the System, with Probes, Effectors and Reward, connected to Encoding, Primitives, the Genetic Algorithm, the Best classifier, Decoding, the Best Action and Reward Assignment]
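One simple way to picture the reward assignment is to move the fitness of the executed classifier toward the reward it receives. The sketch below is an assumption, not the mechanism from the slides: the Classifier fields, the learning rate and the update rule (a plain moving average) are all illustrative.

```python
from dataclasses import dataclass

@dataclass
class Classifier:
    rule: str             # encoded chromosome (hypothetical values below)
    action: str           # action triggered when the rule matches
    fitness: float = 0.0

def assign_reward(classifier, reward, learning_rate=0.2):
    """Move the classifier's fitness a fraction of the way toward the reward."""
    classifier.fitness += learning_rate * (reward - classifier.fitness)
    return classifier.fitness

def best_classifier(population):
    """Return the classifier with the highest current fitness."""
    return max(population, key=lambda c: c.fitness)

population = [
    Classifier(rule="0101", action="throttle", fitness=0.5),
    Classifier(rule="1100", action="archive", fitness=0.4),
]

first_choice = best_classifier(population)      # "throttle" is executed first
assign_reward(first_choice, reward=-1.0)        # the system punishes it
second_choice = best_classifier(population)     # ranking changes after the reward
```

After the punishment the fitness of the first classifier drops below that of the second, so the next selection picks a different action without the objective being redefined by hand.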

Elements of Reinforcement Learning
The main challenge of reinforcement learning is to predict the impact
of each action An on the global state. We need:
• Actions (or classifiers) that support logical (IF/THEN), numerical
(y = f(x1, …, xn)) and discrete ({ai}) classifiers to predict the impact
of a remedial action on the security of the system
• A metric to measure the security of the overall system (distance
between the current state and the baseline)
• An action discovery and adaptation mechanism
• An efficient optimizer to select the best action in any state:
Stochastic Gradient Descent for continuous variables {xi} only, or a
Genetic Algorithm for a mix of Boolean, Integer and Double variables
Putting All Together
[Figure: overall architecture. Components shown: Environment (Probes, Effectors, Reward); Initial Knowledge; Encoding; Expert; Supervised Learning; Classifiers Population; Genetic Algorithm (Select, Cross-over, Mutate, Transpose); Match; Best Classifiers; Actions Predictor; Action; Reinforcement Learning (Q-Learning, Reward Assignment)]
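The Q-Learning component can be sketched with the standard tabular update, Q(s,a) <- Q(s,a) + alpha * (r + gamma * max Q(s',a') - Q(s,a)). The state names, action names, reward values and parameters below are hypothetical and only illustrate how a reward changes the estimated value of an action.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update and return the new estimate."""
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]

q = defaultdict(float)            # unseen (state, action) pairs start at 0.0
actions = ["block", "allow"]

# Blocking during an alert brings the system back to normal: reward.
q_update(q, "alert", "block", reward=1.0, next_state="normal", actions=actions)
# Allowing during an alert leaves the system in alert: punishment.
q_update(q, "alert", "allow", reward=-1.0, next_state="alert", actions=actions)

greedy_action = max(actions, key=lambda a: q[("alert", a)])
```

After these two updates the estimate for "block" in the "alert" state is higher than for "allow", so a greedy policy would pick "block".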

References
• Genetic Programming: On the Programming of Computers by
Means of Natural Selection - J. Koza
• Reinforcement Learning: An Introduction (Adaptive
Computation and Machine Learning) - R. Sutton, A. Barto
• http://www.mendeley.com/catalog/symbolic-regression-via-genetic-programming/
