Random forest
1.
2. Random forest is a classifier
An ensemble classifier built from many decision tree models.
Can be used for both classification and regression
Accuracy and variable importance information are provided with the result
A random forest is a collection of unpruned, CART-like trees following specific rules for
Tree growing
Tree combination
Self-testing
Post-processing
Trees are grown using binary partitioning
3. Similar to a decision tree, with a few differences
At each split point, the search is not over all variables but only over a random subset of them
No pruning is necessary: trees can be grown until each node contains just a few
observations
Advantages over a single decision tree
Better prediction (in general)
No parameter tuning necessary with RF
Terminology
Training size (N)
Total number of attributes (M)
Number of attributes used (m)
Total number of trees (n)
4. A random seed is chosen, which pulls out at random a collection of samples from the
training dataset while maintaining the class distribution
From this selected dataset, a random set of attributes is chosen, based on user-defined
values. Not all input variables are considered, because of the enormous computation
involved and the high chance of overfitting
Where M is the total number of input attributes in the dataset, only m attributes are
chosen at random for each tree, where m < M
The attribute from this set that creates the best possible split, as measured by the Gini
index, is used to grow the decision tree. This process repeats for each branch until the
termination condition is met: the leaves are nodes that are too small to split. A minimal
sketch of this procedure is shown below.
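A sketch of the training loop just described, assuming scikit-learn's DecisionTreeClassifier as the base learner; the dataset and the parameter values (n_trees, m) are illustrative, and the plain bootstrap here does not stratify by class the way the slide suggests.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)       # the random seed
X, y = load_iris(return_X_y=True)
N, M = X.shape                        # training size N, total attributes M
m = int(np.sqrt(M))                   # attributes searched per split, m < M
n_trees = 25                          # total number of trees n

forest = []
for _ in range(n_trees):
    idx = rng.integers(0, N, size=N)  # bootstrap sample of size N
    tree = DecisionTreeClassifier(
        criterion="gini",             # best split chosen by the Gini index
        max_features=m,               # search only m of the M attributes per split
    )                                 # no pruning: grown until leaves are pure or tiny
    forest.append(tree.fit(X[idx], y[idx]))

# Tree combination: majority vote over the ensemble
votes = np.stack([t.predict(X) for t in forest])
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print((majority == y).mean())         # training accuracy of the ensemble
```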
5. Information from a random forest (read off a fitted forest in the sketch after this slide)
Classification accuracy
Variable importance
Outliers (classification)
Missing data estimation
Error rates for the random forest object
Advantages
No need for pruning trees
Accuracy and variable importance are generated automatically
Overfitting is rarely a problem
Not very sensitive to outliers in the training data
Easy to set parameters
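A sketch, assuming scikit-learn's RandomForestClassifier and its built-in iris data, of how the accuracy estimate, error rate, and variable importance listed above can be obtained; the out-of-bag (OOB) score is the self-testing mentioned on slide 2.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

print("OOB accuracy:", rf.oob_score_)        # self-tested classification accuracy
print("OOB error rate:", 1 - rf.oob_score_)  # error rate for the forest object
print("Variable importance:", rf.feature_importances_)
```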
6. Limitations
Regression cannot predict beyond the range of values in the training data (illustrated
below)
Extreme values are often not predicted accurately
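A quick illustration of that extrapolation limit, on made-up data, assuming scikit-learn's RandomForestRegressor: the forest's prediction is an average of training targets, so it cannot exceed the range it saw.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = 3.0 * x.ravel()                   # the true relationship keeps growing

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(x, y)
print(model.predict([[20.0]]))        # stays near 30 (the training max), not the true 60
```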
Applications
Classification
Land cover classification
Cloud screening
Regression
Continuous field mapping
Biomass mapping
7. Efficient use of Multi-Core Technology
Tree growing is embarrassingly parallel, so although performance is OS dependent,
running random forest on Hadoop makes efficient use of multiple cores
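Hadoop aside, the same independence can be exploited on a single machine; a one-line sketch assuming scikit-learn, where n_jobs=-1 grows the trees on all available cores.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1).fit(X, y)  # all cores
```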
8. Winnow is a technique from machine learning for learning a linear classifier from
labelled examples
Similar to the perceptron algorithm
While the perceptron algorithm uses an additive weight-update scheme, Winnow uses a
multiplicative weight-update scheme
Performs well when many of the features given to the learner turn out to be irrelevant
During training, it is shown a sequence of positive and negative examples. From these it
learns a decision hyperplane, which can then be used to classify novel examples as
positive or negative
Uses a linear threshold function (like the perceptron training algorithm) as its hypothesis
and performs incremental updates to its current hypothesis
9. Initialize the weights w1, …, wn to 1
Both Winnow and the perceptron algorithm use the same classification scheme
The Winnow algorithm differs from the perceptron algorithm in its update scheme:
When misclassifying a positive training example x (i.e. the prediction was negative because
w·x was too small), the weights of the active features are increased
When misclassifying a negative training example x (i.e. the prediction was positive because
w·x was too large), the weights of the active features are decreased
(the exact doubling and halving rules appear on slide 12)
10. SPAM example – each email is a Boolean vector indicating which phrases appear
and which don't
An email is SPAM if at least one of the phrases in S (a fixed set of spam phrases) is present
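How that Boolean encoding might look in practice; the phrase list and the sample email here are invented for illustration.

```python
phrases = ["free money", "click here", "meeting agenda", "your invoice"]
S = {"free money", "click here"}          # spam-indicating phrases

email = "Click here to claim your free money now"
x = [int(p in email.lower()) for p in phrases]   # Boolean feature vector
is_spam = any(xi and p in S for xi, p in zip(x, phrases))
print(x, is_spam)                                # [1, 1, 0, 0] True
```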
11.
12. Initialize the weights w1, …, wn = 1 on the n variables
Given an example x = (x1, …, xn), output 1 if
w1x1 + w2x2 + … + wnxn ≥ n
Else output 0
If the algorithm makes a mistake:
On positive – if it predicts 0 when f(x) = 1, then for each xi equal to 1, double the value of
wi
On negative – if it predicts 1 when f(x) = 0, then for each xi equal to 1, cut the value of wi
in half
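A compact sketch of this algorithm, trained on a made-up disjunction target in the spirit of the SPAM example; the threshold n and the factor-of-2 updates follow the slide.

```python
import numpy as np

def winnow_train(X, y, epochs=10):
    """X: 0/1 matrix (examples x features); y: 0/1 labels."""
    n = X.shape[1]
    w = np.ones(n)                          # initialize w1..wn to 1
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = 1 if w @ x >= n else 0   # linear threshold at n
            if pred == 0 and label == 1:    # mistake on a positive example
                w[x == 1] *= 2.0            # double the active weights
            elif pred == 1 and label == 0:  # mistake on a negative example
                w[x == 1] /= 2.0            # halve the active weights
    return w

# Toy target: positive iff feature 0 or feature 2 is present (the phrase set S)
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 8))
y = ((X[:, 0] == 1) | (X[:, 2] == 1)).astype(int)
w = winnow_train(X, y)
print(np.round(w, 2))                       # weights on features 0 and 2 dominate
```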
13.
14. The principle of maximum entropy states that, subject to precisely stated prior
data, the probability distribution which best represents the current state of
knowledge is the one with the largest entropy.
Commonly used in natural language processing, speech processing, and information
retrieval
What is a maximum entropy classifier?
A probabilistic classifier which belongs to the class of exponential models
Does not assume that the features are conditionally independent of each other
Based on the principle of maximum entropy: of all the models that fit the training data,
it selects the one with the largest entropy
15. A piece of information is testable if it can be determined whether a given
distribution is consistent with it
"The expectation of the variable x is 2.87"
and "p2 + p3 > 0.6"
are statements of testable information
The maximum entropy procedure consists of seeking the probability distribution which
maximizes information entropy, subject to the constraints of that information.
With no testable information, entropy maximization takes place under a single
constraint: the sum of the probabilities must be one (which yields the uniform
distribution). A numeric sketch follows below.
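A small numeric sketch, assuming a hypothetical four-outcome variable x in {1, 2, 3, 4} and scipy: maximize the entropy subject to the testable constraint E[x] = 2.87 and the universal sum-to-one constraint.

```python
import numpy as np
from scipy.optimize import minimize

xs = np.array([1.0, 2.0, 3.0, 4.0])       # hypothetical support of x

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1.0)            # avoid log(0)
    return np.sum(p * np.log(p))          # minimizing -H(p) maximizes entropy

constraints = [
    {"type": "eq", "fun": lambda p: p.sum() - 1.0},         # probabilities sum to 1
    {"type": "eq", "fun": lambda p: (p * xs).sum() - 2.87}, # E[x] = 2.87
]
p0 = np.full(4, 0.25)                     # start from the uniform distribution
res = minimize(neg_entropy, p0, bounds=[(0, 1)] * 4, constraints=constraints)
print(np.round(res.x, 4))                 # an exponential-family distribution tilted toward 4
```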
16. When to use maximum entropy?
Since it makes minimal assumptions, we use it when we know nothing about the prior
distribution
Used when we cannot assume conditional independence of the features
The principle of maximum entropy is commonly applied in two ways to inferential
problems
Prior probabilities: it is often used to obtain the prior probability distribution for Bayesian
inference
Maximum entropy models: model specifications widely used in natural language
processing, e.g. logistic regression (a minimal sketch follows below)
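Multinomial logistic regression is the standard example of a maximum entropy model; a minimal sketch assuming scikit-learn and its built-in iris data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
# softmax (multinomial) logistic regression is an exponential model: it fits
# the training data while keeping the class distribution otherwise maximally flat
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:2]))       # class probabilities from the fitted model
```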