# Classifiers

Ayurdata
29 Dec 2019

### Use of classifiers in research problems

Classifiers are algorithms that map input data to one of a set of output categories. They can be used to build dynamic models with high precision and accuracy, so that the resulting model can predict or classify previously unseen data points. Classifiers have found wide use in data science applications in various domains. For instance, classifying a new tumour as malignant or benign, identifying a mail as spam or ham, and marking an insurance claim as possibly fraudulent or genuine are all instances of classification.

Classification algorithms use training data, i.e., they learn from example data and build a model or procedure to identify a new data point as belonging to a particular category. They therefore belong to the class of supervised learning methods. There are a number of classifiers that can be used to classify data on the basis of historic, already existing data. A very short description of these methods is given here just to introduce the concepts.

### Logistic Regression

As a simple case, consider a logistic model with two predictors, $x_1$ and $x_2$, and one binary response variable Y, for which we write $p = P(Y = 1)$. We assume a linear relationship between the predictor variables and the log-odds of the event. This relationship can be expressed as

$$\log \frac{p}{1 - p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

By simple algebraic manipulation, the probability that Y = 1 is

$$p = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2} + 1}$$

The above formula shows that once the $\beta$'s are estimated, we can compute the probability that Y = 1 for a given observation, or of its complement Y = 0.

### Decision Trees

In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables. The end result of the algorithm is a tree-like structure with root, branch and leaf nodes (the target variable). Decision trees can use any of several algorithms to decide how to split a node into two or more sub-nodes.
The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. Although several criteria such as the Gini index, chi-square and reduction in variance are available for choosing a split, one popular measure used for splitting is the information gain. This is equivalent to selecting the split with the maximum reduction in entropy, as measured by Shannon's index (H),

$$H = -\sum_{i=1}^{s} p_i \log p_i$$

where s is the number of groups at a node and $p_i$ indicates the proportion of individuals in the ith group.
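The entropy-based splitting criterion described above can be illustrated with a minimal Python sketch. The function names `shannon_entropy` and `information_gain`, the use of base-2 logarithms, and the toy labels are assumptions made for this illustration; the text does not fix a logarithm base.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon's index H = -sum(p_i * log2(p_i)) over the groups in `labels`.

    Base-2 logarithms are an assumption; any base gives the same ranking of splits.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent node minus the size-weighted entropy of its sub-nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * shannon_entropy(child) for child in children)
    return shannon_entropy(parent) - weighted

# Toy node (hypothetical data): 4 malignant and 4 benign cases.
parent = ["malignant"] * 4 + ["benign"] * 4

# Candidate split A separates the classes perfectly.
split_a = [["malignant"] * 4, ["benign"] * 4]
# Candidate split B leaves each sub-node as mixed as the parent.
split_b = [["malignant", "malignant", "benign", "benign"],
           ["malignant", "malignant", "benign", "benign"]]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(gain_a, gain_b)
```

Under this criterion, split A would be preferred, since it removes all of the parent node's entropy while split B removes none.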
### Random Forests

Ensemble learning is a type of supervised learning technique in which the basic idea is to generate multiple models on a training dataset and then combine (average) their output rules or hypotheses to generate a stronger model that performs very well. Random forest is a classic case of ensemble learning. Decision trees are considered very simple and easily interpretable, but a major drawback is that they often have poor predictive performance and generalize poorly to the test set, and so are sometimes called weak learners.

In the context of decision trees, random forest is a model based on multiple trees. Rather than simply averaging the predictions of individual trees (which we could call a ‘forest’), this model uses two key concepts that give it the name ‘random’, viz., (i) random sampling of training data points when building trees and (ii) random subsets of features considered when splitting nodes. The idea here is that instead of producing a single complex model, which might have a high variance that leads to overfitting, or a single model that is too simple and has a high bias that leads to underfitting, we generate many models using the training set and combine them at the end.

### Support Vector Machines

Given a set of training examples, each marked as belonging to one or the other of two categories, a Support Vector Machine (SVM) training algorithm builds a model that assigns new examples to one category or the other. In theory, SVM is a discriminative classifier formally defined by a separating hyperplane. In other words, given labelled training data, the algorithm outputs an optimal hyperplane which categorizes new examples. Thus, the hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. Also, the dimension of the hyperplane depends upon the number of features.
If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. In practice, there are many hyperplanes that might classify the data. One reasonable choice as the best hyperplane is the one that represents the largest separation, or margin, between the two classes. So, we choose the hyperplane such that the distance from it to the nearest data point on each side is maximized.

### Naïve Bayes Classifier

The naive Bayes algorithm is a probabilistic technique which is simple yet powerful enough that it is sometimes reported to outperform more complex algorithms on very large datasets. The foundation of the naive Bayes algorithm is Bayes' theorem, which states that in a sequence of events, if A is the first event and B is the second event, then P(B/A), the probability of B given A, is obtained by the expression

$$P(B/A) = \frac{P(B)\, P(A/B)}{P(A)}$$

The naive Bayes algorithm is called naive not because it is simple. It is because the algorithm makes a very strong assumption that the features in the data are independent of each other. In other words, it assumes that the presence of one feature in a class is completely unrelated to the presence of all other features. If this assumption of independence holds, naive Bayes performs extremely well and often better than other models. Mathematically,
$$P(X_1, \ldots, X_n / Y) = \prod_{i=1}^{n} P(X_i / Y)$$

In order to create a classifier model, we find the probability of a given set of inputs for all possible values of the class variable Y and pick the output with maximum probability. This can be expressed as

$$\hat{Y} = \underset{Y}{\arg\max}\; P(Y) \prod_{i=1}^{n} P(X_i / Y)$$

### Neural Networks

A neural network is a series of algorithms that endeavours to recognize underlying relationships in a set of data through a process that mimics the way the human brain operates. The basic computational unit of the brain is a neuron. In comparison, a ‘neuron’ in a neural network, also called a perceptron, is a mathematical function that collects and classifies information according to a specific architecture. The perceptron receives input from some other nodes, or from an external source, and computes an output. Each input has an associated weight (w), which is assigned on the basis of its importance relative to other inputs. The node applies a nonlinear function to the weighted sum of its inputs to create the output.

The idea is that the synaptic strengths (the weights w) are revisable based on learning from the training data, which in turn controls the strength and direction of their influence. The learning happens in two steps: forward propagation and back propagation. In simple words, forward propagation is making a guess about the answer, and back propagation is minimising the error between the actual answer and the guessed answer. The process of updating the weights is continued through multiple iterations to arrive at a decision.

### K Nearest Neighbour Technique

K-nearest neighbours (KNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). A case is classified by a majority vote of its neighbours, meaning that the case is assigned to the most common class amongst its K nearest neighbours as measured by a distance function. Below is a step-by-step procedure to compute K-nearest neighbours.

1. Determine the parameter K, the number of neighbours to be used.
2. Calculate the distance between the query instance (the item to be identified as belonging to a preidentified category) and all the training samples.
3. Sort the distances and determine the nearest neighbours based on the Kth minimum distance.
4. Gather the category 𝛾 of each of the nearest neighbours.
5. Use the simple majority of the categories of the nearest neighbours as the prediction value of the query instance.

The most intuitive nearest neighbour type classifier is the 1-nearest neighbour classifier, which assigns a point x to the class of its closest neighbour in the feature space.

Finally, the choice of a particular classifier for a given situation would depend on the relative performance of the candidates in respect of accuracy, sensitivity and specificity. There are deeper issues involved in the use of all these techniques, and considerable developments have taken place in both theory and programming related to the topic.

--- Jayaraman
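
As an illustration of the five-step K-nearest neighbour procedure described earlier, here is a minimal Python sketch. The function name `knn_classify`, the choice of Euclidean distance, and the toy training points are assumptions made for this example; the text only calls for some distance function.

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    `training` is a list of (features, label) pairs. Euclidean distance is an
    assumption here; any other distance function could be substituted.
    """
    # Step 2: distance between the query instance and every training sample.
    distances = [(math.dist(query, features), label) for features, label in training]
    # Step 3: sort by distance and keep the k nearest neighbours.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 4: gather the categories of the nearest neighbours.
    labels = [label for _, label in nearest]
    # Step 5: simple majority vote (on a tie, the category seen first,
    # i.e. that of the closer neighbour, is returned).
    return Counter(labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups in a 2-feature space.
training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B"),
]

# Step 1: choose K, then classify two query instances.
print(knn_classify((1.1, 1.0), training, k=3))
print(knn_classify((4.9, 5.0), training, k=3))
```

Setting `k=1` in this sketch gives the 1-nearest neighbour classifier mentioned in the text.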