### Working with Python

#### Algorithm Implementations in Python

The algorithms involved in machine learning and data science have two vital types of implementation:

- Classification
- Regression

We will study and analyze some algorithms of both types and understand how they accelerate the process of nurturing data and drawing important insights from it.

#### Linear Regression

Linear regression comes under predictive analysis and is used to find the relationship between two variables: the target variable and the predictor variable. The dependent variable is the target variable, and the independent variable is the predictor variable. Both of these variables are features that already exist in a dataset. The overall concept of regression is to check two things: does the given group of predictor variables do a satisfactory job of predicting the dependent variable? And which variables, in particular, are the real predictors of the dependent variable, and what is their impact on the outcome variable?

Linear regression is represented by a simple equation:

Y = b*x + c

where Y is the dependent (target) variable, x is the independent (predictor) variable, b is the regression coefficient or slope, and c is the constant (intercept).

**The Line of Best Fit**

The line of best fit is a line that demonstrates the correlation between the observed (actual) values and the predicted ones. After applying the linear regression algorithm to our data, we use this line to check how close the predicted values are to the actual ones. The goal is to reduce the distance between those values, known as the error values or residuals. These residuals are symbolized by the vertical lines between the predicted and actual values.
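To make the idea of residuals concrete, here is a minimal sketch (the age/weight values are made up for illustration) that fits a line of best fit with NumPy and computes the residuals:

```python
import numpy as np

# Hypothetical data: age (years) vs. weight (kg)
age = np.array([10, 15, 20, 25, 30], dtype=float)
weight = np.array([32.0, 48.0, 58.0, 66.0, 71.0])

# Fit the line of best fit: weight ≈ b * age + c
b, c = np.polyfit(age, weight, deg=1)

# Residuals are the vertical distances between actual and predicted values
predicted = b * age + c
residuals = weight - predicted

# Sum of squared residuals, SS = Σ [h(x) - y]^2
ss_residual = np.sum(residuals ** 2)
print(b, c, ss_residual)
```

A useful property of a least-squares fit with an intercept is that the residuals sum to zero: the line passes through the "middle" of the data.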
For example, we can see that the weight of a person increases with an increase in their age. The blue line in such a plot represents our line of best fit, which is also known as the regression line. For calculating the distance between the line and the points, we use the following formula:

SS(residual) = Σ [h(x) - y]^2

where h(x) is the predicted value and y is the actual value.

**The Cost Function**

Let us consider an example to understand this concept. The sales department of a company plans to invest some capital to increase its sales over the next 6 months, but it fails to hit its targets and has to incur a loss. To minimize that loss, we use the cost function, which represents and calculates the error of the model:

J(θ0, θ1) = (1 / 2m) * Σ [h(x) - y]^2

where m is the number of rows in the training set.

**Gradient Descent**

Gradient descent is yet another important term, used to find the minimum cost of a function or an equation. It is one of the most widely used optimization algorithms in machine learning and deep learning. Given a convex function, gradient descent makes small tweaks to its parameters iteratively in order to bring the function down to a local minimum. Gradient descent can be imagined as climbing down to the bottom of a mountain, instead of climbing up, because it is a minimization technique.

Code in Python:
```python
# Importing the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Retrieving the dataset
dataset = pd.read_csv('Salary_Data.csv')
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

# Splitting the dataset into the training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/3, random_state=0)

# Performing feature scaling (optional for simple linear regression)
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

# Fitting the Simple Linear Regression model to the training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# Test set results prediction
y_pred = regressor.predict(x_test)
```

#### Logistic Regression

Logistic regression is a statistical technique that comes under classification rather than regression. Like all regression techniques, logistic regression comes under the predictive analysis theory of implementation. Logistic regression is used to describe the structure of data and explain the correlation between a dependent binary variable and one or more nominal independent variables. It is favorable for predicting binary outcomes such as 1/0, yes/no, or true/false, depending on the kind of dataset given and the output required.

Logistic regression can also be considered a special case of linear regression where the outcome variable is categorical and we use the log of odds as the dependent variable. In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function. This type of regression can be characterized by the following quantities:

Odds = p / (1 - p) = probability of event occurring / probability of event not occurring

ln(odds) = ln(p / (1 - p))

logit(p) = ln(p / (1 - p))
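A quick numeric sketch of the odds and logit definitions above (the probability value is chosen arbitrarily):

```python
import math

p = 0.8                  # probability of the event occurring
odds = p / (1 - p)       # probability of occurring over not occurring
logit = math.log(odds)   # natural log of the odds

# The logistic (sigmoid) function inverts the logit,
# recovering the original probability
recovered = 1 / (1 + math.exp(-logit))
print(odds, logit, recovered)
```

Note that the logit is positive exactly when p is above 0.5, which is the property used in the next section.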
In this, p / (1 - p) is the odds ratio. Whenever the log of the odds ratio is positive, the probability of success is higher than 50%. A typical logistic model plot shows that the predicted probability never goes below 0 or above 1.

We can check the performance of this regression by testing it through the following parameters.

**Akaike Information Criterion-** AIC is a measure of fit that penalizes a model for the number of its coefficients. Therefore, we always prefer the model with the minimum AIC value.

**Null Deviance-** Null deviance represents the response predicted by a model with only the intercept. The lower the null deviance, the better the model.

**Residual Deviance-** Residual deviance represents the response predicted by a model on the addition of independent variables. Again, the lower the value, the better the model.

**Confusion Matrix-** The confusion matrix is a tabular representation of actual vs. predicted values. It helps in evaluating the performance of a classification model.

|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |

The accuracy of a model can be calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
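As a small sketch of the accuracy formula, using scikit-learn's `confusion_matrix` on hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for a binary classifier
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For labels {0, 1}, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(tp, tn, fp, fn, accuracy)
```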
**ROC Curve**

The receiver operating characteristic curve, or ROC curve, signifies how well the model can distinguish between two classes by plotting the true positive rate against the false positive rate. A good model will accurately distinguish between the two, whereas a poor model will have difficulty differentiating between them.

Code in Python:

```python
# Importing the necessary libraries
import numpy as np
import pandas as pd

# Retrieving the dataset
train = pd.read_csv('C:/Users/ml/datasets/train.csv')
test = pd.read_csv('C:/Users/ml/datasets/test.csv')
x = train.iloc[:, [2, 4, 5, 6, 7, 9]].values
y = train.iloc[:, 1].values

# Splitting the dataset into the training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Performing feature scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)

# Fitting the Logistic Regression model to the training set
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state=0)
classifier.fit(x_train, y_train)

# Test set results prediction
y_pred = classifier.predict(x_test)

# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
```

#### Support Vector Machines

Support Vector Machines (SVMs) are used to find the hyperplane in an array of data points that best separates the classes in a supervised learning environment. Suppose we have two columns, x and y, consisting of some random data points plotted in a two-dimensional plane. Our motive is to derive a line that separates these points. The line that separates these points horizontally, vertically, or diagonally is known as a hyperplane. The distance between the data points and the hyperplane, known as the margin, determines which hyperplane is appropriate for classifying the points.

SVM supports both regression and classification tasks and can handle multiple continuous and categorical variables. For categorical variables, dummy variables are created with case values of either 0 or 1. Thus, a categorical dependent variable with three levels, say A, B, C, is represented by a set of three dummy variables:
A: {1 0 0}
B: {0 1 0}
C: {0 0 1}

Now that we know what a hyperplane is, the question is how to identify the right one. We can reach a conclusion by considering the following cases.

**Case 1-** There are three hyperplanes in our space: x1, x2, and x3. We need to identify the right hyperplane among the three. x1 and x3 cut through the points, while x2 separates them cleanly. Hence, x2 is our ideal hyperplane.

**Case 2-** The three hyperplanes x1, x2, and x3 all segregate the points quite well, as they are parallel to each other. So how do we identify the right hyperplane in this situation? x1 and x2 are nearer to the points, which means their margins are quite small compared to x3. Hence, x3 has the larger margin and is the ideal hyperplane.

**Case 3-** In the third case, the points of the two classes sit very close to each other in the center of the plane, with little or no room for a straight hyperplane to pass between them. What can we do in such a case? This problem can be dealt with by adding a third axis, the z-axis. If we define z = x^2 + y^2, all the values of z will be positive, since z is the sum of the squares of x and y. Points near the origin get small z values while points far from the origin get large ones, which can make the classes linearly separable in the new space.

Computing such transformations explicitly is not always practical. The kernel trick handles this by converting a non-separable problem (like the scenario above) into a separable one without computing the transformation directly. The functions that do this are called kernels, and they are useful in non-linear separation problems. Simply put, a kernel performs some extremely complex data transformations implicitly and then finds the hyperplane that separates the data based on the labels or outputs that have been defined.
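A minimal sketch of the z = x^2 + y^2 idea from Case 3, using made-up points from two classes (an inner cluster and an outer ring):

```python
import numpy as np

# Hypothetical points: class 0 sits near the origin, class 1 farther out
inner = np.array([[0.5, 0.2], [-0.4, 0.3], [0.1, -0.6]])   # class 0
outer = np.array([[3.0, 0.5], [-2.5, 2.0], [0.8, -3.1]])   # class 1

# No single straight line in the (x, y) plane separates a ring from
# the cluster it surrounds, but the third axis z = x^2 + y^2 does:
z_inner = (inner ** 2).sum(axis=1)
z_outer = (outer ** 2).sum(axis=1)

# Every inner point now has a smaller z than every outer point,
# so a plane of constant z (e.g. z = 2) separates the classes
print(z_inner.max(), z_outer.min())
```

This is exactly what an RBF or polynomial kernel achieves implicitly, without ever materializing the extra coordinate.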
Code in Python:

```python
# Importing the important libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Retrieving the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Performing feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

# Fitting the SVM model to the training set
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(x_train, y_train)

# Test set results prediction
y_pred = classifier.predict(x_test)

# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
```

#### Decision Trees

Decision trees are among the most preferred and favored classification techniques in machine learning. They not only help with prediction analysis but are also a very efficient way to understand the characteristics of various variables. They come under supervised learning algorithms, with a predefined target variable to be determined, and are suited for both categorical and continuous output variables.

The basic functioning of decision trees goes this way: there is a set of points plotted on a plane that can't be separated easily by a single line due to their heterogeneous properties. Decision trees divide these points into different clusters, or leaves, based on some predefined criteria, and handle each of them individually.

There are two different types of decision trees, classified by the type of target variable.

**Binary Variable Decision Tree-** A decision tree with a binary target variable. In this case, the output will be either "yes" or "no".

**Continuous Variable Decision Tree-** A decision tree with a continuous target variable. In this case, the output can be any real value, such as the salary of a person.

Let us go through some of the key terms commonly used in decision trees.

**Root Node-** It represents the entire population or the given sample, and further gets divided into two or more homogeneous sets.

**Splitting-** The division of a node into two or more sub-nodes.
**Decision Node-** A sub-node that splits into further sub-nodes.

**Leaf/Terminal Node-** A node with zero sub-nodes, that is, a node that can't be split further.

**Pruning-** The process of reducing the size of a decision tree by removing nodes.
**Branch/Subtree-** A subsection of a decision tree is called a branch or a sub-tree.

**Parent and Child Node-** A node which is divided into sub-nodes is called the parent node, and the sub-nodes are its children.

There are some important terms that we first need to understand before we can implement decision trees in Python.

**Impurity**

Impurity is a measure of how mixed the classes are within a node; a node is impure when it contains traces of one class within another. Impurity can arise when the tree runs out of attributes on which to split a class any further. In practice, we allow some percentage of impurity in our model for better generalization, which introduces impurity into our humble model!

**Entropy**

Entropy is the degree of disorder of the elements, or in other terms, a measure of impurity. Mathematically, it can be calculated from the probabilities of the items as:

H = -Σ p(x) * log2 p(x)

It is the negative summation of the probability of item x times the log of that probability.

**Information Gain**

Information gain is the main ingredient in the construction of a decision tree. Constructing a decision tree from scratch is all about finding, at each split, the attribute that returns the highest information gain, in order to produce maximum accuracy in the tree.
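The entropy formula and the resulting information gain can be sketched directly (the class-probability values below are arbitrary):

```python
import math

def entropy(probabilities):
    # H = -Σ p(x) * log2 p(x), skipping zero probabilities
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

pure = entropy([1.0])          # a pure node has zero entropy
mixed = entropy([0.5, 0.5])    # a 50/50 split is maximally impure

# Information gain of a split: parent entropy minus the
# weighted average entropy of the children (equal-sized here)
parent = entropy([0.5, 0.5])
left, right = entropy([0.9, 0.1]), entropy([0.1, 0.9])
gain = parent - (0.5 * left + 0.5 * right)
print(pure, mixed, gain)
```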
Therefore:

IG = entropy(parent) - weighted average of entropy(children)

Code in Python:

```python
# Importing the important libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Retrieving the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Performing feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Fitting the Decision Tree Classification model to the training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

# Test set results prediction
y_pred = classifier.predict(X_test)

# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
```

#### Random Forest

The random forest algorithm is another supervised classification algorithm, and is a natural extension of decision trees. There is a correlation between the number of trees in the forest and the results it produces: the higher the number of trees, the better and more accurate the result. Random forests are an ensemble learning technique for classification and regression. A random forest avoids the problem of overfitting as long as there are enough trees in the model. Another advantage is that the random forest classifier can easily manage missing values, and it can also be modeled for categorical values.

**Working**

The working of a random forest has two stages: creating the random forest, and then making predictions and extracting useful observations from the classifier created in the first stage. These are the steps used in the creation of a random forest:

- Select some random "k" features out of the total "m" features, where k is less than m.
- Among the selected "k" features, calculate a node "d" using the best split point.
- Split the node into further nodes using the derived best split.
- Repeat steps 1, 2, and 3 until some "l" number of nodes has been reached.
- Construct the forest by repeating steps 1 to 4 "n" times to create "n" trees.
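The steps above can be sketched with scikit-learn building blocks: each tree is trained on a bootstrap sample using a random subset of k of the m features, and the forest predicts by majority vote. (The data here is randomly generated for illustration, and the class depends only on the first feature.)

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical data: m = 4 features, class determined by feature 0
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)

n_trees, k = 25, 2          # n trees, each seeing k of the m features
trees, subsets = [], []
for _ in range(n_trees):
    rows = rng.integers(0, len(X), size=len(X))           # bootstrap sample
    cols = rng.choice(X.shape[1], size=k, replace=False)  # random k features
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X[rows][:, cols], y[rows])
    trees.append(tree)
    subsets.append(cols)

# The forest predicts by majority vote across its trees
votes = np.array([t.predict(X[:, c]) for t, c in zip(trees, subsets)])
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("training accuracy:", (forest_pred == y).mean())
```

Trees that happen to see the informative feature vote correctly, and their votes dominate the trees that only saw noise; this averaging is what makes the ensemble robust.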
**Applications**

**Stock market-** A random forest can be used to identify the right stocks, which can generate profits for the user most of the time.

**E-commerce-** It can be effective in predicting the products a customer is likely to buy in the future, based on their past choices.
**Banking-** It can recognize defaulters and non-defaulters by analyzing customer behavior through their past records.

Code in Python:

```python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the dataset into the training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

# Fitting the Random Forest Classification model to the training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=7, criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

# Predicting the test set results
y_pred = classifier.predict(x_test)

# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
```

#### K-means Clustering

Clustering is the process of classifying the given data points into a number of groups or classes such that the data points in the same group are similar to each other in terms of features and characteristics. In simple words, k-means segregates points into groups with similar properties and assigns them to clusters.

**Working**

It starts with specifying the desired number of clusters "k". Let us consider k as 2 for five random data points in 2-D space.
Then, we randomly assign each data point to a cluster: three points to cluster 1 (shown in red) and two points to cluster 2 (shown in grey). Next, we compute the centroids of these clusters; the centroid of the red cluster is marked with a red cross, and that of the grey cluster with a grey cross.
Then comes the step of re-assigning each data point to its closest cluster centroid. The data point at the bottom was originally assigned to the red cluster, but it is closer to the centroid of the grey cluster, so we re-assign it to the grey cluster. Finally, we recompute the centroids of both clusters, and the re-assignment and recomputation repeat until the assignments no longer change.
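The assign-and-recompute loop described above can be sketched directly in NumPy (the five 2-D points and the initial assignment are made up for illustration):

```python
import numpy as np

# Five hypothetical points in 2-D space, k = 2 clusters
points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
                   [5.0, 7.0], [3.5, 5.0]])
labels = np.array([0, 0, 1, 1, 0])      # initial random assignment

for _ in range(10):
    # Recompute each cluster's centroid as the mean of its points
    centroids = np.array([points[labels == c].mean(axis=0) for c in (0, 1)])
    # Re-assign every point to its nearest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    new_labels = dists.argmin(axis=1)
    if np.array_equal(new_labels, labels):   # converged
        break
    labels = new_labels

print(labels)
```

On this toy data, the last point migrates from cluster 0 to cluster 1 during the first pass, and the assignment stabilizes on the next one.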
Feature engineering is the process of using domain knowledge and expertise to choose which data variables to input as features before building a machine learning model. Feature engineering plays a key role in k-means clustering: using meaningful features that capture the variability and essence of the data is essential before feeding the selected features to k-means. Feature transformations are conducted, particularly to represent rates rather than raw measurements, which helps normalize the data. At times, such engineering can remove a large share of the error in a dataset, and it is effective in maintaining the accuracy of the machine learning model built on top of the data.

Code in Python:

```python
# Importing the required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Retrieving the dataset
dataset = pd.read_csv('customers.csv')
x = dataset.iloc[:, [3, 4]].values

# Finding the optimal number of clusters with the elbow method
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

# Visualising the results using plots
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

# Fitting the K-Means algorithm to our dataset
kmeans = KMeans(n_clusters=10, init='k-means++', random_state=55)
y_kmeans = kmeans.fit_predict(x)
```

#### K-nearest Neighbors (K-NN)

K-nearest neighbors can be considered for both classification and regression problems. A KNN model classifies points by looking at homogeneous groups of data points, or in this case features of a dataset: the data points within a group are all similar to each other and sit together. When a new point is introduced into the plane, it is classified based on the characteristics it shares with the closest homogeneous group or class. It is a non-parametric approach, meaning it does not assume that the data follows a particular distribution such as the normal distribution. It is also referred to as a lazy classification model, which predicts classes based on the features of the matching observations.

Selecting the number of nearest neighbors, that is, the value of k, plays a significant role in determining the capacity of our model. The choice of k determines how well the data can be used to generalize the results of the KNN algorithm. A large k-value reduces the variance caused by noisy data, but introduces a bias that may cause the model to overlook smaller patterns in the data.

The distance between data points in the plane can be calculated with the following techniques.

**Euclidean Distance-** Euclidean distance is the square root of the sum of the squared differences between the coordinates of a new point (x) and an existing point (y).
ED = √(Σ (x_i - y_i)^2)

**Manhattan Distance-** Manhattan distance is the distance between vectors using the sum of their absolute differences.

MD = Σ |x_i - y_i|
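Both distance formulas above can be sketched in a few lines (the two points are arbitrary):

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

a, b = (1.0, 2.0), (4.0, 6.0)
print(euclidean(a, b))   # 3-4-5 triangle: 5.0
print(manhattan(a, b))   # 3 + 4 = 7.0
```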
**Hamming Distance-** Hamming distance is used for categorical variables. For each coordinate, the distance d is 0 if the values x and y are the same and 1 if they differ:

HD = Σ d_i, where d_i = 0 if x_i = y_i and d_i = 1 if x_i ≠ y_i

KNN is mostly used for search: it finds the items nearest to a customer's interests. It can also be used for building recommender systems, finding similar items based on a user's personal taste or preferences. Compared to SVMs or neural networks, however, the KNN algorithm is often not preferred, as it runs slower at prediction time.

Code in Python:

```python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the data into the training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

# Fitting our K-nearest neighbors model to the training data
# (minkowski with p = 2 is the Euclidean distance, suitable for
# the scaled continuous features used here)
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

# Test data result prediction
y_pred = classifier.predict(x_test)

# Creating the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
```

#### Naive Bayes

Naive Bayes is a basic technique for building classifiers. These models assign class labels to problem instances, represented as vectors of features. It is part of a family of classification techniques
based on Bayes' theorem, with the assumption that the predictor variables are independent of each other. In plain terms, a naive Bayes classifier calculates the probability of the outcome assuming that the presence of a defining feature in a class is unrelated to the presence of any other feature. For instance, a knife may be considered to have features like sharpness, being made of stainless steel, and a size of 20 inches. These features do not depend on each other for their existence. Similarly, a naive Bayes approach takes all of the properties of each variable to contribute independently to the probability.

Naive Bayes classifiers need to be trained effectively in a supervised learning setting for different sorts of probability models. In many practical applications, parameter estimation for naive Bayes models relies on maximum likelihood, which means that one can work with the naive Bayes model without computing a full Bayesian posterior over the parameters or using any other Bayesian methods.

Bayes' theorem gives:

P(c|x) = P(x|c) * P(c) / P(x)

where P(c|x) is the posterior probability of the class (target) given the predictor x (the features), P(c) is the prior probability of the class, P(x|c) is the likelihood, which is the probability of the predictor given the class, and P(x) is the prior probability of the predictor.
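A tiny numeric sketch of Bayes' theorem (the probabilities are invented for illustration): suppose 1% of emails are spam, a certain keyword appears in 90% of spam and 5% of non-spam, and we want P(spam | keyword):

```python
# Hypothetical probabilities for a spam-filter example
p_spam = 0.01                 # P(c): prior probability of spam
p_kw_given_spam = 0.90        # P(x|c): likelihood of the keyword in spam
p_kw_given_ham = 0.05         # likelihood of the keyword in non-spam

# P(x): total probability of seeing the keyword at all
p_kw = p_kw_given_spam * p_spam + p_kw_given_ham * (1 - p_spam)

# Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)
p_spam_given_kw = p_kw_given_spam * p_spam / p_kw
print(round(p_spam_given_kw, 3))
```

Even a strong indicator yields a modest posterior here because the prior for spam is so small, which is exactly the effect the P(c) term captures.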
Code in Python:

```python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('SN_Ads.csv')
x = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values

# Splitting the data into the training and test set
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)

# Performing feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

# Fitting the Naive Bayes model to the training data
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(x_train, y_train)

# Predicting the test set results
y_pred = classifier.predict(x_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
```