Nearest Neighbor and Men's Suits

Jayda Dagdelen, Nishani Siriwardane, Daniel Yehuda
Introduction

The goal of this paper is to examine the effectiveness of machine learning/prediction technologies in making a simple daily decision. One of the first decisions we make every day is choosing what to wear. In this paper, we evaluate the effectiveness of the well-known nearest neighbor algorithm in aiding humans with this decision. We designed an MS Windows application that takes as input an outfit descriptor in the form of garment colors and gives as output one of three possible ratings: bad, mediocre, or good. We focused on men's suits, as they are governed by a more or less strict rule-set that we thought could perhaps be modeled by a computer. Women's clothing tends to be less conservative in this regard.

Machine Learning

The problem of data classification/prediction has been one of the important elements in the growing fields of Artificial Intelligence (AI) and machine learning. Everything from intelligent robots to email spam detection uses some form of data classification to aid a decision-making process. The problem is set up as follows: given a set of inputs that represent some data point, suggest an output (or classification) based on some knowledge-set. For example, the robot mentioned above may take some form of its current visual data as input to a learning algorithm and base its next move (i.e., a left or right turn) on the algorithm's output. Spam detection is a good example of more straightforward classification: given an email (a set of words which act as inputs), classify it on an integer scale between 0 (most probably legitimate)
and 10 (most probably spam). In both cases, some algorithm with a predefined knowledge-set returns a prediction based on its input (and this knowledge).

Perhaps the most important feature of any classification algorithm that falls under the realm of machine learning is the ability to build a knowledge-base from some training data set, or in other words, to learn. In terms of the spam example, a training set would encompass perhaps thousands of emails which are pre-classified (by a human) into the different score groups. Using a method known as supervised learning, the algorithm parses all of these input/output pairs and attempts to "learn" the function that appropriately maps the input vectors onto their corresponding outputs. Whichever learning method is used, the algorithm builds knowledge based on the training set that can later be applied to other, unclassified vectors. Though other types of learning are possible, including transduction, which evaluates its previous experiences to learn its own bias, we will concentrate on the simple supervised learning method outlined above.[1]

The Nearest Neighbor Algorithm

In the realm of supervised learning algorithms, there are many options. Neural network and Support Vector Machine (SVM) systems are some of the more complicated and advanced ones, which have been successfully implemented and enjoy widespread use in industry. However, another popular and often very effective classifier is the simple nearest neighbor algorithm. Though quite memory-intensive, as it maintains a list of all previously trained vectors and their classifications, it performs just as well as the others in many of its applications. Because of its simplicity, its often comparable performance, and our relatively tiny data-set, we chose to use it as our classifier. We were also interested to see just how well a simple algorithm would approximate the human "taste function."

Although the nearest neighbor algorithm has both geometric[2] and classification applications, we will be concentrating on the latter. A good example of its usage in classification is the prediction of individuals' political party affiliations. With input data such as age, education level, income level, and gender (all grouped together to form a d-dimensional vector), the algorithm can be used to predict the party of the person represented by the inputs. Each person in the data set is represented by a party-affiliated point in d-dimensional space. The classifier determines the party affiliation of a new person by assigning it the affiliation of its nearest neighbor.[3] The following process is employed to do this: the geometric distance from the new data point to each element of the set of classified points is calculated. The shortest distance indicates the nearest neighbor, and the class (in this case, party affiliation) of that neighbor is assigned to the new data point.[4] Again, it is important to note that the knowledge-base of the nearest neighbor algorithm is no more intricate than the entire set of points that have already been classified during some previous phase. The classification of these points constitutes the training, or supervised learning, phase of the algorithm, whereas we will refer to the prediction of new points simply as the prediction phase.

[1] Machine learning, Wikipedia. <http://en.wikipedia.org/wiki/Machine_learning>
[2] A classic example of its usage in geometry would be emergency dispatching. Given the location of a fire, the dispatcher finds the closest firehouse on a map and dispatches the vehicles from there.
[3] Nearest Neighbor Search. <http://www2.toki.or.id/book/AlgDesignManual/BOOK/BOOK4/NODE188.HTM>
[4] Nearest neighbor (pattern recognition), Wikipedia. <http://en.wikipedia.org/wiki/Neares_neighbor_%28pattern_recognition%29>
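The training/prediction process just described can be sketched in a few lines of C++. This is a minimal illustration only, not the application code (which appears in the appendix); the two-dimensional data and labels below are made up for the example:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal brute-force 1-nearest-neighbor classifier. The "knowledge-base"
// is nothing more than the stored training vectors and their labels.
struct LabeledPoint {
    std::vector<double> x; // input vector (e.g., age, education level, ...)
    int label;             // class (e.g., 0 = party A, 1 = party B)
};

// Squared Euclidean distance; the square root is omitted because it does
// not change which neighbor is nearest.
double sqDist(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double t = a[i] - b[i];
        s += t * t;
    }
    return s;
}

// Prediction phase: scan every stored point, return the closest one's label.
int predict1NN(const std::vector<LabeledPoint>& train,
               const std::vector<double>& query) {
    int best = train[0].label;
    double bestDist = sqDist(train[0].x, query);
    for (const LabeledPoint& p : train) {
        double d = sqDist(p.x, query);
        if (d < bestDist) { bestDist = d; best = p.label; }
    }
    return best;
}
```

The "training" here is simply storing the labeled points, which is exactly why the method is memory-intensive: the whole training set must be kept and scanned at prediction time.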
The prediction phase of the algorithm is the part in which the actual knowledge (the database of classified vectors) that the system already has is used to make statistically educated guesses as to the appropriate classification of new vectors. The nature of the algorithm, however, makes this part very time-consuming: in the brute-force implementation, some constant amount of computing time, C, must be spent comparing the new input vector to each of the n vectors in the database. The result is an algorithm whose running time is linearly proportional to the size of the database. To combat this problem, various optimizations, such as specialized trees that organize the pre-classified data, have been developed; these drastically reduce the number of distances that need to be computed. Such methods partition the geometric space and compute only the distances within specified limits.[5]

Alternative Approaches

In the realm of nearest neighbor, there are a variety of other approaches which deserve some attention. Firstly, it is important to note that in practice a common variant known as k-nearest neighbor is often employed, in which k data points are used to estimate the output of the new input data point. To highlight its effectiveness, we will examine the following example, which maps 1-dimensional vectors to their classifications:

Input:  0.0  1.0  1.7  2.5  3.0  3.5  4.0  5.0  6.0  7.0
Output:  D    D    D    R    R    D    R    R    R    R

An input such as 0.6 would be classified as D with the simple nearest neighbor algorithm. When the k-nearest neighbor algorithm is applied with k = 2 or 3, it would still be classified as D. However, determining the output of an input such as 3.7 with the k-nearest neighbor algorithm is more difficult. With the simple nearest neighbor algorithm, the output would be D. When k = 2, the two closest neighbors are D and R, and do not belong to the same class. When k = 3, two out of the three nearest neighbors are R; therefore the classification is R. When k = 10, all the neighbors in the set are taken into account, and the classification is R.[6]

Unlike in the simple nearest neighbor method, in the k-nearest neighbor method the calculation of errors becomes important as well. The value of k should be chosen such that the prediction error is minimized. Calculating the prediction error requires a loss function. Loss functions take the truth and the prediction as input and produce 0 when the two match, and large values when the truth and the prediction are far apart.[7] Though more complicated, the use of k-nearest neighbor in our implementation may have proved more effective.

Another option we considered was a slightly less conventional model, in which the system would be trained only on good outfits. In such a model, the final rating of an outfit would be some decreasing function of the measured distance from the nearest neighbor, rather than simply its classification. Yet another option is to use the traditional model, but with different weights on each of the seven items of the suit. The purpose of this would be to avoid some of the problems previously outlined.

[5] Nearest neighbor (pattern recognition), Wikipedia. <http://en.wikipedia.org/wiki/Neares_neighbor_%28pattern_recognition%29>
[6] Kth Nearest Neighbor Classification: Introduction. <http://stat-www.berkeley.edu/users/nolan/stat133/Fall04/lectures/KNN.pdf>
[7] Cross Validation. <http://stat-www.berkeley.edu/users/nolan/stat133/Fall04/lectures/CV.pdf>
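The 1-dimensional worked example above can be reproduced directly in code. This is a sketch, not part of our application; in particular, the tie-breaking rule for even k (fall back to the single nearest neighbor) is our own assumption, since the k = 2 case is left unresolved above:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// k-nearest-neighbor on 1-D data: find the k closest training points to
// the query and return the majority label ('D' or 'R').
char predictKNN(const std::vector<double>& xs, const std::vector<char>& ys,
                double query, int k) {
    // Pair each training point with its distance to the query, then sort
    // so that the k nearest points come first.
    std::vector<std::pair<double, char>> byDist;
    for (std::size_t i = 0; i < xs.size(); ++i)
        byDist.push_back({std::fabs(xs[i] - query), ys[i]});
    std::sort(byDist.begin(), byDist.end());

    // Majority vote among the k nearest. On a tie (possible for even k),
    // fall back to the single nearest neighbor's label -- one simple
    // tie-breaking convention among several.
    int dCount = 0, rCount = 0;
    for (int i = 0; i < k; ++i)
        (byDist[i].second == 'D' ? dCount : rCount)++;
    if (dCount == rCount) return byDist[0].second;
    return dCount > rCount ? 'D' : 'R';
}
```

With the data set from the example, k = 1 classifies 3.7 as D (its nearest neighbor is 3.5), while k = 3 and k = 10 classify it as R, matching the discussion above.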
There is also the alternative of using a completely different algorithm, perhaps not even under the umbrella of machine learning. One such algorithm could rate suits based on a knowledge-set which simply describes the weights of, and required correlations between, different elements of the suit. Such an algorithm would, for example, assign a value rating the matchability between different elements of the suit (jacket/pants, shirt/tie, etc.) and then use these values in determining the score. There are of course many other alternatives, both within and outside the realm of machine learning. We hoped only to scratch the surface of what we thought could be an interesting way to help humans with a simple everyday decision.

Implementation

As previously mentioned, we chose to limit the scope of our project to men's suits. Women's outfits vary widely in shape, style, color, cut, and cloth, elements that would make a program that evaluates women's outfits too complicated for a project of this size. Men's suits are more standard in shape and style, consisting only of pants, socks, shoes, shirts, jackets, ties, and belts. We assumed that the main criterion for the evaluation of men's outfits is garment color; it is the most important element used by humans in determining whether a set of suit elements is a good "match." These decisions allow almost any men's suit to be represented simply as a list of its seven garment/accessory colors.

With the above in mind, the problem of assessing an outfit as bad, mediocre, or good can essentially be thought of as one of prediction. In terms of machine learning, some system could be trained on various sets of seven-color combinations, each associated with some rating (bad, mediocre, good), and then queried with new color combinations for a prediction response. For nearest neighbor, the same methodology applies. There is, however, an important aspect which needed consideration: how exactly to represent each color in terms the algorithm can understand. Although the nearest neighbor algorithm can be implemented to work with discrete data (as in the party-affiliation problem discussed earlier, in which inputs such as gender are discrete) and with distance functions suited to such data, color is anything but discrete. Color is a continuous spectrum on which humans can often measure some type of distance. In other words, given three colors, we can usually group two of them as being "closest" to each other. It is this very measure of distance that the nearest neighbor algorithm relies on to match certain color groupings with others.

A natural way of attacking this problem is to map each of the possible colors (~16.7 million on most computers today) to a number and then use the standard Euclidean distance function as a measure of closeness. However, who is to say which colors should be close to each other on the number line? A somewhat artificial but more logical approach is to break each color down into some other representation. In our case, we chose to represent each color as the intensity levels of the three primary colors red, green, and blue. (Each of the primary colors can take 256 intensities, so by adjusting them appropriately it is possible to come up with 256^3 ≈ 16.7 million colors.) The result is a system which maps 21-dimensional input vectors (3 primary-color intensities for each of the seven garment colors) to one of three rating categories (bad, mediocre, good). Though this decision triples our vector size, it organizes the colors in an ordering in which, at least at some level, the distance between colors can be measured by a Euclidean function.

The following is a screenshot of the developed application:

[Screenshot of the application window omitted.]

The window allows for the selection of a dataset (a pre-classified knowledge-base) and the setting of any of the seven garment/accessory colors. The "Predict!" button runs the nearest neighbor algorithm on the 21-dimensional input vector corresponding to the chosen colors and outputs the response in the text field (in the pictured example, according to the knowledge-base, the given outfit is predicted to be a "good" one). The "Add Datapoint" button is used to add a combination and rating to the currently open dataset; the slider above it can be set to any of the three ratings (bad being leftmost).
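The flattening of a seven-garment outfit into a 21-dimensional RGB vector, and the Euclidean distance between two such vectors, can be sketched as follows. This is an illustration of the representation, not our application code; the garment ordering in the comment is one possible convention:

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// An outfit is seven garment colors (e.g., pants, socks, shoes, shirt,
// jacket, tie, belt). Each color is an RGB triple of intensities 0-255,
// so a whole outfit flattens into one 21-dimensional vector.
using RGB = std::array<int, 3>;

std::vector<double> toVector(const std::array<RGB, 7>& outfit) {
    std::vector<double> v;
    for (const RGB& c : outfit)
        for (int channel : c)
            v.push_back(static_cast<double>(channel));
    return v; // 21 components: 3 channels x 7 garments
}

// Euclidean distance between two outfits in the 21-D color space.
double outfitDistance(const std::array<RGB, 7>& a,
                      const std::array<RGB, 7>& b) {
    std::vector<double> va = toVector(a), vb = toVector(b);
    double s = 0.0;
    for (std::size_t i = 0; i < va.size(); ++i) {
        double t = va[i] - vb[i];
        s += t * t;
    }
    return std::sqrt(s);
}
```

Note that under this distance, changing one garment's color by a small amount moves the outfit only slightly in the 21-D space, which is exactly the "measurability" property the RGB breakdown is meant to provide.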
Methodology

To train the program, the first step involved designing sets of color combinations for the suit and rating each as good, mediocre, or bad. We chose 35 outfits for each classification. The good outfits were chosen by browsing online men's advertisements and finding the latest fashions. The mediocre outfits were created by using our own tastes to degrade the good outfits to merely okay ones. Finally, the bad outfits were set by randomly choosing ridiculous color combinations that we thought would be tasteless.

To test the success of the nearest neighbor algorithm in suit matching, it was necessary to create a testing data set consisting of outfits already rated by a human, and then to compare how the program rated them. The test data set consisted of thirty different outfits, of which one third were bad, one third good, and the remaining third mediocre. These outfits were chosen by a member of our group who was not involved in training the program, so that the results would not be too biased. The thirty test outfits were input into the program, and the category the program assigned to each outfit was recorded. The success of the program was measured by assigning a score of 1 if the program rated an outfit from the test data in the same category as the human had, and a score of 0 if the program rated it differently. Note that no weight was placed on how "wrong" the program was in rating an outfit. For example, if the human assigned an outfit to the good category but the computer assigned it to either the mediocre or the bad one, the result would receive the same score of 0, even though a computer rating of mediocre is closer to being "correct."

We designed two experiments to determine some factors that affected the results of the nearest neighbor algorithm. Our first experiment tested the hypothesis that the larger the size of the training data, the more accurate the algorithm would be in predicting a "correctly" matched outfit. We trained the program with two different data sets: one consisting of 135 outfits and the other consisting of only 68. The 68 outfits were chosen by including only every other outfit from the larger training set. We then input the 30 test outfits with each of the two training data sets and compared the scores. See the Results section.

The second experiment was, in essence, a repeat of the above with one important change. As mentioned before, the decision to use the RGB color representation scheme was somewhat arbitrary; this scheme is simply the most common one used by computers and offers at least some level of color-difference "measurability." There is another common scheme which some might say more closely models the human color perception continuum: HSL. With HSL, each color is also broken down into three numerical descriptors, hue, saturation, and lightness, all measured as some percentage of a maximum value. Hue and saturation describe qualitative differences between colors, while lightness describes the quantitative difference in their brightness.[8] In the second experiment, the same two training sets (of sizes 135 and 68, respectively) were converted into the HSL representation. The same was done with the 30 test outfits, and the training/prediction was repeated. The results follow below.

[8] Color. <http://encarta.msn.com/text_761577547__1/Color.html>
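One standard way to convert an RGB triple into HSL can be sketched as below. The paper does not give its conversion formulas, so this is our assumption of a conventional implementation, with hue in degrees and saturation and lightness as fractions:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

// One standard RGB -> HSL conversion (a sketch; the report does not state
// its exact formulas). Inputs are 0-255 channel intensities; hue is in
// degrees [0, 360), saturation and lightness are fractions in [0, 1].
struct HSL { double h, s, l; };

HSL rgbToHsl(int ri, int gi, int bi) {
    double r = ri / 255.0, g = gi / 255.0, b = bi / 255.0;
    double mx = std::max({r, g, b});
    double mn = std::min({r, g, b});
    double l = (mx + mn) / 2.0;                 // lightness: midpoint of range
    if (mx == mn) return {0.0, 0.0, l};        // achromatic: hue undefined, use 0
    double d = mx - mn;
    double s = d / (1.0 - std::fabs(2.0 * l - 1.0));
    double h;
    if (mx == r)      h = std::fmod((g - b) / d + 6.0, 6.0);
    else if (mx == g) h = (b - r) / d + 2.0;
    else              h = (r - g) / d + 4.0;
    return {h * 60.0, s, l};
}
```

Applying this per channel triple turns a 21-dimensional RGB outfit vector into a 21-dimensional HSL one, which is what the second experiment requires.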
Results

Effects of the Size of the Training Data

When the program was trained with 135 different outfits, it incorrectly categorized 36.7% of the 30 human-categorized test outfits. ("Incorrect" denotes that the computer did not place the outfit in the same category as the human.) Statistically, with 95% confidence, this implies that with 135 different outfits in its knowledge-base, the algorithm will incorrectly categorize outfits 19.4%-53.9% of the time. When the program was trained on only 68 different outfits, it incorrectly categorized the outfits 50% of the time. With 95% confidence, with only 68 outfits in its knowledge-base, the program incorrectly categorizes outfits 32.1%-67.9% of the time.

As we hypothesized, when there is less training data, the nearest neighbor algorithm is less accurate in its predictions. The more data points in its knowledge-base, the higher the chance that some new input will have a nearest neighbor that is "closer." With too few data points, a new vector's closest neighbor may be quite far away on the color continuum and thus be too different to trust as a member of the same classification group.

Effects of a Different Color Representation

Replicating the same experiments with the HSL color representation scheme, we found only negligible performance differences. For the trial with the 135-outfit training set, the error rate is 36.7%, with a 95% confidence interval of 19.4%-53.9%. The results were identical for the trial with only 68 outfits.
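The confidence intervals above can be reproduced with the normal-approximation interval for a proportion, p ± 1.96·sqrt(p(1-p)/n). The report does not name its method, but this standard formula matches the reported bounds for both trials (11 and 15 errors out of 30, respectively), so we sketch it here as an assumed reconstruction:

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Normal-approximation ("Wald") 95% confidence interval for an error
// proportion: p +/- 1.96 * sqrt(p * (1 - p) / n). This reconstruction is
// our assumption; it reproduces the intervals reported in the Results.
std::pair<double, double> errorCI95(int errors, int n) {
    double p = static_cast<double>(errors) / n;
    double half = 1.96 * std::sqrt(p * (1.0 - p) / n);
    return {p - half, p + half};
}
```

For 11/30 errors (36.7%) this gives roughly 19.4%-53.9%, and for 15/30 errors (50%) roughly 32.1%-67.9%, matching the figures above.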
Conclusion of Results

The results outlined above suggest that, with enough data points, the nearest neighbor learning algorithm we implemented does decently in terms of agreeing with another human's classification of outfits. Though the difference in error rates between our two trials (in RGB) may not be statistically significant, the literature on nearest neighbor, and on machine learning in general, does support this conjecture. It is important to note that in our trials the mean error rate was significantly less than 66.7%, the expected error rate of a random classifier. Furthermore, the upper bound of the confidence interval of the RGB trial with the larger dataset is still less than this number.

As for the results from the HSL representation, we see no improvement on the trial with the larger dataset and only a slight improvement on the trial with the smaller dataset. We can therefore draw no statistically significant conclusions about the most appropriate color representation model for use in such an algorithm. It is very possible, however, that with a larger dataset and more than a single trial, some conclusions could be reached on this question.

Conclusion: Discussion

We have shown that the simple nearest neighbor algorithm performs relatively well in rating what we will call the "matchability" of outfits based solely on color. We have also demonstrated the use of an alternate color representation scheme and its effect on the algorithm's performance. However, the question of where and under what conditions the algorithm fails still remains.
In developing and testing the algorithm, we came to understand its true limitations in terms of real-world application. These limitations stem mainly from the fact that the algorithm, in and of itself, simply does what it says it does: it finds the nearest neighbor and assigns that neighbor's classification to the new outfit. It follows that in cases in which a new outfit matches an existing one perfectly in all dimensions except, for example, jacket color, the algorithm will almost surely rate the outfit according to its near-perfect match. But herein lies the problem: a "perfectly" matching outfit immediately goes from good to quite bad the moment the color of a major piece of the outfit, for example the jacket, is changed to a ridiculous color. The nearest neighbor algorithm inherently cannot understand this and so often fails in evaluating such outfits. With more appropriately trained data points in the region, it might perform better.

Another major problem with the algorithm is its lack of a true understanding of how humans tend to rate an outfit. Namely, it fails to weight and correlate different elements of the suit. For example, while the matching of the jacket and pants is essential, there is often much more leeway with tie color. The basic algorithm, however, gives these two dimensions the same weight/importance in computing distances and so fails on this front. Nearest neighbor's failure to capture correlation is best illustrated by the rating given to a very well-matched (at least by our standards) but rather colorful suit: because our training set consists only of more conservative/traditional suits, the algorithm ends up classifying the suit as bad.
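The per-garment weighting idea mentioned above (and earlier, under Alternative Approaches) amounts to replacing the plain Euclidean distance with a weighted one. We did not implement this; the sketch below shows the form it would take, with the weight values purely illustrative:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Weighted squared Euclidean distance: each dimension carries a weight, so
// (for example) the three jacket channels of a 21-D outfit vector can count
// more than the three tie channels. The weights are an illustrative
// assumption; the report proposes the idea without choosing values.
double weightedSqDist(const std::vector<double>& x,
                      const std::vector<double>& y,
                      const std::vector<double>& w) {
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        double t = x[i] - y[i];
        s += w[i] * t * t;
    }
    return s;
}
```

With all weights equal to 1 this reduces to the unweighted distance the implemented classifier uses; raising the jacket and pants weights relative to the tie's would let mismatches in the essential pieces dominate the neighbor search.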
APPENDIX

Nearest Neighbor Core Functions:

#include "stdafx.h"
#include "NearestNeighbor.h"
#include <math.h>
#include <queue>
#include <string>
#include <sstream>

using namespace std;

NearestNeighbor::NearestNeighbor(DataSet *d_local, int k_local)
    : d(d_local), k(k_local)
{
    standardize();
}

NearestNeighbor::~NearestNeighbor(void)
{
}

/* Returns the squared Euclidean distance between two vectors, x and y,
   which are assumed to be standardized already and to both have the
   dimension of vector x. */
double NearestNeighbor::distance(vector<double> &x, vector<double> &y)
{
    double sum = 0; // renamed from 'd' to avoid shadowing the member d
    for (size_t i = 0; i < x.size(); i++) {
        double t = x[i] - y[i];
        sum += t * t;
    }
    // Return the squared distance for speed; the real distance is not
    // needed, since it is used for comparison only.
    return sum;
}

void NearestNeighbor::standardize()
{
    vector<vector<int> > &input = d->trainEx;
    int numAttrs = d->numAttrs;
    int numExs = d->numTrainExs;

    // Record the mean of each attribute.
    vector<double> mean(numAttrs);
    for (int i = 0; i < numExs; i++)
        for (int j = 0; j < numAttrs; j++)
            mean[j] += (double)input[i][j];
    for (int i = 0; i < numAttrs; i++)
        mean[i] /= (double)numExs;

    // Record the standard deviation of each attribute.
    stdev.resize(numAttrs);
    for (int i = 0; i < numExs; i++) {
        for (int j = 0; j < numAttrs; j++) {
            double t = (double)input[i][j] - mean[j];
            stdev[j] += t * t;
        }
    }
    for (int i = 0; i < numAttrs; i++) {
        stdev[i] /= (double)numExs;
        stdev[i] = sqrt(stdev[i]);
    }

    // Scale each attribute by its standard deviation. (The mean is not
    // subtracted; a common shift does not change pairwise distances.)
    data.resize(numExs);
    for (int i = 0; i < numExs; i++) {
        data[i].resize(numAttrs);
        for (int j = 0; j < numAttrs; j++)
            if (stdev[j] != 0)
                data[i][j] = (double)input[i][j] / stdev[j];
    }
}

int NearestNeighbor::predict(vector<int> &ex)
{
    int numAttrs = d->numAttrs;
    vector<double> dex(numAttrs);

    // Standardize the query vector using the training stdevs.
    for (int i = 0; i < numAttrs; i++)
        if (stdev[i] != 0)
            dex[i] = (double)ex[i] / stdev[i];

    // Brute-force scan for the closest training vector.
    double bestDist = distance(data[0], dex);
    int bestIndex = 0;
    for (int i = 1; i < d->numTrainExs; i++) {
        double dist = distance(data[i], dex);
        if (dist < bestDist) {
            bestDist = dist;
            bestIndex = i;
        }
    }
    return d->trainLabel[bestIndex];
}
BIBLIOGRAPHY

Cover, T. M., and P. E. Hart. "Nearest Neighbor Pattern Classification." IEEE Transactions on Information Theory, Vol. IT-13, No. 1, January 1967.

Gooda, Abdel-Hamid. "Application of the Techniques of Data Compression and Nearest Neighbor Classification to Information Retrieval." 2002.

Nayar, Shree K., and Sameer A. Nene. "A Simple Algorithm for Nearest Neighbor Search in High Dimensions." IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 9, September 1997.

Pace, R. Kelley, and Dongya Zou. "Closed-Form Maximum Likelihood Estimates of Nearest Neighbor Spatial Dependence." Geographical Analysis, Vol. 32, No. 2, April 2000.

Yau, Hung-Chun, and Michael T. Manry. "Iterative Improvement of a Nearest Neighbor Classifier."

Color. <http://encarta.msn.com/text_761577547__1/Color.html>

Cross Validation. <http://stat-www.berkeley.edu/users/nolan/stat133/Fall04/lectures/CV.pdf>

Kth Nearest Neighbor Classification: Introduction. <http://stat-www.berkeley.edu/users/nolan/stat133/Fall04/lectures/KNN.pdf>

Machine learning, Wikipedia. <http://en.wikipedia.org/wiki/Machine_learning>

Nearest Neighbor Search. <http://www2.toki.or.id/book/AlgDesignManual/BOOK/BOOK4/NODE188.HTM>

Nearest neighbor (pattern recognition), Wikipedia. <http://en.wikipedia.org/wiki/Neares_neighbor_%28pattern_recognition%29>

Nearest Neighbor Search. <http://www.cs.sunysb.edu/~algorith/files/nearest-neighbor.shtml>