
machinelearningengineeringslideshare-160909192132 (1).pdf



  1. APPLICATIONS IN MACHINE LEARNING Joel Graff, P.E. Image source: Wikipedia
  2. Overview  Neural Networks (How machines can learn)  Collecting & Modeling Data (How to do machine learning)  Case Studies (What it really looks like)  So What? (How do I really use this?) To fully appreciate the many components of machine learning, we will explore the topic in four key areas.
  3. Types of Applications • Regression (estimating or predicting a real value) • Classification (classifying as true/false or 1-of-n classes) • Optimization (scheduling / process analysis) • Performance assessment Image by Julian Nitzsche, August 2007, CC BY-SA-3.0 (modified) https://commons.wikimedia.org/wiki/File:Visegrad_Drina_Bridge_1.jpg Each problem type is substantially unique and requires a specific approach. In many cases, unique machine learning algorithms and techniques have been developed to address these problem types.
  4. Neural Networks “The Neural Network” by rajsegar CC License BY-NC-ND http://rajasegar.deviantart.com/art/The-Neural-Network-177904377
  5. Neural Networks Simulates brain physiology • Neurons and synapses • Pattern recognition Classification / Regression • Disease classification • Stock price prediction Noisy / Complex data • Missing, incorrect, or irrelevant information • Linear / non-linear
  6. Regression Classification Problem Type / Complexity Matrix It is important to understand the relationship between problem type and problem complexity in the early stages of a machine learning problem. Key questions: • Is the data directly related? • Are we trying to estimate a real value or just a yes/no or categorical classification?
  7. Regression Classification Problem Type / Complexity Matrix Non-Linear Regression is required where the relationship between the variables cannot be approximated with a straight line. Data sets often fall into this category. Linear Regression (“line of best fit”) is arguably the simplest problem to solve. The variables are linearly (directly) related, allowing them to be solved using common linear algebra techniques.
  8. Regression Classification Problem Type / Complexity Matrix Classification has one key distinction from regression. Where regression seeks a line which best fits the data, classification seeks a boundary which best separates the data. Non-linear classification problems may have unique characteristics, like one class which is entirely contained within another. Here, a single straight line cannot separate them - a curved boundary is required.
  9. Artificial Neural Network (Linear) x1 x2 y Wx1 Wx2 This simple visual structure represents a basic, linear neural network. It contains the key components of its biological forebear: neurons and their synaptic interconnections. One can also look at this as the visual representation of a mathematical function, y = f(x), with the weighted connections (Wx1 and Wx2) describing the relationship between x and y.
  10. Artificial Neural Network (Linear) x1 x2 y Wx1 Wx2 Input Layer Output Layer Fully-connected feed-forward network Fully-connected The neurons in each layer are connected to every neuron in the following layer. Feed-forward Computation begins at left with the input and terminates at right with the result.
  11. Artificial Neural Network (Linear) x1 x2 y Wx1 Wx2 The equation expressed in summation notation may be literally described as: “The sum of the inputs multiplied by their weighted connections, then divided by the total number of inputs.” In other words, it’s the description of a weighted average.
  12. Artificial Neural Network (Linear) Simple Average: inputs x1 = 2 and x2 = 4 with weights Wx1 = Wx2 = 1 give 2 X 1 = 2 and 4 X 1 = 4, so y = (2 + 4) / 2 = 3.
  13. Artificial Neural Network (Linear) Simple Average: (2 + 4) / 2 Calculating a simple average is straightforward. With weights given a value of 1, we are effectively saying that every value in the average is equally important. The end result will always be the simple average of any two inputs provided at left. But suppose we want to do something more interesting? Suppose we want to calculate the sum, rather than the average? To do that, we only need to change the value of the weights…
  14. Artificial Neural Network (Linear) Summation: inputs x1 = 2 and x2 = 4 with weights Wx1 = Wx2 = 2 give 2 X 2 = 4 and 4 X 2 = 8, so y = (4 + 8) / 2 = 6.
  15. Artificial Neural Network (Linear) Summation: (4 + 8) / 2 It’s easy to see how a neural network’s behavior can be drastically affected by the value of the weighted connections between the neurons (just as the human brain’s synapses modulate electrical pulses between neurons to affect different results). But most problems are too complex to guess the weights beforehand or determine them by trial and error. If only we could get the machine to figure out the weights for itself…
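The average-versus-sum behavior of the two worked examples above can be sketched in a few lines of Python (the function name is illustrative, not from the slides):

```python
def linear_neuron(inputs, weights):
    """Output of the linear network: weight each input, sum,
    then divide by the number of inputs."""
    return sum(x * w for x, w in zip(inputs, weights)) / len(inputs)

# Weights of 1 reproduce the simple average...
print(linear_neuron([2, 4], [1, 1]))  # 3.0
# ...while weights of 2 turn the same network into a summation.
print(linear_neuron([2, 4], [2, 2]))  # 6.0
```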
  16. Supervised Learning Feed Forward Back Propagate x1 x2 y Wx1 Wx2 Prediction Expected Result Error - Supervised Learning is a technique used to find the optimum weights required to fit a dataset using the error between the prediction and the desired answer. It consists of the feed-forward and back-propagation phases. Back Propagation is the process by which a network takes the error between its prediction and the expected result and adjusts its weights to achieve a better prediction.
  17. Supervised Learning Update Rule Increase / decrease weights by prediction error Convergence Network error minimizes, weights stabilize Feed Forward Back Propagate Through many iterations, supervised learning adjusts the network weights using a pre-determined update rule with the hope of achieving network convergence, typically indicated by monotonically decreasing error that eventually plateaus.
  18. Supervised Learning As a simple (trivial) example, examine the table below. In this case, Supervised Learning is used to “train” a network designed to calculate simple averages (weights = 1) to learn to calculate sums instead (weights = 2). The update rule simply adjusts the weights up or down by the percentage of error from the previous iteration. We can see how the network quickly converges on the optimal weights, producing an accurate prediction with correspondingly low error.
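The trivial training run described above can be sketched as follows. For simplicity, both connections share a single weight, and the update rule is the percentage-of-error adjustment the slide describes:

```python
def train(inputs, expected, weight=1.0, iterations=20):
    """Supervised learning on one shared weight: nudge it up or down
    by the percentage of error from the previous iteration."""
    for _ in range(iterations):
        prediction = sum(x * weight for x in inputs) / len(inputs)  # feed forward
        error_pct = (expected - prediction) / expected              # prediction error
        weight *= 1 + error_pct                                     # update rule
    return weight

# Starting from the averaging weight (1.0), the network converges on
# the summing weight (2.0) for the inputs 2 and 4.
print(train([2, 4], expected=6))
```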
  19. Hidden Layer Artificial Neural Network (Non-Linear) h1 h2 y x1 x2 The real power of a neural network, however, is in its ability to handle non-linear problems. Here, we see a non-linear network’s key feature: the hidden layer. Note that each layer is fully connected to the next.
  20. Artificial Neural Network (Non-Linear) h1 h2 y x1 x2 To better understand how the non-linear network operates, it helps to conceive of it as a composition of several linear networks, each following the same rules for calculation and updating.
  21. Artificial Neural Network (Non-Linear) h1 h2 y x1 x2 A non-linear network may have any number of hidden layers and any number of neurons in each layer. Practice shows, however, that most problems do not require more than one hidden layer, with no more neurons than there are in the input or output layers. It can be mathematically demonstrated that a neural network of sufficient complexity is capable of modelling any complex mathematical function. This strength makes neural networks ideally suited for modelling the indirect, subtle relationships in a data set to provide accurate predictions where human experience fails.
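As a concrete illustration of that non-linear power, here is a minimal hidden-layer network with hand-picked (not learned) weights that computes XOR, a classification no single straight line can separate. The bias terms are an addition not shown in the slide diagrams:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully-connected layer followed by a sigmoid activation."""
    return [sigmoid(sum(x * w for x, w in zip(inputs, ws)) + b)
            for ws, b in zip(weights, biases)]

def forward(x):
    """Feed forward through one hidden layer of two neurons to one output."""
    hidden = layer(x, [[20, 20], [-20, -20]], [-10, 30])  # hand-picked weights
    (y,) = layer(hidden, [[20, 20]], [-30])
    return y

# XOR: the output is near 1 only when exactly one input is 1.
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(forward([a, b])))
```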
  22. In 2012, researchers at Google Brain created a network of 16,000 computer processors with over 1 billion connections.
  23. They then trained this network by showing it screen captures from 10 million randomly selected YouTube videos over three days.
  24. At the end of the experiment, researchers discovered the network was able to recognize two things in particular. Can you guess what they were?
  25. At the end of the experiment, researchers discovered the network was able to recognize two things in particular. Can you guess what they were? Image source: www.twitter.com/realgrumpycat Image source: http://www.chroniclelive.co.uk/ People Cats
  26. What made this experiment unique was that the researchers used unsupervised learning, allowing the network to determine for itself the difference between images, rather than telling it in advance what it was looking at (supervised learning). Image source: www.twitter.com/realgrumpycat Image source: http://www.chroniclelive.co.uk/
  27. Data Collection & Modeling Big-data_conew1 by luckey_sun, CC License BY-SA www.flickr.com/photos/75279887@N05/6914441342
  28. Data Sets • {3.14159, 1.333, 42.0} Numeric • {Atlanta, Dallas, Chicago} Unordered Categorical • {Low, Medium, High} Ordered Categorical • What we’re trying to predict Target • Describes the characteristics of the dataset Features / Predictors
  29. Data Sets • What we’re trying to predict Target • Describes the characteristics of the dataset Features / Predictors When determining data and its roles, it’s important to remember that relationships between pieces of data may be complex. For example, to predict housing prices in a housing market, it would be necessary to visit a sample of houses and record their listing price (the target value). Since the listing price is often driven by a series of features of the house (square footage, number of bathrooms, etc.), we need to record that data as well. Further, which features to choose can be tricky, as their relationship to the target value may be complex. Finally, some features may have relationships to other features, an undesirable dynamic to be avoided.
  30. Data Sets • {3.14159, 1.333, 42.0} Numeric • {Atlanta, Dallas, Chicago} Unordered Categorical • {Low, Medium, High} Ordered Categorical • What we’re trying to predict Target • Describes the characteristics of the dataset Features / Predictors Categorical data can be somewhat complex to manage. It is key to note whether or not your categorical data is ordered, as it can substantially affect model accuracy. Preparing text categorical data (as opposed to numbered categories) also requires that there be no alternate spellings or misspellings.
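A minimal sketch of preparing the two categorical examples above for a model, assuming one-hot indicator columns for unordered categories and rank-preserving integers for ordered ones:

```python
def one_hot(value, categories):
    """Unordered categories get one indicator column each."""
    return [1 if value == c else 0 for c in categories]

def ordinal(value, ordered_categories):
    """Ordered categories map to integers that preserve their ranking."""
    return ordered_categories.index(value)

cities = ["Atlanta", "Dallas", "Chicago"]   # unordered
levels = ["Low", "Medium", "High"]          # ordered

print(one_hot("Dallas", cities))   # [0, 1, 0]
print(ordinal("Medium", levels))   # 1
```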
  31. Data Sources Data collection can be very time consuming! Data set sizes: • 10 – 100 million • 500 – 10,000 typical The R and Python languages are well-suited for retrieving and managing data. “He who has the most data, wins.” Data Set Web Spreadsheet Databases Paper / Other
  32. Data Preparation  Clean Data • No missing / incorrect values • No misspelled categorical values • No mixed data types  Tabular Layout • Features and targets in columns • Each row is an “observation” • Avoid duplicate records
  33. Data Preparation Normalization • Values may vary by several orders of magnitude • Larger values have greater influence • Normalization constrains feature values to the same range. • [0,1] and [-1,1] are common ranges. • Generally, ~[-3, 3] is acceptable.
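Min-max scaling is one common way to perform the normalization described above; a minimal sketch (the sample ages are hypothetical):

```python
def normalize(values, lo=0.0, hi=1.0):
    """Rescale a feature column to the [lo, hi] range (min-max scaling)."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

ages = [3, 7, 28, 90, 365]          # e.g. concrete sample ages in days
print(normalize(ages))              # constrained to [0, 1]
print(normalize(ages, -1, 1))       # constrained to [-1, 1]
```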
  34. “Prediction is very difficult, especially about the future.” - Niels Bohr Prediction Image source: http://info.iqms.com/IQMS-Manufacturing-erp-expertise/bid/102603/IQMS-Quality-Assurance-Predictions-With-Carnac-the-Magnificent
  35. Cross-Validation Steps: 1. Split the original data set into train and test sets (80/20). 2. Train the model with the larger portion. 3. Predict with both the training and testing data. 4. Measure the error in the predictions in both data sets. 5. Compare the error of the two data sets. Cross-Validation Establishes how well a model “generalizes” Generalization The ability to accurately predict using previously-unseen data
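The five steps above can be sketched end-to-end with a toy one-parameter model on hypothetical data (all names and values below are illustrative):

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Step 1: shuffle, then hold out a test set the model never sees."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def fit_slope(rows):
    """Step 2: 'train' a one-parameter model y = w * x on the training set."""
    return sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)

def mean_abs_error(w, rows):
    """Steps 3-4: predict, then measure the error on a data set."""
    return sum(abs(y - w * x) for x, y in rows) / len(rows)

# Hypothetical observations: y = 2x plus deterministic noise.
data = [(x, 2 * x + random.Random(x).uniform(-1, 1)) for x in range(1, 101)]
train_rows, test_rows = train_test_split(data)
w = fit_slope(train_rows)
# Step 5: similar train and test error suggests the model generalizes.
print(mean_abs_error(w, train_rows), mean_abs_error(w, test_rows))
```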
  36. Cross-Validation Good fit • The network generalizes well on data it has not seen • Performance on both data sets is similar • Overall error is low Underfit (high bias) • Does not predict well on either data set • Need more data, features, better algorithm Overfit (high variance) • Predicts well on the training data, but not the testing data • Need fewer features, less-powerful algorithm
  37. Measure of Success A measure of success is: A meaningful, context-specific statement of how successfully the model predicts. “On average, the model predicts within __% of the actual value, __% of the time.”
  38. Measure of Success Measures of success allow us to articulate, in a straightforward, simple fashion, the accuracy of the machine learning model without having to resort to technical jargon that may confuse laypersons. Articulating a measure of success also helps us understand just how accurate the model needs to be for the intended purposes.
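The fill-in-the-blank statement above can be computed directly; the 10% tolerance and the sample strengths below are hypothetical:

```python
def success_rate(predictions, actuals, tolerance=0.10):
    """Fraction of predictions within +/- tolerance of the actual value."""
    hits = sum(1 for p, a in zip(predictions, actuals)
               if abs(p - a) <= tolerance * abs(a))
    return hits / len(actuals)

# Hypothetical predicted vs. actual compressive strengths.
actual    = [20.0, 35.0, 41.0, 52.0, 60.0]
predicted = [21.5, 33.0, 45.8, 50.0, 59.0]
rate = success_rate(predicted, actual)
print(f"The model predicts within 10% of the actual value "
      f"{rate:.0%} of the time.")
```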
  39. Case Studies
  40. Overview Compressive Strength of Concrete Samples Image source: http://info.admet.com/blog/topic/compression-test Given a concrete sample’s mix design and age, can we accurately estimate its compressive strength?
  41. Data Profile  Source: University of California, Irvine (UCI) website  1,030 samples (metric units)  Non-Linear  Features: 1. Cement 2. Slag 3. Fly Ash 4. Water 5. Superplasticizer 6. Coarse Aggregate 7. Fine Aggregate 8. Age (days)
  42. Predictions Actual compressive strengths of all samples, sorted from lowest to highest Upper accuracy tolerance (110% of actual) Lower accuracy tolerance (90% of actual)
  43. Predictions The shape of the curve suggests the data distribution is approximately Normal / Gaussian (standard bell curve), evidenced by a steeper slope at lower and higher values (fewer data points) and a flatter slope in the midrange (more data points).
  44. Predictions Generalized Linear Regression (train) Not surprisingly, Linear Regression performs poorly on this (non-linear) data set.
  45. Predictions Generalized Linear Regression (train) A plot of Linear Regression’s prediction success vs. error reveals that even where the error was relatively low in the middle-strength range, there was little consistency in its success rate.
  46. Predictions Random Forest (training) Support Vector Machine (training)
  47. Predictions Random Forest (training) Support Vector Machine (training) Applying two non-linear algorithms (a Random Forest and a Support Vector Machine) yielded much more favorable results, performing very well against the 90% success metric. While these algorithms are not neural networks (and are structurally unrelated), many of the same rules that apply to neural network training also apply here. Note the characteristic spike in the lower end of the predictions of both algorithms. This is likely due to one or a small number of erratic data points. A more detailed investigation into the data set would, hopefully, identify the cause and perhaps suggest changes that could improve the model accuracy.
  48. Predictions Random Forest (training) Support Vector Machine (training) Also note the error / success plots for the two algorithms. Not surprisingly, low success rates occur where the least data is (the low and high ends). However, it’s interesting to note that Random Forest had less error with low-strength predictions, whereas the Support Vector Machine did better with high-strength predictions. This is a common characteristic of machine learning algorithms – each algorithm looks at the data a bit differently. Further, we can use this trait to our advantage by building ensemble networks – combinations of two good networks to create better predictions.
  49. Model Results [1] Test Success: Percentage of time the model is at least 90% accurate on previously-unseen data. [2] Ensemble: Combination of SVM and RF only. [3] Chained Ensemble: Predictions of one ensemble are used as inputs to another.
  50. Model Results Not surprisingly, linear regression performed poorly on the test data (data not previously seen during training). SVM / RF performances were markedly better, though far too low to be useful as production models. The ensemble of the RF and SVM simply computed the average of their predictions. It has been shown that a simple average of two or more well-performing models can outperform those models. While we did not achieve that here, it is interesting to note that the ensemble was not substantially worse than the better model.
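The simple-average ensemble described above amounts to one line of code; the RF and SVM predictions below are made-up placeholders:

```python
def ensemble(pred_a, pred_b):
    """A simple ensemble: average two models' predictions element-wise."""
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]

# Hypothetical predictions: one model errs low, the other errs high,
# so averaging can cancel part of each model's error.
rf_preds  = [18.0, 34.0, 43.0]
svm_preds = [22.0, 38.0, 39.0]
print(ensemble(rf_preds, svm_preds))  # [20.0, 36.0, 41.0]
```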
  51. Model Results The final model, the “chained ensemble,” consisted of two ensembles (two sets of RF and SVM) linked in series. The first ensemble simply computed the average of each network’s predictions. The second ensemble took the original data set and added the first ensemble’s predictions as an extra “feature”. This amounted to giving the second ensemble a “cheat sheet” of the patterns discovered by the first, enabling it to substantially outperform any of the other algorithms and achieve an 87% success rate.
  52. Feature Importance Feature Importance provides a way of preprocessing a data set to help control variability. The depicted binary decision tree takes the data set and splits it in two at the point where the most variation occurs. In this case, we see that the age of the samples (specifically at 21 days) is the first node in the tree. This single observation alone could substantially improve network performance, if we split our data set into two (separating samples at the 21-day age). The trade-off, however, is fewer data points in the resulting data sets, which makes learning patterns more difficult.
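The variance-reducing split a decision tree's root node finds can be sketched by brute force; the age and strength values below are hypothetical, chosen only to echo a 21-day split:

```python
def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_split(xs, ys):
    """Find the threshold on one feature that most reduces the weighted
    target variance, as a binary decision tree's first node does."""
    best_t, best_score = None, float("inf")
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        score = (len(left) * variance(left)
                 + len(right) * variance(right)) / len(ys)
        if score < best_score:
            best_t, best_score = t, score
    return best_t

# Hypothetical ages (days) vs. strengths: strength jumps after 21 days.
ages = [3, 7, 14, 21, 28, 56, 90]
strengths = [10, 14, 18, 22, 40, 44, 46]
print(best_split(ages, strengths))  # 21
```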
  53. So What? What can this technology really do for us? The answer lies in asking the right question.
  54. So What? “Given ______, can we determine _____ with _____ accuracy?”
  55. So What? The question has three key ingredients: • The Givens (features / predictors) • The Goal (target / prediction) • The Accuracy (success rate) Using this format, we can take one data set (like the concrete sample strength data) and use it to answer a variety of unique questions.
  56. So What? Given the strength and mix design, can I determine the time it will take to cure with 95% accuracy? Answering this question is important for contractors and designers who are trying to determine a construction schedule or anticipate how soon a newly-constructed roadway can be opened to live traffic.
  57. So What? Given the compressive strength and cure time, can I determine the most valid mix design with 90% accuracy? Answering this question helps material-testing personnel identify potential causes of substandard materials and make investigations into chronic material quality issues more efficient.
  58. So What? So where can I learn how to use machine learning? Resources exist all over the internet, including: • Online classes • Data repositories • Machine learning tools and cloud-computing services
  59. Additional Resources BigML.com (http://www.bigml.com) On-line machine learning and data visualization tools UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) Wide range of data sets for machine learning applications The R Project (http://www.r-project.org/) Free scripting language for statistical computing and graphics Coursera (http://www.coursera.org) Free on-line college-level courses in technology and other topics Microsoft Azure / Amazon EC2 Cloud computing that provides virtualization and machine learning services
