Introduction to machine learning
1. INTRODUCTION TO MACHINE LEARNING
Pruet Boonma
Faculty of Engineering
Chiang Mai University
pruet@eng.cmu.ac.th
http://screenprism.com/insights/article/why-does-hal-breakdown-and-become-hostile-to-humans-in-2001
2. AGENDA
▪Machine Learning Background
▪Introduction to R
▪Classification: Nearest Neighbors, Naïve Bayes, Decision Trees
▪Forecasting with Regression Methods
▪Pattern Recognition with Association Rules
▪Clustering with K-means
▪Black Box Models: Neural Networks, SVM
3. HORIZON OF DATA ANALYTICS
▪Data Mining/Data Science
Extract patterns from data
▪Big Data
3Vs: Volume, Velocity, Variety
▪Artificial Intelligence
Machine Learning
Deep Learning
http://www.kdnuggets.com/2016/03/data-science-puzzle-explained.html
5. ARTIFICIAL INTELLIGENCE: A PURPOSE
▪Artificial Intelligence was first proposed in the 1950s to construct complex machines that exhibit some characteristics of human intelligence.
General AI: machines that have all of a human’s senses, reason, intuition, and imagination, and that think just like us.
Narrow AI: technology that is able to perform a specific task as well as, or better than, a human can.
6. EXAMPLE OF NARROW AI
▪Face recognition on Facebook, image classification on Pinterest, spam detection on Gmail, spelling suggestions on Google.
The computer makes a guess at what we really want to search for.
7. MACHINE LEARNING: AN APPROACH
▪Machine learning started to flourish in the 1990s as an approach to narrow AI.
▪“Ability to learn without being explicitly programmed” – Arthur Samuel (1959)
▪“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E” – Tom M. Mitchell (1997)
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html
8. EXAMPLE OF A MACHINE LEARNING PROBLEM
▪We have to separate chickens into two groups; I’ll show you how, then you can do the rest.
Task: separate chickens
Experience: I’ll show you how
Performance: chickens separated correctly
9. MACHINE LEARNING APPROACH
▪Instead of hard-coding a program to perform a task, the machine is “trained” using large amounts of data and algorithms that learn how to perform the task.
10. TYPES OF PROBLEMS/TASKS
▪Supervised Learning
The computer is presented with a training data set (example inputs and desired outputs); the goal is to learn from it.
▪Unsupervised Learning
No training data set; the computer needs to find structure in the given input.
▪Reinforcement Learning
The computer produces output from given input; the output is then evaluated and feedback is given to the computer.
17. DEEP LEARNING: A TECHNIQUE
▪The human brain is a network of a huge number of small computing devices (neurons).
▪An artificial neural network emulates the human brain by creating a network of simple computing units.
Wikipedia
18. DEEP LEARNING
▪In 2012, Andrew Ng, then with Google, proposed using a huge neural network with many layers.
“Deep” in deep learning refers to the many layers.
▪One of the first applications of deep learning was image recognition.
Now it’s used in many areas, including Go playing and self-driving cars.
http://www.popularmechanics.com/technology/a19863/googles-alphago-ai-wins-second-game-go/
19. HOW A MACHINE LEARNS
▪In machine learning, machines make sense of data by creating models, then generalizing those models so they support new data.
20. DATA STORAGE
▪Humans have recorded data since the dawn of history.
▪Electronic sensors contribute to the explosion of data.
Wireless Sensor Networks, Internet of Things, Big Data
▪The amount of data is beyond human comprehension.
▪Garbage in, garbage out.
https://www.slideshare.net/Sparkhound/spinning-brown-donuts-why-storage-still-counts
21. ABSTRACTION
▪Assigns meaning to stored data.
▪The computer summarizes stored raw data using a model.
An explicit description of the patterns within the data.
Math equations, trees/graphs, logical rules, clusters.
▪Typically, it is up to a human to choose the model.
The process of fitting a model to a dataset is called training.
https://arnesund.com/2015/05/31/using-amazon-machine-learning-to-predict-the-weather/
22. GENERALIZATION
▪The production of abstractions can be limited; the idea needs to be generalized to be usable in the future.
On tasks that are similar, but not identical.
Infer from the existing models to the new input.
Heuristics, i.e., educated guesses, can be used to find the most useful inference.
(Image: a chick, with the inference “This is also a chick.”)
24. EVALUATION
▪With limited training data, bias can occur in the model.
Does a chicken need to be yellow?
▪After training, the model is evaluated with a test dataset.
To judge how well the model generalizes to unseen data.
The model may also fit noise, i.e., unexplained variation in the data.
25. MACHINE LEARNING IN PRACTICE
▪Machine learning process
Data collection: gathering data in a consumable format.
Data exploration and preparation: learn the characteristics of the data, eliminate unnecessary data, change formats, etc.
Model training: a human chooses the algorithms to be used and observes the resulting model.
Model evaluation: models are tested against a test dataset.
Model improvement: change the algorithm, add more data, etc.
26. TYPES OF INPUT DATA
▪Data is expressed in units of observation.
A set of properties of a person, object, transaction, geographic region, time point, or measurement.
Units can be combined, e.g., person-year: one person’s data for one year.
▪A collection of data consists of:
Examples: instances of the unit of observation
Features: recorded properties or attributes
27. TYPES OF FEATURES
▪Numeric: measured in numbers, i.e., a quantitative property.
▪Categorical/nominal: a set of categories, i.e., a qualitative property.
▪Ordinal: a nominal variable with categories falling in an ordered list, e.g., {small, medium, large}.
28. TYPES OF MACHINE LEARNING ALGORITHMS
▪Predictive model: prediction of one value using the other values in the data set.
Chance of rain tomorrow?
▪Classification model: predicting which category an example belongs to.
Is this email spam?
▪Descriptive model: insight gained from summarizing data in new and interesting ways.
What do people also buy when they buy milk?
▪Clustering: identifies groups of examples with similar properties.
How many types of customers shop at a grocery store?
29. INTRODUCTION TO R
▪An open-source programming language and software environment for statistical computing.
Used by statisticians and data miners for developing statistical software and for data analysis.
From: Wikipedia
30. USING R
▪Start RGui
▪Load internal data
The iris data set
▪Load external data
From a CSV file
▪Preview the data
▪Plot the data
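A minimal sketch of this workflow in base R; the particular columns plotted are illustrative choices, not from the slides:

  data(iris)             # load the built-in iris data set
  head(iris)             # preview the first six examples
  summary(iris)          # per-feature summary statistics
  plot(iris$Sepal.Length, iris$Sepal.Width,      # scatter plot of two features
       col = iris$Species,                       # color the points by class
       xlab = "Sepal length", ylab = "Sepal width")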
33. LOAD EXTERNAL DATA
▪We will use data from
https://www.data.go.th/DatasetDetail.aspx?id=7049410f-5bb8-4c75-9e94-112ca18b63e2
▪Load the CSV file and reformat the data to make it consumable.
34. LOAD EXTERNAL DATA
▪Save to a CSV file, e.g., household.csv
▪Load it into R using the read_csv() command from the readr library
(Screenshot: the command annotated with its file name, assignment operator, and data set name.)
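A sketch of the annotated command; the data set name household follows the slide’s household.csv example:

  library(readr)                          # provides read_csv()
  household <- read_csv("household.csv")  # data set name <- contents of the file
  # `<-` is R’s assignment operator; "household.csv" is the file name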
35. PREVIEW DATA
▪names(): view or change feature names
▪head(): show the feature names and the first few examples
▪c(…): combine values into a vector
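A minimal sketch of these commands on the iris data set; the short names are illustrative:

  names(iris)                    # view the feature names
  head(iris)                     # show the first six examples
  df <- iris
  names(df) <- c("SL", "SW", "PL", "PW", "Class")  # rename features using a vector from c()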
37. PREVIEW
▪A data set is indexed like a matrix: df[row, col]
▪To change the order of the columns, use c() to assign the new order.
An empty index before the comma means that we need all rows.
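A sketch of this indexing, using iris as the data frame df:

  df <- iris
  df[1:3, ]                   # rows 1–3, all columns
  df[, c(5, 1, 2, 3, 4)]      # all rows (empty index before the comma), columns reordered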
38. PLOT THE DATA
▪Boxplot
Shows the distribution of the data.
http://www.physics.csbsju.edu/stats/box2.html
An empty index before the comma means that we need all rows, but only columns 2–11.
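A sketch of the boxplot call, assuming the household data frame loaded earlier and that its columns 2–11 are numeric:

  boxplot(household[, 2:11])   # all rows, columns 2–11; one box per feature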
42. CLASSIFICATION
▪Group data based on minimal distance.
▪Applications
Computer vision: is this animal a cat or a dog?
Recommender systems: which book will this user enjoy?
▪Examples of classification techniques
k-Nearest Neighbors (k-NN), Naïve Bayes, Decision Trees
44. CLASSIFICATION WITH DECISION TREES
▪Makes complex decisions based on a series of simple conditions.
▪Utilizes a tree structure to model the relationships among features and classes.
▪Based on the concept of recursive partitioning.
46. DECISION TREE WITH IRIS DATASET
▪Tutorial:
http://www.rdatamining.com/examples/decision-tree
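A minimal sketch in the spirit of that tutorial, assuming the party package and its ctree() function:

  library(party)                                   # conditional inference trees
  iris_ctree <- ctree(Species ~ ., data = iris)    # grow a tree predicting Species
  print(iris_ctree)                                # show the learned decision rules
  plot(iris_ctree)                                 # draw the tree
  table(predict(iris_ctree), iris$Species)         # confusion matrix on the training data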
47. K-NN
▪Nearest neighbor classifiers assign unlabeled examples to the class of similar labeled examples.
▪k-NN is one of the simplest yet most effective classifiers.
No assumptions about the underlying data distribution
Fast training but slow classification
Requires selection of an appropriate k
Not suitable for nominal data without additional processing
49. SIMILARITY MEASUREMENT
▪Similarity can be measured by a distance function in n-dimensional space.
Traditionally, Euclidean distance is used. Let p1 refer to the value of the first feature of p, and q1 to the first feature of q:
dist(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)
For the tomato (sweetness = 6, crunchiness = 4) and the green bean (sweetness = 3, crunchiness = 7):
dist(tomato, bean) = √((6 − 3)² + (4 − 7)²) = √18 ≈ 4.24
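The same computation as a one-liner in R:

  sqrt(sum((c(6, 4) - c(3, 7))^2))   # Euclidean distance, ≈ 4.24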
50. K-NN EXAMPLE
▪tomato (sweetness = 6, crunchiness = 4)
▪If we calculate the distance to its single closest neighbor, this is called 1-NN because k = 1.
51. K-NN EXAMPLE
▪If k = 3, k-NN performs a vote among the three nearest neighbors.
The majority class among them is fruit, so the tomato is a fruit.
▪So the question is: what is an appropriate k?
52. APPROPRIATE K
▪A large k can lead to underfitting.
▪A small k can lead to overfitting.
▪One common practice is to use the square root of the number of training examples as k.
▪Try many values of k and observe the results.
53. PREPARING DATA FOR K-NN
▪Rescaling features
Min-max normalization: x' = (x − min(x)) / (max(x) − min(x))
▪For nominal data, dummy coding can be used.
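A sketch of both preparations in R; the normalize() helper is an illustration, not a built-in:

  normalize <- function(x) (x - min(x)) / (max(x) - min(x))  # min-max rescaling to [0, 1]
  iris_n <- as.data.frame(lapply(iris[, 1:4], normalize))    # rescale the numeric features
  # Dummy coding: one 0/1 indicator column per category level
  model.matrix(~ Species - 1, data = iris)[c(1, 51, 101), ]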
54. K-NN WITH THE IRIS DATA SET
▪If you don’t have the iris data set, load it from
http://archive.ics.uci.edu/ml/
▪Tutorial
https://www.datacamp.com/community/tutorials/machine-learning-in-r#one
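A minimal sketch in the spirit of that tutorial, assuming the class package; the split size and seed are arbitrary choices:

  library(class)                                             # provides knn()
  normalize <- function(x) (x - min(x)) / (max(x) - min(x))
  iris_n <- as.data.frame(lapply(iris[, 1:4], normalize))    # rescale features (slide 53)
  set.seed(1)
  idx <- sample(nrow(iris), 100)                             # 100 training examples
  pred <- knn(train = iris_n[idx, ], test = iris_n[-idx, ],
              cl = iris$Species[idx],
              k = round(sqrt(100)))                          # square-root heuristic (slide 52)
  table(pred, iris$Species[-idx])                            # confusion matrix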
55. NAÏVE BAYES
▪“70 percent chance of rain”
▪Uses data from past events to extrapolate to future events.
▪A 70 percent chance of rain implies that in 7 out of 10 past cases with similar conditions, it rained in the area.
▪The naïve Bayes classifier is based on Bayesian methods.
56. BAYESIAN METHODS
▪Developed by Thomas Bayes in the 18th century to describe the probability of events.
▪Estimate the likelihood of an event based on the evidence at hand across multiple trials.
57. BAYESIAN METHODS
▪Classifiers utilize training data to calculate an observed probability of each outcome based on the evidence provided by feature values.
▪Later, when applied to unlabeled data, the classifier uses the observed probabilities to predict the most likely class.
▪Probability is a number between 0 and 1, i.e., a 0% to 100% chance.
58. PROBABILITY
▪If it rained 3 out of 10 days with similar conditions as today, the probability of rain is estimated as 3/10 = 0.30.
▪The probability of event A is P(A), e.g., P(rain) = 0.30.
▪So if a trial has only two outcomes, e.g., rain and not rain, P(not rain) = 1 − 0.30 = 0.70.
Mutually exclusive and exhaustive
(Diagram: Rain (0.30) vs. Not rain (0.70).)
59. PROBABILITY
▪If a second event is observed together with the first event, they may have a joint probability.
The chance of wind is 0.10.
It overlaps with rain; this implies that not all windy days will be rainy days, and vice versa.
▪So, with P(rain) = 0.30 and P(windy) = 0.10, the chance that both rain and wind occur is written as P(rain ∩ windy).
(Venn diagram: Windy (0.10) overlapping Rain (0.30), within Not rain (0.70).)
60. PROBABILITY
▪Calculating P(rain ∩ windy) depends on the joint probability of the two events.
If the two events are totally unrelated, they are called independent events.
But if all events were independent, it would be impossible to predict one by observing another.
Dependent events are the basis of predictive modeling.
61. PROBABILITY
▪Calculating the joint probability of independent events is simple:
P(rain ∩ windy) = P(rain) * P(windy)
P(A ∩ B) = P(A) * P(B)
▪Calculating it for dependent events is more complex; that is where Bayes’ theorem comes in.
62. BAYES’ THEOREM
▪The relationship between dependent events is
P(A|B) = P(A ∩ B) / P(B)
▪P(A|B) = probability of event A given that event B occurred (AKA conditional probability)
▪P(A ∩ B) = probability that A and B occurred together
▪P(B) = probability of B alone
63. BAYES’ THEOREM
▪By definition, P(A ∩ B) = P(A|B) * P(B) = P(B|A) * P(A), so
P(A|B) = P(B|A) * P(A) / P(B)
where P(A|B) is the posterior probability, P(B|A) the likelihood, P(A) the prior probability, and P(B) the marginal likelihood.
64. EXAMPLE: SPAM DETECTION
▪What is the probability that an email is spam when the word “Viagra” appears inside it?
65. EXAMPLE: SPAM DETECTION
▪Construct a frequency table.
Record the number of times Viagra appeared in spam and ham messages.
(Frequency table, reconstructed from the counts used here and on the next slide: 20 of 100 messages are spam; Viagra appears in 4 spam and 1 ham message.)
▪Observation:
P(Viagra=Yes|spam) = 4/20 = 0.20 indicates a 20% chance that a spam message contains the word Viagra.
66. EXAMPLE: SPAM DETECTION
▪Since P(A ∩ B) = P(B|A) * P(A), P(spam ∩ Viagra) = P(Viagra|spam) * P(spam) = 4/20 * 20/100 = 0.04
P(spam ∩ Viagra) == P(spam=yes ∩ Viagra=yes)
▪So, P(spam|Viagra) = P(Viagra|spam) * P(spam) / P(Viagra) = (4/20) * (20/100) / (5/100) = 0.80
So there is an 80% chance that a message is spam, given that it contains the word Viagra.
67. NAÏVE BAYES ALGORITHM
▪Applies Bayes’ theorem to classification problems.
Simple, fast, effective
Works well with large numbers of examples
Easy to understand and explain
Assumes events are independent, which might not truly hold
68. NAÏVE BAYES WITH IRIS DATASET
▪Tutorial
http://rischanlab.github.io/NaiveBayes.html
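A minimal sketch in the spirit of that tutorial, assuming the e1071 package:

  library(e1071)                                   # provides naiveBayes()
  model <- naiveBayes(Species ~ ., data = iris)    # estimate class-conditional probabilities
  pred <- predict(model, iris[, 1:4])              # classify the examples
  table(pred, iris$Species)                        # confusion matrix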
69. FORECASTING NUMERIC DATA
▪Mathematical relationships help us understand aspects of everyday life.
Body weight is a function of one’s calorie intake.
In more detail, 250 kcal consumed results in nearly a kilogram of weight.
▪This might not perfectly fit every situation, but it can be reasonably correct.
70. REGRESSION
▪Regression is concerned with specifying the relationship between a dependent variable and independent variables.
The variables are numeric.
▪We use the independent variables to predict the dependent variable.
▪There are many forms of regression; the simplest form is a straight line,
i.e., linear regression.
72. LINEAR REGRESSION
▪Simple linear regression: a single independent variable.
▪Multiple linear regression: multiple independent variables.
a.k.a. multiple regression
73. SIMPLE LINEAR REGRESSION EXAMPLE
▪Distress events (dependent) vs. temperature (independent)
y = 3.70 − 0.048x
74. SIMPLE LINEAR REGRESSION
▪How do we find the best a and b in y = a + bx?
Ordinary least squares (OLS): the slope and intercept are chosen to minimize the sum of the squared errors:
minimize Σᵢ (yᵢ − (a + b·xᵢ))²
76. SIMPLE LINEAR REGRESSION
▪The variance of x, i.e., Var(x), can be expressed as
Var(x) = (1/n) Σᵢ (xᵢ − x̄)²
▪The covariance of x and y, i.e., Cov(x, y), can be expressed as
Cov(x, y) = (1/n) Σᵢ (xᵢ − x̄)(yᵢ − ȳ)
▪So
b = Cov(x, y) / Var(x),  a = ȳ − b·x̄
77. LINEAR REGRESSION WITH IRIS DATASET
▪Tutorial:
http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/iris_lm/
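A minimal sketch of simple linear regression on iris in base R; the particular pair of features is an illustrative choice:

  fit <- lm(Petal.Width ~ Petal.Length, data = iris)   # OLS fit: y = a + b·x
  coef(fit)                                            # intercept a and slope b
  plot(iris$Petal.Length, iris$Petal.Width)            # scatter plot of the data
  abline(fit)                                          # overlay the fitted line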
78. OTHER REGRESSIONS
▪Logistic regression: models a binary categorical outcome
▪Poisson regression: models integer count data
▪Multinomial logistic regression: models a categorical outcome
79. ASSOCIATION RULES
▪Market basket analysis = barcodes + inventory systems + personalized shopping profiles
▪Itemset = a group of items people bought together
{bread, peanut butter, jelly}
▪The result of market basket analysis is a set of association rules.
Patterns found in the relationships among items in itemsets.
{peanut butter, jelly} -> {bread}
80. ASSOCIATION RULES
▪The Apriori algorithm, introduced in 1994, is an efficient algorithm for finding association rules.
Works with large amounts of data
Resulting rules are easy to understand
Not good for small data sets
82. APRIORI ALGORITHM: SUPPORT AND CONFIDENCE
▪Support measures how frequently an itemset occurs in the data: support(X) = count(X) / N.
support({get well card, flowers}) = 3/5 = 0.6
support({flowers} -> {get well card}) = 3/5 = 0.6
support({flowers}) = 4/5 = 0.8
▪Confidence measures predictive power/accuracy: confidence(X -> Y) = support(X ∪ Y) / support(X).
confidence({flowers} -> {get well card}) = 0.6/0.8 = 0.75
83. APRIORI PRINCIPLE: BUILDING A SET OF RULES
▪Identify all the itemsets that meet a minimum support threshold.
▪Create rules from these itemsets, keeping those that meet a minimum confidence threshold.
84. ASSOCIATION RULES WITH GROCERIES DATASET
▪Tutorial:
http://www.salemmarafi.com/code/market-basket-analysis-with-r/
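A minimal sketch in the spirit of that tutorial, assuming the arules package and its built-in Groceries data set; the thresholds are illustrative:

  library(arules)                                        # provides apriori()
  data(Groceries)                                        # grocery transaction data
  rules <- apriori(Groceries,
                   parameter = list(support = 0.01,      # minimum support threshold
                                    confidence = 0.5))   # minimum confidence threshold
  inspect(head(sort(rules, by = "lift"), 3))             # top three rules by lift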
85. CLUSTERING
▪Clustering is an unsupervised learning task that automatically divides the data into clusters.
No need to be provided with classes as training output.
▪Data can be grouped (clustered) by the similarity (patterns) of their features.
86. K-MEANS CLUSTERING ALGORITHM
▪The most commonly used and well-studied clustering algorithm.
Simple to learn
Might not find optimal clusters
Requires guessing the value of k
▪K-means assigns each example to one of k clusters.
Minimizes the differences within each cluster and maximizes the differences between clusters.
87. K-MEANS CLUSTERING
1. Choose k marker points randomly.
2. Calculate the distance from each example to each marker point; assign the example to the closest marker point.
3. Update each marker’s location by calculating the centroid of its group of examples.
4. Repeat steps 2–3 until stable.
90. K-MEANS WITH THE IRIS DATA SET
▪Tutorial: http://rischanlab.github.io/Kmeans.html
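A minimal sketch in the spirit of that tutorial, using base R’s kmeans(); k = 3 matches the three iris species:

  set.seed(1)
  km <- kmeans(iris[, 1:4], centers = 3)    # runs the steps above until stable
  table(km$cluster, iris$Species)           # compare clusters to the true classes
  plot(iris$Petal.Length, iris$Petal.Width,
       col = km$cluster)                    # visualize the clusters on two features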