Introduction to machine learning
1. INTRODUCTION TO MACHINE LEARNING
Pruet Boonma
Faculty of Engineering
Chiang Mai University
pruet@eng.cmu.ac.th
http://screenprism.com/insights/article/why-does-hal-breakdown-and-become-hostile-to-humans-in-2001
2. AGENDA
▪Machine Learning Background
▪Introduction to R
▪Classification: Nearest Neighbors, Naïve Bayes, Decision Trees
▪Forecasting with Regression Methods
▪Pattern Recognition with Association Rules
▪Clustering with K-means
▪Black Box Models: Neural Networks, SVM
3. HORIZON OF DATA ANALYTICS
▪Data Mining/Data Science
Extract patterns from data
▪Big Data
3Vs: Volume, Velocity, Variety
▪Artificial Intelligence
Machine Learning
Deep Learning
http://www.kdnuggets.com/2016/03/data-science-puzzle-explained.html
5. ARTIFICIAL INTELLIGENCE: A PURPOSE
▪Artificial Intelligence was first proposed in the 1950s to construct complex machines that exhibit some characteristics of human intelligence.
General AI: machines that have all of a human’s senses, reason, intuition, and imagination, and that think just like us.
Narrow AI: technology that is able to perform a specific task as well as, or better than, a human can.
6. EXAMPLE OF NARROW AI
▪Face recognition on Facebook, image classification on Pinterest, spam detection on Gmail, spelling suggestions on Google.
The computer makes a guess at what we really want to search for.
7. MACHINE LEARNING: AN APPROACH
▪Machine learning started to flourish in the 1990s as an approach to narrow AI.
▪“Ability to learn without being explicitly programmed” – Arthur Samuel (1959)
▪“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E” – Tom M. Mitchell (1997)
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/mlbook.html
8. EXAMPLE OF A MACHINE LEARNING PROBLEM
▪We have to separate chickens into two groups; I’ll show you how, then you can do the rest.
Task: separate chickens
Experience: I’ll show you how
Performance: chickens separated correctly
9. MACHINE LEARNING APPROACH
▪Instead of hard-coding a program to perform a task, the machine is “trained” using large amounts of data and algorithms that learn how to perform the task.
10. TYPES OF PROBLEMS/TASKS
▪Supervised Learning
The computer is presented with a training data set (example inputs and desired outputs); the goal is to learn from it.
▪Unsupervised Learning
No training data set; the computer needs to find structure in the given input.
▪Reinforcement Learning
The computer produces output from given input; the output is then evaluated and feedback is given to the computer.
17. DEEP LEARNING: A TECHNIQUE
▪The human brain is a network of a huge number of small computing devices (neurons).
▪An artificial neural network emulates the human brain by creating a network of simple computing units.
Wikipedia
18. DEEP LEARNING
▪In 2012, Andrew Ng, then with Google, proposed using a huge neural network with many layers.
“Deep” in deep learning refers to the many layers.
▪One of the first applications of deep learning was image recognition.
Now it’s used in many areas, including Go playing and self-driving cars.
http://www.popularmechanics.com/technology/a19863/googles-alphago-ai-wins-second-game-go/
19. HOW A MACHINE LEARNS
▪In machine learning, machines make sense of data by creating models, then generalizing those models so they support new data.
20. DATA STORAGE
▪Humans have recorded data since the dawn of history.
▪Electronic sensors contribute to the explosion of data.
Wireless Sensor Networks, Internet of Things, Big Data
▪The amount of data is beyond human comprehension.
▪Garbage in, garbage out.
https://www.slideshare.net/Sparkhound/spinning-brown-donuts-why-storage-still-counts
21. ABSTRACTION
▪Assigns meaning to stored data.
▪The computer summarizes stored raw data using a model.
An explicit description of the patterns within the data.
Math equations, trees/graphs, logical rules, clusters.
▪Typically, it is up to a human to choose the model.
The process of fitting a model to a dataset is called training.
https://arnesund.com/2015/05/31/using-amazon-machine-learning-to-predict-the-weather/
22. GENERALIZATION
▪The production of abstractions can be limited; the idea needs to be generalized to be usable in the future.
On tasks that are similar, but not identical.
Infer from the existing models to the new input.
Heuristics, i.e., educated guesses, can be used to find the most useful inference.
(Image: a chick, with the inference “This is also a chick.”)
24. EVALUATION
▪With limited training data, bias can occur in the model.
Does a chicken need to be yellow?
▪After training, the model is evaluated with a test dataset.
To judge how well the model generalizes to unseen data.
The model may also fit noise, i.e., unexplained variation in the data.
25. MACHINE LEARNING IN PRACTICE
▪Machine learning process
Data collection: gathering data in a consumable format.
Data exploration and preparation: learn the characteristics of the data, eliminate unnecessary data, change formats, etc.
Model training: a human chooses the algorithms to be used and observes the resulting model.
Model evaluation: models are tested against a test dataset.
Model improvement: change the algorithm, add more data, etc.
26. TYPES OF INPUT DATA
▪Data is expressed in units of observation.
A set of properties of a person, object, transaction, geographic region, time point, or measurement.
Units can be combined, e.g., person-year: one person’s data for one year.
▪A collection of data consists of:
Examples: instances of the unit of observation
Features: recorded properties or attributes
27. TYPES OF FEATURES
▪Numeric: measured in numbers, i.e., a quantitative property.
▪Categorical/nominal: a set of categories, i.e., a qualitative property.
▪Ordinal: a nominal variable with categories falling in an ordered list, e.g., {small, medium, large}.
28. TYPES OF MACHINE LEARNING ALGORITHMS
▪Predictive model: prediction of one value using the other values in the data set.
Chance of rain tomorrow?
▪Classification model: predicting which category an example belongs to.
Is this email spam?
▪Descriptive model: insight gained from summarizing data in new and interesting ways.
What do people also buy when they buy milk?
▪Clustering: identifies groups of examples with similar properties.
How many types of customers shop at a grocery store?
29. INTRODUCTION TO R
▪An open-source programming language and software environment for statistical computing.
Used by statisticians and data miners for developing statistical software and for data analysis.
From: Wikipedia
30. USING R
▪Start RGui
▪Load internal data
The iris data set
▪Load external data
From a CSV file
▪Preview the data
▪Plot the data
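A minimal sketch of this workflow in base R; the particular columns plotted are illustrative choices, not from the slides:

  data(iris)             # load the built-in iris data set
  head(iris)             # preview the first six examples
  summary(iris)          # per-feature summary statistics
  plot(iris$Sepal.Length, iris$Sepal.Width,      # scatter plot of two features
       col = iris$Species,                       # color the points by class
       xlab = "Sepal length", ylab = "Sepal width")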
33. LOAD EXTERNAL DATA
▪We will use data from
https://www.data.go.th/DatasetDetail.aspx?id=7049410f-5bb8-4c75-9e94-112ca18b63e2
▪Load the CSV file and reformat the data to make it consumable.
34. LOAD EXTERNAL DATA
▪Save to a CSV file, e.g., household.csv
▪Load it into R using the read_csv() command from the readr library
(Screenshot: the command annotated with its file name, assignment operator, and data set name.)
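A sketch of the annotated command; the data set name household follows the slide’s household.csv example:

  library(readr)                          # provides read_csv()
  household <- read_csv("household.csv")  # data set name <- contents of the file
  # `<-` is R’s assignment operator; "household.csv" is the file name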
35. PREVIEW DATA
▪names(): view or change feature names
▪head(): show the feature names and the first few examples
▪c(…): combine values into a vector
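A minimal sketch of these commands on the iris data set; the short names are illustrative:

  names(iris)                    # view the feature names
  head(iris)                     # show the first six examples
  df <- iris
  names(df) <- c("SL", "SW", "PL", "PW", "Class")  # rename features using a vector from c()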
37. PREVIEW
▪A data set is indexed like a matrix: df[row, col]
▪To change the order of the columns, use c() to assign the new order.
An empty index before the comma means that we need all rows.
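A sketch of this indexing, using iris as the data frame df:

  df <- iris
  df[1:3, ]                   # rows 1–3, all columns
  df[, c(5, 1, 2, 3, 4)]      # all rows (empty index before the comma), columns reordered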
38. PLOT THE DATA
▪Boxplot
Shows the distribution of the data.
http://www.physics.csbsju.edu/stats/box2.html
An empty index before the comma means that we need all rows, but only columns 2–11.
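A sketch of the boxplot call, assuming the household data frame loaded earlier and that its columns 2–11 are numeric:

  boxplot(household[, 2:11])   # all rows, columns 2–11; one box per feature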
42. CLASSIFICATION
▪Group data based on minimal distance.
▪Applications
Computer vision: is this animal a cat or a dog?
Recommender systems: which book will this user enjoy?
▪Examples of classification techniques
k-Nearest Neighbors (k-NN), Naïve Bayes, Decision Trees
44. CLASSIFICATION WITH DECISION TREES
▪Makes complex decisions based on a series of simple conditions.
▪Utilizes a tree structure to model the relationships among features and classes.
▪Based on the concept of recursive partitioning.
46. DECISION TREE WITH IRIS DATASET
▪Tutorial:
http://www.rdatamining.com/examples/decision-tree
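A minimal sketch in the spirit of that tutorial, assuming the party package and its ctree() function:

  library(party)                                   # conditional inference trees
  iris_ctree <- ctree(Species ~ ., data = iris)    # grow a tree predicting Species
  print(iris_ctree)                                # show the learned decision rules
  plot(iris_ctree)                                 # draw the tree
  table(predict(iris_ctree), iris$Species)         # confusion matrix on the training data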
47. K-NN
▪Nearest neighbor classifiers assign unlabeled examples to the class of similar labeled examples.
▪k-NN is one of the simplest yet most effective classifiers.
No assumptions about the underlying data distribution
Fast training but slow classification
Requires selection of an appropriate k
Not suitable for nominal data without additional processing
49. SIMILARITY MEASUREMENT
▪Similarity can be measured by a distance function in n-dimensional space.
Traditionally, Euclidean distance is used. Let p1 refer to the value of the first feature of p, and q1 to the first feature of q:
dist(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)
For the tomato (sweetness = 6, crunchiness = 4) and the green bean (sweetness = 3, crunchiness = 7):
dist(tomato, bean) = √((6 − 3)² + (4 − 7)²) = √18 ≈ 4.24
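The same computation as a one-liner in R:

  sqrt(sum((c(6, 4) - c(3, 7))^2))   # Euclidean distance, ≈ 4.24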
50. K-NN EXAMPLE
▪tomato (sweetness = 6, crunchiness = 4)
▪If we calculate the distance to its single closest neighbor, this is called 1-NN because k = 1.
51. K-NN EXAMPLE
▪If k = 3, k-NN performs a vote among the three nearest neighbors.
The majority class among them is fruit, so the tomato is a fruit.
▪So the question is: what is an appropriate k?
52. APPROPRIATE K
▪A large k can lead to underfitting.
▪A small k can lead to overfitting.
▪One common practice is to use the square root of the number of training examples as k.
▪Try many values of k and observe the results.
53. PREPARING DATA FOR K-NN
▪Rescaling features
Min-max normalization: x' = (x − min(x)) / (max(x) − min(x))
▪For nominal data, dummy coding can be used.
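A sketch of both preparations in R; the normalize() helper is an illustration, not a built-in:

  normalize <- function(x) (x - min(x)) / (max(x) - min(x))  # min-max rescaling to [0, 1]
  iris_n <- as.data.frame(lapply(iris[, 1:4], normalize))    # rescale the numeric features
  # Dummy coding: one 0/1 indicator column per category level
  model.matrix(~ Species - 1, data = iris)[c(1, 51, 101), ]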
54. K-NN WITH THE IRIS DATA SET
▪If you don’t have the iris data set, load it from
http://archive.ics.uci.edu/ml/
▪Tutorial
https://www.datacamp.com/community/tutorials/machine-learning-in-r#one
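A minimal sketch in the spirit of that tutorial, assuming the class package; the split size and seed are arbitrary choices:

  library(class)                                             # provides knn()
  normalize <- function(x) (x - min(x)) / (max(x) - min(x))
  iris_n <- as.data.frame(lapply(iris[, 1:4], normalize))    # rescale features (slide 53)
  set.seed(1)
  idx <- sample(nrow(iris), 100)                             # 100 training examples
  pred <- knn(train = iris_n[idx, ], test = iris_n[-idx, ],
              cl = iris$Species[idx],
              k = round(sqrt(100)))                          # square-root heuristic (slide 52)
  table(pred, iris$Species[-idx])                            # confusion matrix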
55. NAÏVE BAYES
▪“70 percent chance of rain”
▪Uses data from past events to extrapolate to future events.
▪A 70 percent chance of rain implies that in 7 out of 10 past cases with similar conditions, it rained in the area.
▪The naïve Bayes classifier is based on Bayesian methods.
56. BAYESIAN METHODS
▪Developed by Thomas Bayes in the 18th century to describe the probability of events.
▪Estimate the likelihood of an event based on the evidence at hand across multiple trials.
57. BAYESIAN METHODS
▪Classifiers utilize training data to calculate an observed probability of each outcome based on the evidence provided by feature values.
▪Later, when applied to unlabeled data, the classifier uses the observed probabilities to predict the most likely class.
▪Probability is a number between 0 and 1, i.e., a 0% to 100% chance.
58. PROBABILITY
▪If it rained 3 out of 10 days with similar conditions as today, the probability of rain is estimated as 3/10 = 0.30.
▪The probability of event A is P(A), e.g., P(rain) = 0.30.
▪So if a trial has only two outcomes, e.g., rain and not rain, P(not rain) = 1 − 0.30 = 0.70.
Mutually exclusive and exhaustive
(Diagram: Rain (0.30) vs. Not rain (0.70).)
59. PROBABILITY
▪If a second event is observed together with the first event, they may have a joint probability.
The chance of wind is 0.10.
It overlaps with rain; this implies that not all windy days will be rainy days, and vice versa.
▪So, with P(rain) = 0.30 and P(windy) = 0.10, the chance that both rain and wind occur is written as P(rain ∩ windy).
(Venn diagram: Windy (0.10) overlapping Rain (0.30), within Not rain (0.70).)
60. PROBABILITY
▪Calculating P(rain ∩ windy) depends on the joint probability of the two events.
If the two events are totally unrelated, they are called independent events.
But if all events were independent, it would be impossible to predict one by observing another.
Dependent events are the basis of predictive modeling.
61. PROBABILITY
▪Calculating the joint probability of independent events is simple:
P(rain ∩ windy) = P(rain) * P(windy)
P(A ∩ B) = P(A) * P(B)
▪Calculating it for dependent events is more complex; that is where Bayes’ theorem comes in.
62. BAYES’ THEOREM
▪The relationship between dependent events is
P(A|B) = P(A ∩ B) / P(B)
▪P(A|B) = probability of event A given that event B occurred (AKA conditional probability)
▪P(A ∩ B) = probability that A and B occurred together
▪P(B) = probability of B alone
63. BAYES’ THEOREM
▪By definition, P(A ∩ B) = P(A|B) * P(B) = P(B|A) * P(A), so
P(A|B) = P(B|A) * P(A) / P(B)
where P(A|B) is the posterior probability, P(B|A) the likelihood, P(A) the prior probability, and P(B) the marginal likelihood.
64. EXAMPLE: SPAM DETECTION
▪What is the probability that an email is spam when the word “Viagra” appears inside it?
65. EXAMPLE: SPAM DETECTION
▪Construct a frequency table.
Record the number of times Viagra appeared in spam and ham messages.
(Frequency table, reconstructed from the counts used here and on the next slide: 20 of 100 messages are spam; Viagra appears in 4 spam and 1 ham message.)
▪Observation:
P(Viagra=Yes|spam) = 4/20 = 0.20 indicates a 20% chance that a spam message contains the word Viagra.
66. EXAMPLE: SPAM DETECTION
▪Since P(A ∩ B) = P(B|A) * P(A), P(spam ∩ Viagra) = P(Viagra|spam) * P(spam) = 4/20 * 20/100 = 0.04
P(spam ∩ Viagra) == P(spam=yes ∩ Viagra=yes)
▪So, P(spam|Viagra) = P(Viagra|spam) * P(spam) / P(Viagra) = (4/20) * (20/100) / (5/100) = 0.80
So there is an 80% chance that a message is spam, given that it contains the word Viagra.
67. NAÏVE BAYES ALGORITHM
▪Applies Bayes’ theorem to classification problems.
Simple, fast, effective
Works well with large numbers of examples
Easy to understand and explain
Assumes events are independent, which might not truly hold
68. NAÏVE BAYES WITH IRIS DATASET
▪Tutorial
http://rischanlab.github.io/NaiveBayes.html
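A minimal sketch in the spirit of that tutorial, assuming the e1071 package:

  library(e1071)                                   # provides naiveBayes()
  model <- naiveBayes(Species ~ ., data = iris)    # estimate class-conditional probabilities
  pred <- predict(model, iris[, 1:4])              # classify the examples
  table(pred, iris$Species)                        # confusion matrix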
69. FORECASTING NUMERIC DATA
▪Mathematical relationships help us understand aspects of everyday life.
Body weight is a function of one’s calorie intake.
In more detail, 250 kcal consumed results in nearly a kilogram of weight.
▪This might not perfectly fit every situation, but it can be reasonably correct.
70. REGRESSION
▪Regression is concerned with specifying the relationship between a dependent variable and independent variables.
The variables are numeric.
▪We use the independent variables to predict the dependent variable.
▪There are many forms of regression; the simplest form is a straight line,
i.e., linear regression.
72. LINEAR REGRESSION
▪Simple linear regression: a single independent variable.
▪Multiple linear regression: multiple independent variables.
a.k.a. multiple regression
73. SIMPLE LINEAR REGRESSION EXAMPLE
▪Distress events (dependent) vs. temperature (independent)
y = 3.70 − 0.048x
74. SIMPLE LINEAR REGRESSION
▪How do we find the best a and b in y = a + bx?
Ordinary least squares (OLS): the slope and intercept are chosen to minimize the sum of the squared errors:
minimize Σᵢ (yᵢ − (a + b·xᵢ))²
76. SIMPLE LINEAR REGRESSION
▪The variance of x, i.e., Var(x), can be expressed as
Var(x) = (1/n) Σᵢ (xᵢ − x̄)²
▪The covariance of x and y, i.e., Cov(x, y), can be expressed as
Cov(x, y) = (1/n) Σᵢ (xᵢ − x̄)(yᵢ − ȳ)
▪So
b = Cov(x, y) / Var(x),  a = ȳ − b·x̄
77. LINEAR REGRESSION WITH IRIS DATASET
▪Tutorial:
http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/r/iris_lm/
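A minimal sketch of simple linear regression on iris in base R; the particular pair of features is an illustrative choice:

  fit <- lm(Petal.Width ~ Petal.Length, data = iris)   # OLS fit: y = a + b·x
  coef(fit)                                            # intercept a and slope b
  plot(iris$Petal.Length, iris$Petal.Width)            # scatter plot of the data
  abline(fit)                                          # overlay the fitted line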
78. OTHER REGRESSIONS
▪Logistic regression: models a binary categorical outcome
▪Poisson regression: models integer count data
▪Multinomial logistic regression: models a categorical outcome
79. ASSOCIATION RULES
▪Market basket analysis = barcodes + inventory systems + personalized shopping profiles
▪Itemset = a group of items people bought together
{bread, peanut butter, jelly}
▪The result of market basket analysis is a set of association rules.
Patterns found in the relationships among items in itemsets.
{peanut butter, jelly} -> {bread}
80. ASSOCIATION RULES
▪The Apriori algorithm, introduced in 1994, is an efficient algorithm for finding association rules.
Works with large amounts of data
Resulting rules are easy to understand
Not good for small data sets
82. APRIORI ALGORITHM: SUPPORT AND CONFIDENCE
▪Support measures how frequently an itemset occurs in the data: support(X) = count(X) / N.
support({get well card, flowers}) = 3/5 = 0.6
support({flowers} -> {get well card}) = 3/5 = 0.6
support({flowers}) = 4/5 = 0.8
▪Confidence measures predictive power/accuracy: confidence(X -> Y) = support(X ∪ Y) / support(X).
confidence({flowers} -> {get well card}) = 0.6/0.8 = 0.75
83. APRIORI PRINCIPLE: BUILDING A SET OF RULES
▪Identify all the itemsets that meet a minimum support threshold.
▪Create rules from these itemsets, keeping those that meet a minimum confidence threshold.
84. ASSOCIATION RULES WITH GROCERIES DATASET
▪Tutorial:
http://www.salemmarafi.com/code/market-basket-analysis-with-r/
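A minimal sketch in the spirit of that tutorial, assuming the arules package and its built-in Groceries data set; the thresholds are illustrative:

  library(arules)                                        # provides apriori()
  data(Groceries)                                        # grocery transaction data
  rules <- apriori(Groceries,
                   parameter = list(support = 0.01,      # minimum support threshold
                                    confidence = 0.5))   # minimum confidence threshold
  inspect(head(sort(rules, by = "lift"), 3))             # top three rules by lift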
85. CLUSTERING
▪Clustering is an unsupervised learning task that automatically divides the data into clusters.
No need to be provided with classes as training output.
▪Data can be grouped (clustered) by the similarity (patterns) of their features.
86. K-MEANS CLUSTERING ALGORITHM
▪The most commonly used and well-studied clustering algorithm.
Simple to learn
Might not find optimal clusters
Requires guessing the value of k
▪K-means assigns each example to one of k clusters.
Minimizes the differences within each cluster and maximizes the differences between clusters.
87. K-MEANS CLUSTERING
1. Choose k marker points randomly.
2. Calculate the distance from each example to each marker point; assign the example to the closest marker point.
3. Update each marker’s location by calculating the centroid of its group of examples.
4. Repeat steps 2–3 until stable.
90. K-MEANS WITH THE IRIS DATA SET
▪Tutorial: http://rischanlab.github.io/Kmeans.html
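A minimal sketch in the spirit of that tutorial, using base R’s kmeans(); k = 3 matches the three iris species:

  set.seed(1)
  km <- kmeans(iris[, 1:4], centers = 3)    # runs the steps above until stable
  table(km$cluster, iris$Species)           # compare clusters to the true classes
  plot(iris$Petal.Length, iris$Petal.Width,
       col = km$cluster)                    # visualize the clusters on two features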