Hands-on Tutorial of
Machine Learning in Python
中央研究院資訊科學所資料洞察實驗室 張鈞閔
2017.07.26
1
Flow Chart of Predictive Modeling
2
[Flow chart: Raw Data → Preprocessing (feature extraction, normalization, feature selection, dimension reduction, sampling) → Training Dataset / Testing Dataset with Labels → Learning Algorithm → Learning and Evaluation (model selection, cross-validation, metrics, hyperparameter optimization) → Final Model → Prediction: predicted labels for New Data]
Today we will focus on …
Scikit-learn: a powerful Python library in the field of
machine learning 3
[The same flow chart, with the Preprocessing, Learning, Evaluation, and Prediction stages highlighted as the parts covered by scikit-learn]
Overview Scikit-learn: Machine Learning in Python
http://scikit-learn.org/dev/tutorial/machine_learning_map/index.html
4
The content of this lecture partially comes from
(1) https://github.com/jakevdp/sklearn_tutorial
(2) http://scikit-learn.org/stable/documentation.html
(3) https://blog.keras.io/
Let’s do it!
5
What is
Machine Learning?
About building programs
with tuneable parameters
that are adjusted
automatically to improve
their performance by
adapting to previously seen
data.
Categorize into 3 classes
 Supervised Learning
⇒ Data with explicit labels
 Unsupervised Learning
⇒ Data without labels
 Reinforcement Learning
⇒ Data with implicit labels
6
Supervised Learning
 In supervised learning, we have a dataset consisting of both
features (input variables) and labels (output variables)
 The task is to construct an estimator (model) which enables
us to predict the label of an instance given its set of features
 Two categories: Classification and Regression
 Classification: the label is discrete
 Regression: the label is continuous
 Split into training and testing datasets
7
Training and Testing Dataset
 To validate the generalization of a trained model, split
the whole dataset into training and testing
datasets
 Based on the training dataset, we fit and optimize
predictive models
 Use the testing dataset to evaluate the performance of the
trained model and ensure that it generalizes
8
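A minimal sketch of this split with scikit-learn's train_test_split (the dataset, split ratio, and random seed here are illustrative assumptions, not values from the slides):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# hold out 25% of the data as the testing dataset (illustrative ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)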
Generalization
 We hope that a model trained on a training dataset
applies seamlessly to the unseen testing dataset
 Generalization to unseen data
 If the model overfits the training dataset, its performance
on the testing dataset will be worse
 The higher the model complexity, the easier it is to overfit
9
Classification
 For example, k-Nearest Neighbor for iris
classification problem:
10
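A minimal sketch of such a k-NN classifier on the iris dataset, assuming scikit-learn's built-in loader; the value of n_neighbors is an illustrative choice:

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)    # illustrative k
knn.fit(X, y)
# predict the species for a new flower measurement
print(knn.predict([[5.0, 3.5, 1.4, 0.2]]))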
Common Metrics of Classification
 Confusion matrix
 Accuracy = \frac{TN + TP}{TN + TP + FN + FP}
11

              Predicted 0             Predicted 1
Actual 0      True Negative (TN)      False Positive (FP)
Actual 1      False Negative (FN)     True Positive (TP)
More Metrics
12
              Predicted 0             Predicted 1
Actual 0      True Negative (TN)      False Positive (FP)
Actual 1      False Negative (FN)     True Positive (TP)

Precision = \frac{TP}{TP + FP}

Recall = \frac{TP}{FN + TP}

F1 score = \frac{2 \cdot Recall \cdot Precision}{Recall + Precision}
13
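All of these metrics are available in sklearn.metrics; a minimal sketch with made-up label vectors (y_true and y_pred are illustrative):

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [0, 0, 1, 1, 1, 0]               # illustrative ground-truth labels
y_pred = [0, 1, 1, 1, 0, 0]               # illustrative predictions
print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))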
Regression
 For example, fit a line to the data
14
Common Metrics in Regression
 Mean absolute error (MAE)
 Mean square error (MSE)
 R squared
15
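These regression metrics are also in sklearn.metrics; a minimal sketch with illustrative target and prediction values:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = [3.0, -0.5, 2.0, 7.0]            # illustrative targets
y_pred = [2.5,  0.0, 2.0, 8.0]            # illustrative predictions
print(mean_absolute_error(y_true, y_pred))
print(mean_squared_error(y_true, y_pred))
print(r2_score(y_true, y_pred))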
Unsupervised Learning
 The data has no labels but we are interested in
(1) describing hidden structure from instances
(2) finding similarity among instances
 Unsupervised learning comprises tasks such as
dimensionality reduction and clustering
16
Dimensionality Reduction
 For example, conduct PCA to visualize the iris
dataset in two dimensions
17
Clustering
 For example, K-means clustering for data
compression
18
Scikit-learn’s Estimator Interface
 Available in all estimators
 model.fit() : fit training data, X: data, Y: label
Supervised: model.fit(X , Y)  Unsupervised: model.fit(X)
 Available in supervised estimators
 model.predict() : predict the label of a new set of data by model
 model.predict_proba() : for classification problems, some
estimators provide this method to return the probability of each class
 Available in unsupervised estimators
 model.transform() : transform new data into new basis by model
 model.fit_transform() : some estimators implement this method to
efficiently perform a fit and transform on the same input data 19
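A short sketch of this common interface on one supervised and one unsupervised estimator (the dataset and hyper-parameters are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

clf = LogisticRegression()         # supervised estimator
clf.fit(X, y)                      # model.fit(X, Y)
print(clf.predict(X[:3]))          # model.predict()
print(clf.predict_proba(X[:3]))    # model.predict_proba()

pca = PCA(n_components=2)          # unsupervised estimator
Z = pca.fit_transform(X)           # model.fit_transform()
print(Z.shape)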
Supervised
Learning In-depth
Do have the labels 
Model-based learning
 Linear regression
 Regression with Regularization
 Logistic regression
 Support vector machine
 Decision Tree
 Random Forests
 XGBoost
Instance-based learning
 Naive Bayesian model
 K-nearest neighbor (KNN)
20
Unsupervised
Learning In-depth
Have no idea about labels 
Dimension reduction
 Principal component analysis
Clustering
 K-means clustering
 Hierarchical clustering
 DBSCAN
21
Session Flow
1. Introduce the topic
2. Explain how to accomplish it in Python
3. Walk through examples (example/)
4. Hands-on practice (exercise/)
23
Supervised
Learning In-depth
Do have the labels 
Model-based learning
 Linear regression
 Regression with Regularization
 Logistic regression
 Support vector machine
 Decision Tree
 Random Forests
 XGBoost
Instance-based learning
 Naive Bayesian model
 K-nearest neighbor (KNN)
24
Supervised
Learning In-depth
Do have the labels 
Model-based learning
 Linear regression
 Regression with regularization
 Logistic regression
 Support vector machine
 Decision Tree
 Random Forests
 XGBoost
Instance-based learning
 Naive Bayesian model
 K-nearest neighbor (KNN)
25
Linear Regression
 Based on the assumption that a linear relationship
exists between input (𝑥𝑖) and output variables (y)
 Linear models are still powerful because input
variables can be arbitrarily transformed
 For example, polynomial regression
26
y = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n
x_1 = v,\ x_2 = v^2,\ \ldots,\ x_n = v^n
Linear Model in Python
27
from sklearn.linear_model import LinearRegression, Lasso, Ridge

def example_regression(data, power, plot_dict, reg_type, alpha=0):
    # define estimator object
    type_dict = {'Linear': LinearRegression(normalize=True),
                 'Lasso' : Lasso(alpha=alpha, normalize=True),
                 'Ridge' : Ridge(alpha=alpha, normalize=True)}
    # generate X of different powers
    X = ['x']
    if power >= 2:
        X.extend(['x_%d' % i for i in range(2, power+1)])
    # fit the model
    if reg_type in type_dict:
        model = type_dict[reg_type]
        model.fit(data[X], data['y'])
        y_pred = model.predict(data[X])
Example
example/01_linear_model.ipynb
 Here we start from a toy example: fitting a sine curve
with additional noise
 Our goal is to estimate this function using polynomial
regression with powers of x from 1 to 15
28
x = np.array([i*np.pi/180 for i in range(60,300,4)])
np.random.seed(100)
y = np.sin(x) + np.random.normal(0,0.15,len(x))
Results (linear regression)
29
As the model complexity increases, the model tends
to fit ever smaller deviations in the training dataset.
When Model Complexity Increases
30
The magnitudes of the coefficients also increase
Generalization
 If the coefficient magnitudes are large, a small input
deviation leads to a large output deviation
 For example
31
[Diagram: two single-neuron examples that both output 10 for inputs (x1, x2) = (0.6, 0.4).
Left: w1 = w2 = 10, b = 0, so a +0.1 change in x1 shifts the output by +1.
Right: w1 = w2 = 1, b = 9, so the same +0.1 change shifts the output by only +0.1.]
For better generalization, we usually add weight regularization
Ridge Regression
 Perform L2-regularization, i.e. add penalty to the
square of the magnitude term into the cost function
32
Cost = Prediction error + \alpha \sum (\text{weight})^2
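A minimal Ridge sketch on illustrative synthetic data (the data and the alpha value are assumptions for illustration):

import numpy as np
from sklearn.linear_model import Ridge

X = np.random.rand(50, 3)                                  # illustrative features
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * np.random.randn(50)
ridge = Ridge(alpha=0.1)                                   # alpha controls the L2 penalty strength
ridge.fit(X, y)
print(ridge.coef_)                                         # shrunken coefficients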
Result (Ridge Regression)
33
Effectively reduce the model complexity
Weight Regularization
34
The magnitudes of the coefficients do not increase significantly
Lasso Regression
 Perform L1-regularization, i.e. add penalty to the
absolute value of the magnitude term into the cost
function
 LASSO stands for
Least Absolute Shrinkage and Selection Operator
35
Cost = Prediction error + \alpha \sum |\text{weight}|
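The Lasso counterpart, again on illustrative synthetic data (the data and alpha are assumptions); with a large enough alpha some coefficients become exactly zero:

import numpy as np
from sklearn.linear_model import Lasso

X = np.random.rand(50, 3)                                  # illustrative features
y = X @ np.array([1.0, 0.0, 3.0]) + 0.1 * np.random.randn(50)
lasso = Lasso(alpha=0.05)                                  # alpha controls the L1 penalty strength
lasso.fit(X, y)
print(lasso.coef_)                                         # some coefficients shrink to exactly 0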
Result (Lasso Regression)
36
α = 0.001
Effectively reduce the model complexity
Sparsity ⇒ Feature Selection
37
High sparsity
38
Why Lasso Leads to Sparsity
 L2 L1
𝑤2
𝑤1
𝑤1 + |𝑤2|
𝑤1
2
+ 𝑤2
2
𝑤2
𝑤1
When L1 regularization is adopted, the solution is more likely
to converge at a corner of the constraint region, where some weights are exactly zero
The Effect of α
 Change the magnitude of α to see what’s going on
 For Lasso regression, larger α leads to higher sparsity
39
Exercise (15 mins)
exercise/ex01_linear_model.ipynb
 Boston house price prediction
 Hints
 Consider nonlinear transformations of current features
 Regression with regularization
40
Short Summary
 Increasing model complexity can cause the weights to
explode, which may result in poor generalization
 Regularization helps limit the growth of model
complexity
 Lasso: add L1 regularization, especially beneficial to
select features due to the property of sparsity
 Ridge: add L2 regularization
41
Supervised
Learning In-depth
Do have the labels 
Model-based learning
 Linear regression
 Regression with Regularization
 Logistic regression
 Support vector machine
 Decision Tree
 Random Forests
 XGBoost
Instance-based learning
 Naive Bayesian model
 K-nearest neighbor (KNN)
42
Logistic Regression (LR)
 Logistic regression is a linear model for classification
 Does NOT require a linear relationship between the input
and output variables
 Can handle all sorts of relationships because a
non-linear logistic transformation is applied to the
prediction
 Output range = (0,1)
43
Mathematical Formulation of LR
 For a pair of input (x) and output (y), assume that the
probability model
 LR finds 𝑤 by maximizing the likelihood
44
P(y \mid x) = \frac{1}{1 + e^{-y w^T x}}

\max_w \prod_{i=1}^{k} P(y_i \mid x_i), \quad i = 1, \ldots, k \ (k \text{ training instances})
Formulation in Scikit-Learn
 Take log on likelihood
 Can add L1 or L2 regularization
45
\max_w \prod_{i=1}^{k} P(y_i \mid x_i) = \max_w \sum_{i=1}^{k} \log P(y_i \mid x_i)
= \max_w \sum_{i=1}^{k} -\log\left(1 + e^{-y_i w^T x_i}\right)
= \min_w \sum_{i=1}^{k} \log\left(1 + e^{-y_i w^T x_i}\right)
Regularized Logistic Regression
 For example, L2 penalized logistic regression
 Unlike the previous linear models, here the weight of the
error term is adjusted by C
 Solvers: liblinear, lbfgs, newton-cg, sag
46
\min_w \ \frac{1}{2} w^T w + C \sum_{i=1}^{k} \log\left(1 + e^{-y_i w^T x_i}\right)
Solver Selection
 Only the liblinear solver supports L1 penalization
47
Case                            Solver
Small dataset or L1 penalty     liblinear
Multinomial or large dataset    lbfgs, sag or newton-cg
Very large dataset              sag
Logistic Regression in Python
49
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(
    # {'newton-cg', 'lbfgs', 'liblinear', 'sag'}
    solver = 'liblinear',
    # 'l1' only for the liblinear solver
    penalty = 'l2',
    # {'ovr', 'multinomial'}
    multi_class = 'ovr',
    # smaller values, stronger regularization
    C = 1.0
)
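A hedged usage sketch of such an estimator, fitted on the iris dataset (the dataset choice is an illustrative assumption):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)                # illustrative dataset
logreg = LogisticRegression(solver='liblinear', penalty='l2', C=1.0)
logreg.fit(X, y)
print(logreg.predict(X[:5]))                     # predicted class labels
print(logreg.predict_proba(X[:5]))               # per-class probabilities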
Example
example/02_logistic_multiclass.ipynb
 Two options to address the multiclass problem:
multinomial and One-vs-Rest
example/02_logistic_iris.ipynb
 Change the value of C to observe the results
50
Exercise (15 mins)
exercise/ex02_logistic_regression.ipynb
 Classification of handwritten digits
51
Support Vector Machine (SVM)
 A powerful method for both classification and
regression
 Construct a hyper-plane or set of hyper-planes in a
high dimensional space
 The hyper-plane has the largest distance (margin) to
the nearest training data points of any class
 Different Kernel functions can be specified for the
decision function
52
An Example in 2D
53
Which line better separates the data points?
Intuition of SVM: Maximal Margin
54
The middle fit clearly has the largest margin
(perpendicular distance to the nearest data points)
Mathematical Formulation of SVM
55
The separating hyper-plane and the two margins: w^T x + b = 0, +1, -1

The decision function: f(x) = \mathrm{sign}(w^T x + b)

w^T x_i + b \ge +1, \ \text{if } y_i = +1
w^T x_i + b \le -1, \ \text{if } y_i = -1
\Rightarrow y_i (w^T x_i + b) \ge 1

Support vectors: the training points lying on the margins w^T x + b = \pm 1
Mathematical Formulation of SVM
 Distance between w^T x + b = 1 and w^T x + b = -1:

D = \frac{2}{\|w\|} = \frac{2}{\sqrt{w^T w}}

 Maximizing D is equivalent to minimizing w^T w
 The primal problem of the SVM
56

\min_{w,b} \ \frac{1}{2} w^T w \quad \text{subject to } y_i(w^T x_i + b) \ge 1, \ i = 1, \ldots, k
Not Always Linearly Separable
 Tolerate training error but try to minimize it
 Project to a high dimensional feature space
but 𝜙(𝑥) may be very complex
57
\min_{w,b} \ \frac{1}{2} w^T w + C \sum_{i=1}^{k} \varepsilon_i
\quad \text{subject to } y_i(w^T x_i + b) \ge 1 - \varepsilon_i, \ \varepsilon_i \ge 0, \ i = 1, \ldots, k

x \rightarrow \phi(x)
From The Dual Problem of SVM
 Use Lagrange relaxation to derive the dual problem
of SVM (skip the details here, please check [1])
 The dual problem of SVM
58
[1] http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/SVM2.pdf

\text{maximize } L_D = \sum_{i=1}^{k} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \phi(x_i)^T \phi(x_j)
\quad \text{subject to } \sum_{i=1}^{k} \alpha_i y_i = 0, \ \alpha_i \ge 0

\phi(x_i)^T \phi(x_j) is the kernel function k(x_i, x_j), which may be easier to compute than \phi(x_i) itself
Radial Basis Function (RBF)
 The RBF kernel on two feature vectors 𝑥𝑖 and 𝑥𝑗, is
defined as
 The feature space of kernel has an infinite number of
dimensions
59
k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)
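A tiny numeric sketch of this kernel; the vectors and the bandwidth σ are illustrative, and scikit-learn's rbf_kernel takes gamma = 1/(2σ²):

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

xi = np.array([[1.0, 2.0]])                                 # illustrative feature vectors
xj = np.array([[2.0, 0.0]])
sigma = 1.0                                                 # illustrative bandwidth
gamma = 1.0 / (2 * sigma ** 2)
print(rbf_kernel(xi, xj, gamma=gamma))                      # exp(-||xi - xj||^2 / (2 sigma^2))
print(np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2)))   # same value computed by hand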
Example
 Transform by the radial basis function
60
[Diagram: 1-D data x lifted to (x, φ(x)) by a radial basis function, making the two classes linearly separable.]
Linear kernel versus RBF kernel
61
Linear kernel
RBF kernel
SVM in Python
62
from sklearn import svm

clf = svm.SVC(
    # {'linear', 'poly', 'rbf'}
    kernel = 'linear',
    # used only if kernel='poly'
    degree = 3,
    # enable predict_proba()
    probability = True,
    # smaller values, stronger regularization
    C = 1.0
)
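A hedged usage sketch on the iris dataset showing the fitted support vectors discussed next (the dataset choice is an illustrative assumption):

from sklearn import svm
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)                # illustrative dataset
clf = svm.SVC(kernel='linear', C=1.0, probability=True)
clf.fit(X, y)
print(clf.support_vectors_.shape)                # the support vectors themselves
print(clf.n_support_)                            # number of support vectors per class
print(clf.predict_proba(X[:3]))                  # enabled by probability=True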
Example - Support Vector
example/04_svm_support_vector.ipynb
 Support vectors are, in fact, the subset of training
points that define the decision function
 Obtain class probabilities by setting probability = True
63
Example - Kernel Tricks
example/04_svm_kernel_trick.ipynb
 ‘linear’: linear kernel
 ‘poly’: polynomial kernel function (controlled by the
parameter degree)
 ‘rbf’: radial basis function (default setting)
 Project to an infinite feature space
64
Exercise (15 mins)
exercise/ex03_svm_iris.ipynb
 Try to visualize the decision functions
 Try to circle out the support vectors
 Try to compare different kernel functions
 Apply SVM on the handwritten digits dataset in ex02
65
Supervised
Learning In-depth
Do have the labels 
Model-based learning
 Linear regression
 Regression with Regularization
 Logistic regression
 Support vector machine
 Decision Tree
 Random Forests
 XGBoost
Instance-based learning
 Naive Bayesian model
 K-nearest neighbor (KNN)
66
Decision Tree
 Used for both classification and regression
 Built by a series of splits, where each split is a
binary cut on one dimension
 Each split should maximize the information gain
 In practice, decision trees use an impurity measure instead
 Maximizing information gain
= minimizing the impurity measure
67
Decision Tree
 Simple to interpret and visualize
 No need for data normalization
68
[Diagram: a toy decision tree — root: Height > 2 meters? (Y/N), with child splits Two legs? and Height < 1 meter?]
Splitting Criteria
 Assume classes 1, 2, …, C and a current node t
 For a split (binary cut), maximize the gain
Gain = I(parent) − w_L · I(left child) − w_R · I(right child),
where w_L and w_R are the fractions of samples sent to each child
69

P(i \mid t) = \frac{\#\text{ of class } i \text{ samples at node } t}{\#\text{ of total samples at node } t}

Gini impurity: I_G(t) = \sum_{i=1}^{C} P(i \mid t)\,(1 - P(i \mid t))

Entropy: I_E(t) = -\sum_{i=1}^{C} P(i \mid t)\,\log P(i \mid t)
For example (Gini)
70

Split A: parent (20, 20) → left (10, 10), right (10, 10)
I_P = \frac{1}{2} \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{2} = 0.5
I_L = I_R = \frac{1}{2} \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{2} = 0.5
Gain = \frac{1}{2} - \frac{1}{2} \cdot \frac{1}{2} - \frac{1}{2} \cdot \frac{1}{2} = 0

Split B: parent (20, 20) → left (20, 10), right (0, 10)
I_P = \frac{1}{2} \cdot \frac{1}{2} + \frac{1}{2} \cdot \frac{1}{2} = 0.5
I_L = \frac{2}{3} \cdot \frac{1}{3} + \frac{1}{3} \cdot \frac{2}{3} = \frac{4}{9}
I_R = 0 \cdot 1 + 1 \cdot 0 = 0
Gain = \frac{1}{2} - \frac{3}{4} \cdot \frac{4}{9} - \frac{1}{4} \cdot 0 = \frac{1}{6}
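A small sketch that reproduces the Gini calculation above in Python (the helper functions are ours for illustration, not part of the slides or scikit-learn):

import numpy as np

def gini(counts):
    # Gini impurity of a node given its class counts
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return float(np.sum(p * (1 - p)))

def gain(parent, left, right):
    # impurity of the parent minus the weighted impurities of the children
    n_p, n_l, n_r = sum(parent), sum(left), sum(right)
    return gini(parent) - (n_l / n_p) * gini(left) - (n_r / n_p) * gini(right)

print(gain([20, 20], [10, 10], [10, 10]))   # 0.0
print(gain([20, 20], [20, 10], [0, 10]))    # 1/6 ≈ 0.1667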
Feature Importance
 The importance of a feature is computed as the total
reduction of the splitting criterion (impurity) brought by that feature
 Sum of importance of all features is equal to 1
 In practice, we often use feature importance
provided by a tree model to rank and select features
71
Decision Tree in Python
72
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    criterion = 'gini',
    max_depth = None,
    min_samples_split = 2,
    min_samples_leaf = 1,
)
# feature importance
clf.feature_importances_
Example
example/05_decision_tree.ipynb
 criterion = {'gini', 'entropy'}
 max_depth = the maximum depth of the tree
 If None, nodes are expanded until all leaves are pure or
until all leaves contain fewer than min_samples_split samples
 min_samples_split: the minimum number of samples
required to split an internal node
 min_samples_leaf: the minimum number of samples
required to be at a leaf node 73
Tree Ensemble
 Aggregate multiple weak learners into a strong learner
 This strong learner should generalize better
and prevent overfitting
 Voting mechanism (classification) or averaging
of all prediction values (regression) 74
[Diagram: predictions from Learner 1 … Learner n are combined into one final prediction]
Ensemble Method - Bagging
 Sample a subset of the training dataset to train a
weak learner
 Each learner is independent and this bagging process
can be parallelized
 Random Forests
75
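A minimal Random Forests sketch, assuming the iris dataset and an illustrative number of trees:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100,   # illustrative number of trees
                            max_depth=None,
                            random_state=0)
rf.fit(X, y)
print(rf.feature_importances_)                  # importances averaged over the trees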
Ensemble Method - Boosting
 Instead of sampling the input features, sample the
training data
 At each iteration, the data that is wrongly classified
will have its weight increased
 AdaBoost, Gradient Boosting Tree
76
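A minimal boosting sketch with scikit-learn's GradientBoostingClassifier (the dataset and hyper-parameter values are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_iris(return_X_y=True)
gbt = GradientBoostingClassifier(n_estimators=100,   # number of boosting stages
                                 learning_rate=0.1,  # shrinks each stage's contribution
                                 max_depth=3)
gbt.fit(X, y)
print(gbt.score(X, y))                               # training accuracy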
Decision Tree and Random Forests
example/06_randomforests_classification.ipynb
77
Grid Search of Hyper-parameter
78
from sklearn.model_selection import GridSearchCV

parameters = {'max_depth': [3, 5, 7, 9, 11],
              'criterion': ['gini', 'entropy']}
clf = DecisionTreeClassifier()
clf_grid = GridSearchCV(clf, parameters)
clf_grid.fit(X, Y)
Gradient Boosting Tree
example/07_gradient_boosting_regression.ipynb
 Compare among decision tree, random forests, and
gradient boosting tree
 Apply on a regression problem
 Grid search
79
Exercise
exercise/ex04_random_forest_digits_classification.ipynb
80
XGBoost
 Similar to Gradient Boosting Tree
 Adds regularization to the loss function
 Uses 1st and 2nd derivatives of the loss in learning
 Adopts feature sampling (like random forests)
81
Installation
 Windows: http://www.jianshu.com/p/5b3e0489f1a8
 Mac/Linux: pip install xgboost
82
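Once xgboost is installed, a hedged sketch with its scikit-learn-style wrapper XGBClassifier (the dataset and hyper-parameter values are illustrative assumptions):

from sklearn.datasets import load_iris
from xgboost import XGBClassifier

X, y = load_iris(return_X_y=True)
xgb = XGBClassifier(n_estimators=100,        # boosting rounds
                    max_depth=3,
                    learning_rate=0.1,
                    colsample_bytree=0.8)    # feature sampling, as in random forests
xgb.fit(X, y)
print(xgb.predict(X[:5]))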
Example
example/11_xgboost.ipynb
83
Unsupervised
Learning In-depth
Have no idea about labels 
Dimension reduction
 Principal Component
Analysis (PCA)
Clustering
 K-means clustering
 Hierarchical clustering
 DBSCAN
84
Principal Component Analysis (PCA)
 PCA uses an orthogonal transformation to convert data
with possibly correlated variables into a set
of values of linearly uncorrelated variables
 Each new variable is a linear combination of the
original variables
85
Linear Combination of Variables
 The original variables are denoted x_1, x_2, \ldots, x_n, and the
new (uncorrelated) variables can be represented as
86

z_1 = a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n
z_2 = a_{21} x_1 + a_{22} x_2 + \cdots + a_{2n} x_n
\vdots
z_n = a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n
Covariance Matrix (symmetric)
 Can be viewed as z = A x, with

A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}

 Transformed, uncorrelated variables imply that the
covariance matrix of z is diagonal
87

\Sigma_z = E(z z^T) = A\,E(x x^T)\,A^T = A \Sigma_x A^T = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2)
Assumption on the Projection Matrix
 Assume that the projection matrix A is orthonormal
 A^T A = I (identity matrix)
 Spectral theorem: a real symmetric matrix can be
"diagonalized" by an orthogonal matrix
88

\Sigma_x = A^T \Sigma_z A = A^T \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2)\,A
Approximating the Covariance Matrix
 Rank the orthogonal eigenvectors according to their
eigenvalues
 The eigenvector with the largest eigenvalue is the
first principal component
 Use top principal components to approximate the
original covariance matrix
89
PCA in Python
90
from sklearn.decomposition import PCA

pca = PCA(
    n_components = 2,
    whiten = True,
    svd_solver = 'auto'
)
# principal axes in feature space
pca.components_
# explained variance
pca.explained_variance_ratio_
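A hedged usage sketch reducing the iris data to two components (the dataset choice is an illustrative assumption):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
pca = PCA(n_components=2, whiten=True)
Z = pca.fit_transform(X)               # project onto the top-2 principal axes
print(pca.explained_variance_ratio_)   # variance explained by each component
print(Z.shape)                         # (150, 2)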
Example
example/07_PCA.ipynb
 An example of 2D data
 Visualize PCA components
 Determine the number of components
91
Exercise
exercise/ex05_PCA.ipynb
92
Unsupervised
Learning In-depth
Have no idea about labels 
Dimension reduction
 Principal Component Analysis
Clustering
 K-means clustering
 Hierarchical clustering
 DBSCAN
93
K-Means Clustering
 Unsupervised clustering based on “distance” among
data points (usually, squared Euclidean distance)
 The goal is to minimize the within-cluster sum of
squared errors (cluster inertia)
 Should be careful of the range of each feature
94
Procedure of K-means Clustering
1. Randomly pick K data points as initial cluster centers
2. Assign other data points to the closest centers
3. Re-calculate the center of each cluster
4. Repeat steps 2 and 3 until the assignments of all data
points are unchanged or a stopping criterion is
satisfied (for example, the maximal number of iterations)
95
K-means Clustering in Python
 n_init = 10 means K-means clustering will be run 10
times with different initial centers, and the
clustering with the minimal SSE is adopted
96
from sklearn.cluster import KMeans

kmeans = KMeans(
    n_clusters = 3,
    init = 'random',
    n_init = 10,
    max_iter = 300,
    tol = 1e-4
)
How to determine the best K
 Summation of the within-cluster sum of squared errors over
all clusters
 Stored in attribute inertia_
97
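A sketch of this elbow heuristic using the inertia_ attribute (the data and the range of K are illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)                    # illustrative data
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)                     # look for the "elbow" where inertia flattens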
Silhouette Analysis
 Used to quantify the quality of clustering
 For every data point:
1. Calculate the average distance to all points in the
same cluster, a
2. Calculate the average distance to all points in the
nearest neighboring cluster, b
3. Calculate the silhouette coefficient, s
98

s = \frac{b - a}{\max(b, a)}
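scikit-learn provides this as silhouette_score; a minimal sketch on illustrative data with an illustrative cluster count:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 2)                     # illustrative data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))             # mean silhouette coefficient over all points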
99
Quantification of Clustering Quality
[Silhouette plots for K=2 and K=3]
Example
example/09_kmeans_clustering.ipynb
100
Exercise
exercise/kmeans_clustering_color_compression.ipynb
 Use MiniBatchKMeans instead for efficiency
101
Unsupervised
Learning In-depth
Have no idea about labels 
Dimension reduction
 Principal Component Analysis
Clustering
 K-means clustering
 Hierarchical clustering
 DBSCAN
102
Hierarchical Clustering
 No need to specify the number of clusters in advance
 Procedures
1. Each data point is viewed as an individual cluster
2. Calculate the distance between any two clusters
3. Merge two closest clusters into one cluster
4. Repeat 2 and 3 until all data points are merged into
one cluster
103
Agglomerative clustering
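A minimal agglomerative clustering sketch (the linkage choice and the data are illustrative assumptions); scipy's dendrogram utilities can visualize the merge tree as in the example notebook:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.rand(30, 2)                      # illustrative data
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)                    # cluster label for each point
print(labels)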
Example
example/10_hierarchical_clustering.ipynb
 A toy example to go through the procedure of
hierarchical clustering step by step
 Visualize the dendrogram in Python
104
Unsupervised
Learning In-depth
Have no idea about labels 
Dimension reduction
 Principal Component Analysis
Clustering
 K-means clustering
 Hierarchical clustering
 DBSCAN
105
DBSCAN
 Density-based Spatial Clustering of Applications with
Noise
 Density: number of samples in range of ε
106
ε
3 Types of Data Points
 A data point is a core point if there are at least
MinPts neighboring data points in range of ε
 A data point is a border point if it is in range of a core
point but has fewer than MinPts neighboring data
points itself
 A data point is a noise point if it is neither a core point
nor a border point
107
Procedure of DBSCAN
 Two core points are connected if the distance
between them is less than ε
 An individual core point, or a group of connected core points,
forms a cluster
 Assign each border point to the corresponding
cluster of its core point
108
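A minimal DBSCAN sketch (eps, min_samples, and the moon-shaped data are illustrative assumptions); noise points receive the label -1:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)   # illustrative moon-shaped data
db = DBSCAN(eps=0.2, min_samples=5)        # eps = ε, min_samples = MinPts
labels = db.fit_predict(X)
print(set(labels))                         # cluster labels 0, 1, ...; -1 marks noise points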
Example
 MinPts = 3
109
Noise point
Border point
Core point
Difference among Clustering Methods
 Unlike K-means clustering, DBSCAN does not
assume that clusters have spherical variance
 Unlike hierarchical clustering, DBSCAN allows
noise points to be ignored
110
Example
example/11_DBSCAN_and_review.ipynb
 Compare different clustering algorithms in a special
case of moon-like data distribution
111
Exercise
exercise/ex08_plot_cluster_comparison.ipynb
 Show characteristics of different clustering
algorithms on many interesting datasets
112
Recap: Supervised Learning
 Linear regression
 Regression with Regularization
 Logistic regression
 Support vector machine
 Decision Tree
 Random Forests
 XGBoost
126
Recap: Unsupervised Learning
Dimension reduction
 Principal component analysis
Clustering
 K-means clustering
 Hierarchical clustering
 DBSCAN
127
Thank you!
If any questions, please feel free to contact
cmchang@iis.sinica.edu.tw
中央研究院資訊科學所資料洞察實驗室
128