Machine Learning Introduction

1. Machine Learning Introduction
guodong@hulu.com
Topics: machine learning introduction; logistic regression; feature selection; boosting, tree boosting
See more ML posts: http://dongguo.me/
4. Learning
• What is learning
  – Find rules from data/experience
• Why learning is possible
  – Assume rules exist in this world
• How to learn
  – Inductive
5. What is machine learning
• "Machine Learning is a field of study that gives computers the ability to learn without being explicitly programmed" – Arthur Samuel (1959)
• "Machine learning is the study of computer algorithms that improve automatically through experience" – Tom Mitchell (1998)
9. Concepts
• Pipeline: problem → generate dataset → train → predict/test
• Dataset: samples/instances, each a feature vector with a label
• Train a model on the dataset; predict on test data
• Model tuning and feature selection refine the pipeline
10. What is Supervised learning
• Find a function (from some function space) to predict for unseen instances, from the labeled training data
  – Function space: determined by the chosen model
  – Find the function: minimize error on training data with some cost function
• 2 types: classification and regression
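As a sketch of this definition, here is a minimal supervised learner in plain Python: a 1-nearest-neighbor classifier (chosen here only for illustration, not the slide's choice). The labeled training data itself defines the function used to predict unseen instances.

```python
# A minimal supervised learner: 1-nearest-neighbor.
# The "function space" is all functions of the form "label of the closest
# training point"; training is just memorizing the labeled data.

def predict_1nn(train, x):
    """train: list of (feature_vector, label); x: unseen feature vector."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    _, label = min(train, key=lambda s: dist2(s[0], x))
    return label

train = [([0.0, 1.0], "neg"), ([1.0, 1.0], "neg"),
         ([2.0, 0.0], "pos"), ([3.0, 0.0], "pos")]
print(predict_1nn(train, [2.5, 0.2]))  # -> pos
```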
11. Formal definition
• Given a training dataset $\{x_i, y_i\}_{i=1}^N$
• And define a loss function $L(y, \hat{y})$, where $\hat{y} = f(x)$
• Target:
  $\hat{f}(x) = \arg\min_f G(f)$, s.t. $G(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$
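The empirical objective $G(f)$ can be computed directly from this definition; a sketch using squared loss as the cost function (the loss choice is an assumption for illustration):

```python
# Empirical risk G(f) = (1/N) * sum_i L(y_i, f(x_i)), with squared loss
# L(y, yhat) = (y - yhat)**2 and candidate functions f to compare.

def empirical_risk(f, xs, ys, loss=lambda y, yhat: (y - yhat) ** 2):
    return sum(loss(y, f(x)) for x, y in zip(xs, ys)) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 1.9, 4.2, 5.8]
print(empirical_risk(lambda x: 2 * x, xs, ys))  # risk of candidate f(x) = 2x
print(empirical_risk(lambda x: x, xs, ys))      # a worse candidate scores higher
```

Learning then amounts to searching the chosen function space for the $f$ with the smallest value of this quantity.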
12. Models for supervised learning
• Classification and regression
  – For classification: LR (logistic regression), Naïve Bayes
  – For regression: linear regression
  – For both: trees, KNN, SVM, ANN
• Generative and discriminative
  – Generative: Naïve Bayes, GMM, HMM
  – Discriminative: KNN, LR, SVM, ANN, trees
• Parametric and nonparametric
  – Parametric: LR, Naïve Bayes, ANN
  – Nonparametric: trees, KNN, kernel methods
13. Decision Tree
• Would you like to date somebody? Decide with a tree of feature tests (slide diagram):
  – Root node: Gender (male / female)
  – Next node: Good looking? with leaf decisions such as "Yes! Pass", "No! umm.. Pass", "Others… Accept", "Very good → Accept", "else → Pass"
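A decision tree like this is just nested feature tests ending in leaf decisions; a plain-Python sketch in the spirit of the slide's dating example (the exact branches and feature names here are hypothetical, not a faithful copy of the diagram):

```python
# A decision tree as a nested dict: internal nodes name a feature and map
# feature values to subtrees; strings are leaf decisions.
tree = {"feature": "gender",
        "male":   {"feature": "looks",
                   "very good": "Accept", "else": "Pass"},
        "female": {"feature": "looks",
                   "good": "Accept", "else": "Pass"}}

def predict(node, instance):
    if isinstance(node, str):      # leaf: a decision
        return node
    value = instance[node["feature"]]
    child = node[value] if value in node else node["else"]
    return predict(child, instance)

print(predict(tree, {"gender": "male", "looks": "very good"}))  # -> Accept
print(predict(tree, {"gender": "male", "looks": "average"}))    # -> Pass
```

Learning algorithms such as CART build these tests automatically by choosing, at each node, the feature split that best separates the labels.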
19. Model Inference
• Typical inference methods
  – Gradient descent
  – Expectation Maximization
  – Sampling based
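The first of these, gradient descent, fits in a few lines: repeatedly step against the gradient of the cost. A sketch on the toy objective $G(w) = (w - 3)^2$ (my example, not the slide's):

```python
# Gradient descent: w <- w - lr * dG/dw, iterated until (near) convergence.
def gradient_descent(grad, w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Minimize G(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(round(w, 4))  # -> 3.0, the minimizer
```

The same loop drives model training once `grad` is the gradient of the training cost with respect to the model parameters.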
20. Model ensemble
• Averaging or voting the output of multiple classifiers
• Bagging (bootstrap aggregating)
  – Train multiple base models
  – Vote multiple base classifiers with the same weight
  – Improves model stability and avoids overfitting
  – Works well on unstable base classifiers
• Adaboost (adaptive boosting)
  – Sequential base classifiers
  – Misclassified instances get higher weight in the next base classifier
  – Weighted voting
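The bagging recipe above can be sketched directly: train base models on bootstrap resamples, then take an equal-weight majority vote. The base learner here is a trivial one-feature threshold "stump", an assumption for illustration only.

```python
# Bagging sketch: bootstrap resamples + equal-weight majority vote.
import random
from collections import Counter

def train_stump(data):
    """data: list of (x, y) with scalar x and label y in {0, 1}.
    Pick the threshold t minimizing errors of the rule 'predict 1 if x >= t'."""
    xs = sorted(x for x, _ in data)
    best, best_err = xs[0], len(data) + 1
    for t in xs:
        err = sum((x >= t) != y for x, y in data)
        if err < best_err:
            best, best_err = t, err
    return lambda x: int(x >= best)

def bagging(data, n_models=25, seed=0):
    rng = random.Random(seed)
    models = [train_stump([rng.choice(data) for _ in data])
              for _ in range(n_models)]
    def vote(x):  # same weight for every base classifier
        return Counter(m(x) for m in models).most_common(1)[0][0]
    return vote

data = [(0.1, 0), (0.4, 0), (0.35, 0), (0.8, 1), (0.9, 1), (0.65, 1)]
clf = bagging(data)
print(clf(0.2), clf(0.85))
```

Adaboost differs in the two ways the slide lists: resampling is replaced by reweighting toward previously misclassified instances, and the final vote is weighted by each base classifier's accuracy.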
21. Evaluation metrics
• Common metrics for classification
  – Accuracy
  – Precision-Recall
  – AUC
• For regression
  – Mean absolute error (MAE)
  – Mean square error (MSE), RMSE
22. Question 1: How to choose a suitable model?

Characteristic                            NB   Trees  KNN  LR   ANN  SVM
Natural handling of data of "mixed" type  1    3      1    1    1    1
Robustness to outliers in input space     3    3      3    3    1    1
Computational scalability                 3    3      1    3    1    1
Interpretability                          2    2      1    2    1    1
Predictive power                          1    1      3    2    3    3

(3 = good, 2 = fair, 1 = poor; NB = Naïve Bayes, KNN = K nearest neighbor, LR = logistic regression, ANN = neural networks. Adapted from <Elements of Statistical Learning> II, p. 351)
23. Question 2: Can we find a 100% accurate model?
• Expected risk
• Empirical risk
• Choose a family for candidate prediction functions
• Error
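The two risks above can be written out explicitly, reusing the notation of the formal definition on slide 11; this spelling-out and the error decomposition below are the standard textbook argument, not taken verbatim from the slide:

```latex
% Expected risk: average loss over the true (unknown) data distribution P
R(f) = \mathbb{E}_{(x,y)\sim P}\left[ L(y, f(x)) \right]
% Empirical risk: average loss on the N training samples
R_{\mathrm{emp}}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))
```

We can only minimize $R_{\mathrm{emp}}$ over a chosen function family, so the error splits into approximation error (the family may not contain the best possible $f$) and estimation error (finite $N$ makes $R_{\mathrm{emp}}$ an imperfect proxy for $R$); with label noise there is also an irreducible Bayes error. This is why a guaranteed 100% accurate model is generally unattainable.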
24. Case study: Predicting Demographics
• Problem: is it an ML problem? What kind? Labels? Evaluation metric? Possible features (show, ad vote, ad selection, search…)? Accessible?
• Dataset generation: load login profiles; feature extraction ('show', 'ad vote', 'ad selection'); feature analysis (remove 'ad selection')
• Choose a model: 1. Familiar? (NB, ANN, LR, Tree, SVM) 2. Computational cost? Interpretability? Precision? 3. Data: amount? noise ratio?
• Train: try more features (add 'OS', 'browser', 'flash'); feature selection (remove 'flash' and non-anonymous features); try more models; tuning
• Test: evaluation (AUC, precision-recall); challenges (noise, different joint distribution, evaluation); model ensemble
• Predictor on product: scoring, online update
25. Challenges in Machine learning
• Data
  – Sparse data in high dimensions
  – Limited labels
• Computation cost
  – Speed up advanced models
  – Parallelization
• Application
  – Structured prediction
28. Books
• Machine Learning [link] by Mitchell
• Pattern Recognition and Machine Learning [link] by Bishop
• The Elements of Statistical Learning [link]
• Scaling Up Machine Learning [link]
29. Lectures
• Machine Learning open class – by Andrew Ng
  – Video on YouTube
• Advanced topics in Machine Learning – Cornell
• http://videolectures.net/
30. Other research resources
• Research organizations
  – Yahoo Research [link]
  – Google Research publications [link]
• Datasets
  – UCI Machine Learning Repository [link]
  – kaggle.com
Unsupervised learning (clustering, dimensionality reduction, e.g. topic models): learn structure from unlabeled data. Closely related to density estimation; it summarizes the data. Semi-supervised learning: use both labeled and unlabeled samples for training; collecting many labels is sometimes costly, so use both.
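To make the unsupervised case concrete, here is a tiny k-means clustering sketch on 1-D data (my illustration; the notes only name clustering in general): structure emerges from unlabeled points alone.

```python
# k-means on unlabeled 1-D data: alternate assigning points to the nearest
# center and moving each center to the mean of its assigned points.
def kmeans_1d(xs, centers, iters=20):
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for x in xs:
            nearest = min(centers, key=lambda c: abs(x - c))
            clusters[nearest].append(x)
        centers = [sum(pts) / len(pts) if pts else c
                   for c, pts in clusters.items()]
    return sorted(centers)

data = [0.9, 1.1, 1.0, 4.8, 5.2, 5.0]
print(kmeans_1d(data, centers=[0.0, 10.0]))  # centers converge near 1.0 and 5.0
```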
Logistic regression is one of the most popular classifiers. Advantages: 1. easy to understand and implement; 2. decent performance; 3. lightweight, with little time needed for training and prediction (can handle large datasets); 4. easy to parallelize. Value to attendees: know what logistic regression is, its advantages and disadvantages, and what kinds of problems it is suited to; L1 and L2 regularization; how to do inference by maximizing the likelihood with gradient descent, and how to implement it.
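The implementation the notes describe, gradient descent on the negative log-likelihood with L2 regularization, fits in a short from-scratch sketch (hyperparameters and toy data are my choices; a production system would use an optimized library):

```python
# Logistic regression via batch gradient descent with L2 regularization.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(X, y, lr=0.5, steps=2000, l2=0.01):
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            # gradient of negative log-likelihood: (sigmoid(w.x + b) - y) * x
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * (gwj / n + l2 * wj) for wj, gwj in zip(w, gw)]
        b -= lr * gb / n   # bias is conventionally left unregularized
    return w, b

X, y = [[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1]
w, b = train_lr(X, y)
predict = lambda x: int(sigmoid(w[0] * x + b) > 0.5)
print([predict(v) for v in [0.5, 2.5]])  # -> [0, 1]
```

For L1 regularization the penalty gradient becomes `l2 * sign(wj)` instead of `l2 * wj` (with care at zero), which is what drives weights exactly to zero and gives sparse models.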
For a generalized linear model, if the response variable follows a binomial or multinomial distribution and the logit function is chosen as the link function, the model is logistic regression. The logistic function is the inverse of the logit function.
Link function: (1) a key component of the generalized linear model, extending linear regression to generalized linear models; (2) the argument of the link function's inverse ranges over (-∞, +∞), while if y follows a binomial distribution the response lies in the interval [0, 1]. The inverse of any continuous cumulative distribution function (CDF) can be used as the link, since the CDF's range is [0, 1].
Generalized linear model: a broad family of linear models, all built on a basic linear unit W*X (as in linear regression), with various link functions relating this linear unit to response variables of various distributions. It includes linear regression (normal distribution), logistic regression (binomial/multinomial distribution), and Poisson regression (Poisson distribution). For binomial/multinomial distributions we can also choose link functions other than the logit (a generalized form of logistic regression).
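The logit/logistic relationship the notes state can be checked in a few lines: the logit maps (0, 1) onto (-∞, +∞), and its inverse, the logistic function, maps any real value of the linear unit W*X back into (0, 1).

```python
# The logit link and its inverse, the logistic function.
import math

def logit(p):      # link: g(p) = log(p / (1 - p)), defined on (0, 1)
    return math.log(p / (1 - p))

def logistic(z):   # inverse link: g^{-1}(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

p = 0.8
print(logistic(logit(p)))                 # round-trips back to 0.8
print(logistic(-100), logistic(0), logistic(100))  # any real z lands in (0, 1)
```

Swapping `logistic` for another inverse CDF, e.g. the standard normal CDF, gives probit regression, one of the alternative links mentioned above.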