The performance of deep neural networks improves with more annotated data. The problem is that the budget for annotation is limited. One solution to this is active learning, where a model asks a human to annotate data that it perceives as uncertain. A variety of recent methods have been proposed to apply active learning to deep networks, but most of them are either designed specifically for their target tasks or computationally inefficient for large networks. In this paper, we propose a novel active learning method that is simple but task-agnostic, and works efficiently with deep networks. We attach a small parametric module, named "loss prediction module," to a target network, and train it to predict the target losses of unlabeled inputs. This module can then suggest data for which the target model is likely to produce a wrong prediction. The method is task-agnostic, as networks are learned from a single loss regardless of the target task. We rigorously validate our method on image classification, object detection, and human pose estimation with recent network architectures. The results demonstrate that our method consistently outperforms previous methods across the tasks.
Learning Loss for Active Learning

1. Learning Loss for Active Learning
Donggeun Yoo (Lunit), In So Kweon (KAIST)
CVPR 2019 (Oral presentation)
2. Introduction
• Annotated data: very important for deep learning
• There is no question that more data still improves network performance
  [Mahajan et al., ECCV'18]
  (tens of millions to a billion images)
11. Active Learning: Limitations
• Heuristic approach
  • Highest entropy [Joshi et al., CVPR'09]
  • Distance to decision boundaries [Tong & Koller, JMLR'01]
  (−) Task-specific design
• Ensemble approach [Freund et al., ML'97], [Beluch et al., CVPR'18]
  (−) Does not scale to large CNNs and data
• Bayesian approach
  • Expected error [Roy & McCallum, ICML'01] / expected model [Kapoor et al., ICCV'07]
  • Bayesian inference by dropout [Gal & Ghahramani, ICML'17]
  (−) Does not scale to large data and CNNs [Sener & Savarese, ICLR'18]
• Distribution approach
  • Density-based [Liu & Ferrari, ICCV'17], diversity-based [Sener & Savarese, ICLR'18]
  (−) Task-specific design
13. *Entropy
• An information-theoretic measure of the amount of information needed to "encode" a distribution
• The use of entropy in active learning:
  • Dense prediction (0.33, 0.33, 0.33) → maximum entropy
  • Sparse prediction (1.00, 0.00, 0.00) → minimum entropy
(+) Very simple but works well (also in deep networks)
(−) Specific to classification problems
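The dense-vs-sparse behavior above can be checked with a small NumPy sketch (a hypothetical `entropy` helper, not part of the paper's code):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution (in nats)."""
    p = np.asarray(p, dtype=float)
    # Ignore zero-probability entries: lim p->0 of p*log(p) is 0.
    nz = p[p > 0]
    return float(-np.sum(nz * np.log(nz)))

# Uniform ("dense") prediction -> maximum uncertainty
print(entropy([1/3, 1/3, 1/3]))  # log(3) ≈ 1.0986

# One-hot ("sparse") prediction -> minimum uncertainty
print(entropy([1.0, 0.0, 0.0]))  # 0.0
```

Active learning with entropy then simply ranks unlabeled samples by this value and queries the highest-entropy ones.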
16. *Bayesian Inference
• Training
  • A dropout layer is inserted after every convolution layer
  (−) Super slow convergence → impractical for current deep nets
• Inference
  • N feed-forward passes → N predictions
  • Uncertainty = variance between the predictions
  (−) Computationally expensive
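The MC-dropout procedure can be sketched with a toy one-layer model (a hypothetical example; real use keeps the network's own dropout layers active at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, weights, n_passes=20, drop_p=0.5):
    """Monte-Carlo dropout: keep dropout active at test time, run
    N stochastic forward passes, and use the variance across the
    passes as an uncertainty estimate. Toy single-layer model."""
    preds = []
    for _ in range(n_passes):
        mask = rng.random(weights.shape) > drop_p      # random dropout mask
        logits = x @ (weights * mask) / (1 - drop_p)   # inverted-dropout scaling
        p = np.exp(logits - logits.max())              # stable softmax
        preds.append(p / p.sum())
    preds = np.stack(preds)
    mean = preds.mean(axis=0)
    uncertainty = float(preds.var(axis=0).sum())       # total predictive variance
    return mean, uncertainty

x = rng.normal(size=4)
w = rng.normal(size=(4, 3))
mean, u = mc_dropout_predict(x, w)
```

The N forward passes per sample are exactly what makes this expensive on large pools.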
21. *Diversity: Core-set
(+) Can be task-agnostic, as it depends only on the feature space
(−) Does not consider "hard" examples near the decision boundaries
(−) Expensive optimization for a large pool
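One common formulation of core-set selection is the greedy k-center algorithm, sketched below in NumPy (an illustrative sketch, not the paper's implementation; the quadratic distance computations hint at why large pools get expensive):

```python
import numpy as np

rng = np.random.default_rng(0)

def kcenter_greedy(features, labeled_idx, budget):
    """Greedy k-center sketch of core-set selection: repeatedly pick
    the unlabeled point farthest from the current selected set."""
    features = np.asarray(features, dtype=float)
    # Distance from every point to its nearest labeled point
    dists = np.min(
        np.linalg.norm(features[:, None] - features[labeled_idx][None], axis=-1),
        axis=1)
    picked = []
    for _ in range(budget):
        i = int(np.argmax(dists))                          # farthest point
        picked.append(i)
        d_new = np.linalg.norm(features - features[i], axis=1)
        dists = np.minimum(dists, d_new)                   # update nearest distances
    return picked

feats = rng.normal(size=(100, 16))
new_points = kcenter_greedy(feats, labeled_idx=[0, 1], budget=5)
```

Note that the selection depends only on feature distances, never on decision boundaries, which is the "hard examples" limitation above.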
22. Active Learning: Limitations
• Heuristic approach
  • Highest entropy [Joshi et al., CVPR'09]
  • Distance to decision boundaries [Tong & Koller, JMLR'01]
  (−) Task-specific design
• Ensemble approach [Freund et al., ML'97], [Beluch et al., CVPR'18]
  (−) Does not scale to large CNNs and data
• Bayesian approach
  • Expected error [Roy & McCallum, ICML'01] / expected model [Kapoor et al., ICCV'07]
  • Bayesian inference by dropout [Gal & Ghahramani, ICML'17]
  (−) Does not scale to large CNNs and data [Sener & Savarese, ICLR'18]
• Distribution approach
  • Density-based [Liu & Ferrari, ICCV'17], diversity-based [Sener & Savarese, ICLR'18]
  (−) Does not consider hard examples
23. Active Learning: Our approach
• Active learning by learning loss
  • Attach a "loss prediction module" to a target network
  • Learn the module to predict the loss
[Figure: losses are predicted over the unlabeled pool; human oracles annotate the top-K data points, which move into the labeled training set]
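The selection step of this cycle can be sketched in a few lines (a hypothetical `select_top_k` helper; predicted losses would come from the loss prediction module):

```python
import numpy as np

def select_top_k(predicted_losses, k):
    """Return indices of the k unlabeled samples with the highest
    predicted loss -- the data the model is most likely to get wrong."""
    predicted_losses = np.asarray(predicted_losses)
    return np.argsort(predicted_losses)[::-1][:k]

# Toy pool of 6 unlabeled samples with predicted losses:
pool_losses = [0.1, 0.9, 0.3, 0.7, 0.2, 0.5]
to_annotate = select_top_k(pool_losses, k=2)
print(to_annotate)  # indices 1 and 3 (losses 0.9 and 0.7)
```

The chosen indices are sent to the human oracles, and the newly labeled data joins the training set for the next cycle.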
24. Active Learning: Our approach
• Requirements
  • Task-agnostic method
  • Learning-based, not heuristic
  • Scalable to state-of-the-art networks and large data
25. Active Learning by Learning Loss
[Figure: a target model takes input x and produces the target prediction ŷ, giving the target loss l = L_task(ŷ, y) against the ground truth y; an attached loss prediction module predicts l̂, trained with the loss-prediction loss L_loss(l̂, l)]
• Multi-task learning of the target task and loss prediction
(+) Applicable to
  • any network and data
  • any task
(+) Nearly zero cost
30. Active Learning by Learning Loss
• The loss for loss prediction, L_loss(l̂, l)
• Mean squared error? L_loss(l̂, l) = (l̂ − l)²
  → the target task loss l decreases as training progresses, so its scale changes
32. Active Learning by Learning Loss
• The loss for loss prediction, L_loss(l̂, l)
• To ignore the scale changes of l, we use a ranking loss:
  L_loss(l̂_i, l̂_j, l_i, l_j) = max(0, −𝟙(l_i, l_j) · (l̂_i − l̂_j) + ξ)
  where 𝟙(l_i, l_j) = +1 if l_i > l_j, −1 otherwise
  (l̂_i, l̂_j: a pair of predicted losses; l_i, l_j: a pair of real losses; ξ: margin (=1))
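The ranking loss on this slide can be written out directly (a minimal sketch; `margin` corresponds to ξ = 1):

```python
def ranking_loss(lp_i, lp_j, l_i, l_j, margin=1.0):
    """Pairwise ranking loss for loss prediction: penalizes predicted
    losses (lp_i, lp_j) whose ordering disagrees with the true losses
    (l_i, l_j), enforced with a margin."""
    sign = 1.0 if l_i > l_j else -1.0        # the indicator 1(l_i, l_j)
    return max(0.0, -sign * (lp_i - lp_j) + margin)

# Correct ordering with a wide enough gap -> zero loss
print(ranking_loss(lp_i=2.0, lp_j=0.5, l_i=3.0, l_j=1.0))  # 0.0
# Wrong ordering -> positive loss
print(ranking_loss(lp_i=0.5, lp_j=2.0, l_i=3.0, l_j=1.0))  # 2.5
```

Because only the ordering of predicted losses matters, the loss is unaffected when the overall scale of l shrinks during training.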
33. Active Learning by Learning Loss
• Given a mini-batch B, the total loss is defined as
  (1/|B|) Σ_{(x,y)∈B} L_task(ŷ, y) + λ · (1/|B|) Σ_{(x^i, y^i, x^j, y^j)∈B} L_loss(l̂_i, l̂_j, l_i, l_j)
  where l_i = L_task(ŷ_i, y_i)
  (the first term is the target task loss; the second is the loss-prediction loss over pairs (i, j) within the mini-batch B)
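A sketch of the total mini-batch objective, assuming pairs are formed by splitting the batch in half (one simple pairing scheme; the exact normalization over pairs is a sketch choice here):

```python
import numpy as np

def total_loss(task_losses, predicted_losses, lam=1.0, margin=1.0):
    """Mini-batch objective: mean target-task loss plus lambda times
    the mean pairwise ranking loss, with pairs formed by splitting
    the batch in half."""
    task_losses = np.asarray(task_losses, dtype=float)
    predicted_losses = np.asarray(predicted_losses, dtype=float)
    half = len(task_losses) // 2
    l_i, l_j = task_losses[:half], task_losses[half:2 * half]
    lp_i, lp_j = predicted_losses[:half], predicted_losses[half:2 * half]
    sign = np.where(l_i > l_j, 1.0, -1.0)              # indicator 1(l_i, l_j)
    rank = np.maximum(0.0, -sign * (lp_i - lp_j) + margin)
    return float(task_losses.mean() + lam * rank.mean())

# Correctly ordered pair -> only the task term contributes
print(total_loss([2.0, 1.0], [3.0, 0.5]))  # 1.5
```

In training, `task_losses` come from L_task and `predicted_losses` from the loss prediction module, and both terms are backpropagated jointly.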
35. Active Learning by Learning Loss
• MSE loss vs. ranking loss
[Figure: active learning curves comparing MSE and ranking losses, ResNet-18 on CIFAR-10]
36. Active Learning by Learning Loss
• Loss prediction module: where to attach?
[Figure: a target model composed of mid-blocks and an out-block; convolved features from points with enough convolutions are concatenated and fed to an FC layer for loss prediction, so the loss-prediction loss backpropagates into the convolutions]
40. Active Learning by Learning Loss
• Loss prediction module: enough convolutions
  • The convolutions are learned by the loss-prediction loss as well as the target loss
  • The receptive field is already sufficiently large
→ No more convolutions are needed; we just focus on merging the multiple features
41. Active Learning by Learning Loss
• Loss prediction module
[Figure: each of the target model's mid-blocks and out-block feeds a GAP → FC → ReLU branch; the branch outputs are concatenated, and a final FC layer produces the loss prediction]
(+) Very efficient, as GAP reduces the feature dimension
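A NumPy sketch of the module's forward pass (hypothetical weights and ResNet-18-like feature shapes; a real implementation would use a deep learning framework):

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_pred_forward(features, fc_weights, out_weights):
    """Loss prediction module sketch: for each intermediate feature
    map (C, H, W), global-average-pool to C, project with FC + ReLU
    to a common width, concatenate, and regress a scalar loss."""
    hidden = []
    for f, w in zip(features, fc_weights):
        gap = f.mean(axis=(1, 2))                 # GAP: (C, H, W) -> (C,)
        hidden.append(np.maximum(0.0, w @ gap))   # FC + ReLU -> (128,)
    h = np.concatenate(hidden)                    # concat -> (128 * num_blocks,)
    return float(out_weights @ h)                 # final FC -> scalar predicted loss

# Toy ResNet-18-like feature shapes (channels x height x width):
shapes = [(64, 32, 32), (128, 16, 16), (256, 8, 8), (512, 4, 4)]
feats = [rng.normal(size=s) for s in shapes]
fcs = [rng.normal(size=(128, s[0])) * 0.01 for s in shapes]
out_w = rng.normal(size=512) * 0.01               # 4 blocks * 128 = 512
pred = loss_pred_forward(feats, fcs, out_w)
```

GAP collapses each H×W map to a single value per channel, which is what keeps the module's parameter count and compute nearly negligible.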
42. Active Learning by Learning Loss
• Loss prediction module: an alternative with more convolutions
[Figure: same structure, but each branch inserts an added Conv → BN → ReLU block before GAP → FC → ReLU, with the final FC after the concatenation]
43. Active Learning by Learning Loss
• Loss prediction module: more convolutions vs. just FC
[Figure: active learning curves comparing the two designs, ResNet-18 on CIFAR-10]
45. Experiments (1)
• To validate "task-agnostic" + "state-of-the-art architectures"

       | Classification          | Classification + regression | Regression
  Task | Image classification    | Object detection            | Human pose estimation
  Data | CIFAR-10                | PASCAL VOC 2007+2012        | MPII
  Net  | ResNet-18               | SSD                         | Stacked Hourglass Networks
       | [He et al., CVPR'16]    | [Liu et al., ECCV'16]       | [Newell et al., ECCV'16]
48. Results
• Image classification over CIFAR-10
[Figure: loss prediction module attached to ResNet-18 [He et al., CVPR'16]; the 64×32×32, 128×16×16, 256×8×8, and 512×4×4 features each pass a GAP → FC → ReLU branch to 128 dimensions, are concatenated to 512, and a final FC gives the loss prediction]
52. Results
• Image classification over CIFAR-10 (mean of 5 trials)
[Figure: accuracy curves against the entropy [Joshi, CVPR'09] and core-set [Sener et al., ICLR'18] baselines; ours +3.37%]
• Data selection vs. architecture
  • Data selection by active learning → +3.37%
  • DenseNet-121 [Huang et al.] − ResNet-18 → +2.02%
53. Results
• Object detection
[Figure: loss prediction module attached to SSD (ImageNet pre-trained) [Liu et al., ECCV'16]; the six feature maps 512×38×38, 1024×19×19, 512×10×10, 256×5×5, 256×3×3, and 256×1×1 each pass a GAP → FC → ReLU branch to 128 dimensions, are concatenated to 768, and a final FC gives the loss prediction]
57. Results
• Object detection on PASCAL VOC 07+12 (mean of 3 trials)
[Figure: mAP curves against the entropy [Joshi, CVPR'09] and core-set [Sener et al., ICLR'18] baselines; ours +2.21%]
• Data selection vs. architecture
  • Data selection by active learning → +2.21%
  • YOLOv2 [Redmon et al.] − SSD → +1.80%
58. Results
• Human pose estimation over the MPII dataset
[Figure: loss prediction module attached to the Stacked Hourglass Network [Newell et al., ECCV'16]; the 256×64×64 features from each hourglass pass GAP → FC → ReLU branches to 128 dimensions, are concatenated (1024-d), and a final FC gives the loss prediction]
62. Results
• Human pose estimation over the MPII dataset (mean of 3 trials)
[Figure: accuracy curves against the entropy [Joshi, CVPR'09] and core-set [Sener et al., ICLR'18] baselines; ours +1.84%]
• Data selection vs. number of stacks
  • Data selection by active learning → +1.84%
  • 8-stacked − 2-stacked → +0.25%
64. Experiments (2)
• To validate "active domain adaptation"

                | Dataset             | Data stats               | Active learning
  Source domain | MNIST               | #train: 60k, #test: 10k  | Use 60k as an initial labeled pool
  Target domain | MNIST + background  | #train: 12k, #test: 50k  | Add 1k for each cycle
65. Results
• Image classification over MNIST
[Figure: loss prediction module attached to the PyTorch MNIST model* (Conv → ReLU, Conv → ReLU, FC → ReLU, FC); the 10×12×12, 20×4×4, and 50-d features each pass a GAP → FC → ReLU branch to 64 dimensions, are concatenated to 192, and a final FC gives the loss prediction]
*https://github.com/pytorch/examples/tree/master/mnist
69. Results
• Domain adaptation from MNIST to MNIST+background
• Target-domain performance
[Figure: accuracy curves against the entropy [Joshi, CVPR'09] and core-set [Sener et al., ICLR'18] baselines; the core-set feature space is overfitted to the source domain; ours +1.20%]
• Data selection vs. architecture
  • Data selection by active learning → +1.20%
  • WideResNet-14 − PyTorch MNIST (4 layers) → +2.85%
71. Conclusion
• Introduced a novel active learning method that
  • works well with current deep networks
  • is task-agnostic
• Verified with
  • three major visual recognition tasks
  • three popular network architectures

"Pick more important data, and get better performance!"