arXiv:1509.07627
http://arxiv.org/abs/1509.07627
In this paper, we evaluate convolutional neural network (CNN) features using the AlexNet architecture developed by [9] and the very deep convolutional network (VGGNet) architecture developed by [16]. To date, most CNN researchers have employed the last layers before the output, which were extracted from the fully connected feature layers. However, since feature representation effectiveness is likely dependent on the problem, this study also evaluates the convolutional layers adjacent to the fully connected layers, in addition to performing simple tuning for feature concatenation (e.g., layer 3 + layer 5 + layer 7) and transformation using tools such as principal component analysis. In our experiments, we carried out detection and classification tasks using the Caltech 101 and Daimler Pedestrian Benchmark datasets.
1. Feature Evaluation of Deep Convolutional Neural
Networks for Object Recognition and Detection
Hirokatsu Kataoka, Kenji Iwata, Yutaka Satoh
National Institute of Advanced Industrial Science and Technology (AIST)
http://www.hirokatsukataoka.net/
arXiv preprint arXiv:1509.07627
http://arxiv.org/abs/1509.07627
2. Feature Evaluation
• Significant task in computer vision
– Following DeCAF [Donahue+, ICML2014], we evaluate several CNN
features + an SVM classifier
– Representative architectures: AlexNet [Krizhevsky+, NIPS2012] &
VGGNet [Simonyan+, ICLR2015]
– Basic Idea 1: Which layer gives the best features in a CNN architecture?
– Basic Idea 2: Mid- & high-level CNN features should be concatenated!
(e.g., Layer 3 + Layer 5 + Layer 7)
3. CNN Architecture & Feature Extraction
• AlexNet & VGGNet
– AlexNet: 8-layer architecture
– VGGNet: 16-layer architecture (each pooling layer and the last 2 FC layers
are used as feature vectors)
[Figure: AlexNet and VGGNet architectures. AlexNet (8 layers): Input → Conv/Pool blocks → FC → FC → Softmax. VGGNet (16 layers): Input → stacked Conv/Pool blocks → FC → FC → Softmax. Legend: image input, convolutional layer, max-pooling layer, fully-connected layer, softmax layer. Layers 1–7 are marked on both networks as feature-extraction points.]
4. Experiment
• Settings
– Layers: 3–7 (middle and deeper layers)
• Conv., pooling, and fully connected layers
– Concatenation and transformation
• Layers 3+4+5, 4+5+6, 5+6+7, 3+5+7
• Principal component analysis (PCA): 1,500 dims
– Classifier
• Support vector machine (SVM)
• Parameters follow DeCAF [Donahue+, ICML2014]
• Datasets
– Daimler pedestrian benchmark dataset (pedestrian detection) [Munder+,
TPAMI2006]
– Caltech 101 dataset (object classification) [Fei-Fei+, CVPRW2004]
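The PCA + SVM setting above can be sketched as a scikit-learn pipeline. Assumptions: the feature matrix here is synthetic stand-in data (real inputs would be CNN-layer activations), and the PCA dimensionality is reduced from the paper's 1,500 because PCA cannot return more components than training samples.

```python
# Sketch: PCA transformation + SVM classification of CNN feature vectors,
# mirroring the paper's pipeline on synthetic stand-in data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4096))      # 200 samples of 4096-dim layer features
y = rng.integers(0, 2, size=200)      # binary labels (e.g., pedestrian / not)

# PCA to 150 dims here; the paper uses 1,500 dims, which requires
# at least that many training samples.
clf = make_pipeline(PCA(n_components=150), LinearSVC())
clf.fit(X, y)

pred = clf.predict(X[:5])
print(pred.shape)  # (5,)
```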
5. Results on the Daimler dataset
• Daimler pedestrian benchmark dataset
– VGGNet Layer 5 (original vector) achieves the best rate (99.35%)
– For AlexNet, Layer 3 with PCA achieves the best rate (98.71%)
Mid-level layers tend to achieve better rates on the pedestrian detection data
6. Results on the Caltech 101 dataset
• Caltech 101 dataset
– VGGNet Layer 5 (original vector) achieves the best rate (91.80%)
– For AlexNet, Layer 5 with PCA achieves the best rate (78.37%)
The layer just before the FC layers performs well in object classification
7. Feature Concatenation
• Three-layer connection with PCA
– Layers 3+4+5, 4+5+6, 5+6+7, 3+5+7
– 4,500 dimensions (1,500 dims per layer vector)
– Left: Daimler
– Right: Caltech 101
VGGNet Layers 5+6+7 gives the most significant improvement
Pedestrian detection: mid-level features
Object classification: high-level features
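The concatenation scheme above (PCA-reduce each layer, then stack) can be sketched as follows. This is illustrative: the feature matrices are synthetic stand-ins, and the per-layer dimensionality is 50 instead of the paper's 1,500 to keep the example small.

```python
# Sketch: concatenate PCA-reduced features from three layers
# (e.g., Layer 5 + 6 + 7 -> 3 x 1,500 = 4,500 dims in the paper;
# 3 x 50 = 150 dims here on synthetic data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
layer5 = rng.normal(size=(100, 2048))   # stand-in pool5 features
layer6 = rng.normal(size=(100, 4096))   # stand-in fc6 features
layer7 = rng.normal(size=(100, 4096))   # stand-in fc7 features

# Reduce each layer independently, then concatenate horizontally
reduced = [PCA(n_components=50).fit_transform(f)
           for f in (layer5, layer6, layer7)]
concat = np.hstack(reduced)

print(concat.shape)  # (100, 150)
```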
8. Conclusion
• Feature evaluation with AlexNet & VGGNet
– VGGNet performs better than AlexNet
– Mid-level features are good for pedestrian detection; high-level features are
good for object classification
– Concatenating VGGNet's 5th pooling layer and last 2 FC layers is the best
setting on the Daimler pedestrian benchmark and Caltech 101 datasets
– PCA is an effective transformation for CNN features