CVGIP-2015
VISUAL APPEARANCE-BASED FOOD RECOGNITION USING SPARSE CODING

Duan-Yu Chen (陳敦裕)¹, Hao-Syuan Wang (王皓玄)¹, Yue-Min Jiang (蔣岳珉)² and Szu-Han Tsao (曹思漢)²

¹ Dept. of Electrical Engineering, Yuan Ze University, Taiwan
² Industrial Technology Research Institute, SSTC, Taiwan, ROC

E-mail: dychen@saturn.yzu.edu.tw, s1000654@mail.yzu.edu.tw, jongfat@itri.org.tw, alfredtzao@itri.org.tw
ABSTRACT

In recent years, food recognition techniques have attracted considerable attention due to emerging personal healthcare applications. However, image-based food recognition is a challenging task because of the variety of food appearances, even among images captured from the same food class. In this work, instead of using a feature-based approach, patch-based visual appearance is employed directly. Sparse coding is then used for dictionary learning. Moreover, the atom distribution is computed for classifier training, which is conducted with an SVM (support vector machine). Experiment results show that a recognition rate of about 90% can be achieved when the target class is recognized among the top-2 rankings, which shows that the proposed approach is practical for real-world environments.

Keywords: Food recognition; Sparse Coding; Support Vector Machine
1. INTRODUCTION
Visual-based food recognition is one of the emerging applications of object recognition technology, because it can help estimate food calories and analyze people's eating habits for personal healthcare. Therefore, several works have been developed so far [1-8]. Research in the computer vision community has explored the recognition of either a small subset of food types in controlled laboratory environments [1-2] or food images obtained from the web [3]. However, there have been only a few implemented systems that address the challenge of food recognition from images captured in real-world environments [7]. Moreover, most of them employ a feature-based approach, such as SIFT [8]. This kind of method can work well in constrained environments, but its main difficulty is finding an invariant feature that is robust to the distinct kinds of visual appearance resulting from different food placements, which in real-world environments are essentially random. Therefore, in this work, to overcome this problem, patch-based visual appearance is used directly without previously derived features. The rest of this work is organized as follows. In Section 2, the proposed patch-based food recognition using sparse coding is introduced. In Section 3, preliminary experiment results are presented, and some concluding remarks are drawn in Section 4.
2. PATCH-BASED FOOD RECOGNITION USING SPARSE CODING
Fig. 1. Overview of the proposed approach
In what follows, we describe the proposed approach. The workflow is shown in Fig. 1. Images taken from a CCD camera are first transformed to the HSV color space, and patches of a size empirically set to 16×16 are directly extracted from the HSV color channels. Sparse coding is then applied to learn a dictionary for each food category. Finally, the atom probability distribution of each training sample is computed for classifier training, which is conducted by SVM.
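The patch-extraction step above can be sketched as follows. This is a minimal illustration (not the paper's MATLAB implementation), assuming non-overlapping 16×16 patches and using matplotlib's RGB-to-HSV conversion; each patch is flattened from (16, 16, 3) to one 768-dimensional vector, as described in Section 2.1.

```python
import numpy as np
from matplotlib.colors import rgb_to_hsv

def extract_patches(rgb_image, patch_size=16):
    """Convert an RGB image to HSV and cut it into non-overlapping
    patch_size x patch_size patches, each flattened to a 1-D vector."""
    hsv = rgb_to_hsv(rgb_image.astype(np.float64) / 255.0)
    h, w, _ = hsv.shape
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):
        for x in range(0, w - patch_size + 1, patch_size):
            patch = hsv[y:y + patch_size, x:x + patch_size, :]
            patches.append(patch.reshape(-1))  # 16*16*3 = 768 dimensions
    return np.array(patches)

# A 640x480 image yields (480 // 16) * (640 // 16) = 1200 patches.
img = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
P = extract_patches(img)
print(P.shape)  # (1200, 768)
```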
2.1 Dictionary Learning Using Sparse Coding
Fig. 2. Examples of 25 food classes
In the current work, a total of 25 food categories that often appear in meal boxes are selected, as demonstrated in Fig. 2. To learn the dictionary for an input image, we apply a dictionary learning technique via sparse coding [9-10], using training patches extracted from the training samples themselves to learn a dictionary $D_f$. Sparse coding is the technique of finding a sparse representation of a signal with a small number of nonzero or significant coefficients corresponding to the atoms in a dictionary.

Here, we intend to construct a dictionary $D_f$ containing the local structure of textures for sparsely representing each patch. To achieve a visual appearance representation from color images, we transform each color image patch from three dimensions to one dimension. By extracting a set of training patches $y_k$, $k = 1, 2, \ldots, p$, from the training samples, the dictionary $D_f$ can be learned by solving the following optimization problem:

$\min_{D_f, \{\alpha_k\}} \sum_{k=1}^{p} \left( \tfrac{1}{2}\|y_k - D_f\alpha_k\|_2^2 + \lambda\|\alpha_k\|_1 \right)$, (1)

where $\alpha_k$ denotes the sparse coefficients of $y_k$ with respect to $D_f$, and $\lambda$ is a regularization parameter. In our method, the efficient online dictionary learning algorithm proposed in [10] is used to solve Eq. (1). However, to limit the computational complexity of dictionary learning, we propose to use a smaller value of the "mini-batch" parameter of the online dictionary learning algorithm [10], which significantly reduces the learning cost. The dictionaries learned for the 25 food categories are demonstrated in Fig. 3.
Fig. 3. 25 dictionaries obtained using sparse coding
After obtaining the dictionaries, we formulate the problem of food recognition as a sparse coding problem as follows:

$\min_{\alpha_k} \|y_k - D_f\alpha_k\|_2^2 \quad \text{s.t.} \quad \|\alpha_k\|_0 \le l$, (2)

where $y_k$ represents the $k$-th patch, $\alpha_k$ are the sparse coefficients of $y_k$ with respect to $D_f$, and $l$ denotes the sparsity, i.e., the maximum number of nonzero coefficients of $\alpha_k$. Since $\ell_0$-minimization is hard to optimize, based on [11-12], the $\ell_0$-minimization problem in Eq. (2) can be cast as the following $\ell_1$-minimization problem:

$\hat{\alpha}_k = \arg\min_{\alpha_k} \tfrac{1}{2}\|y_k - D_f\alpha_k\|_2^2 + \lambda\|\alpha_k\|_1$, (3)

where $\hat{\alpha}_k$ denotes the solution minimizing Eq. (3) and $\lambda$ is a regularization parameter. To solve Eq. (3), we apply the very efficient sparse coding implementation provided in [10]. Each patch $y_k$ can then be reconstructed from the atoms with nonzero coefficients in $\hat{\alpha}_k$.
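The sparse coding step can be sketched with scikit-learn's sparse_encode. For brevity this sketch solves the ℓ0 form of Eq. (2) directly with orthogonal matching pursuit and a sparsity budget of 10 nonzeros; the paper instead solves the ℓ1 relaxation of Eq. (3) with the solver of [10] (passing algorithm="lasso_lars" with an alpha would mirror that). The random dictionary and patch are stand-ins.

```python
import numpy as np
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(1)
D_f = rng.standard_normal((128, 768))  # stand-in learned dictionary
y = rng.standard_normal((1, 768))      # one flattened test patch

# Sparse code with at most l = 10 nonzero coefficients, cf. Eq. (2).
alpha = sparse_encode(y, D_f, algorithm="omp", n_nonzero_coefs=10)
print(alpha.shape)              # (1, 128)
print(np.count_nonzero(alpha))  # at most 10

# The patch is reconstructed from its few active atoms.
y_hat = alpha @ D_f
```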
2.2 Atom-Frequency based Food Category Feature
Extraction
After obtaining the sparse coefficients for each
training sample, ideally its corresponding category can
be recognized according to its atom distribution.
However, as demonstrated in Fig. 4, patches extracted
from a category could have diverse distribution over 25
categories. Therefore, in order to overcome this problem,
the atom distribution of each training sample is
considered as a feature vector for further classifier
training using SVM.
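The atom-frequency feature of Section 2.2 can be sketched as follows: count how often each atom receives a nonzero coefficient over a sample's patches, normalize to a distribution, and feed the vectors to an SVM. Note that in the paper the distribution spans the atoms of all 25 dictionaries (roughly 25 × 128 = 3200 bins, matching the axis of Fig. 4); this sketch uses a single small code matrix and random toy data, which are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

def atom_histogram(codes):
    """Normalized frequency of atom activations over a sample's patches.
    codes has shape (n_patches, n_atoms); returns an (n_atoms,) distribution."""
    counts = np.count_nonzero(codes, axis=0).astype(float)
    total = counts.sum()
    return counts / total if total > 0 else counts

rng = np.random.default_rng(2)
n_atoms = 128
X, y = [], []
for label in range(3):        # 3 of the 25 classes, toy data only
    for _ in range(10):       # 10 training samples per class
        # sparse stand-in codes: ~5% of coefficients are active
        codes = rng.random((50, n_atoms)) * (rng.random((50, n_atoms)) < 0.05)
        X.append(atom_histogram(codes))
        y.append(label)

clf = SVC(kernel="rbf").fit(np.array(X), y)  # classifier over atom distributions
```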
Fig. 4. An example of atom distribution obtained using a testing image
3. EXPERIMENT RESULTS
To evaluate the performance of the proposed patch-based food image recognition, the proposed method was implemented in MATLAB® on a personal computer equipped with an Intel® Core™ i3-4130 CPU @ 3.40 GHz and 4 GB of memory. The parameter settings of
the proposed method are described as follows. In the
dictionary learning step, we used the online dictionary
learning implementation provided in [10] with the
suggested regularization parameter λ used in Eq. (1) set
to 0.15. In addition, the dictionary size (number of atoms)
used in the online dictionary learning for our method is
set to 128 since we can observe from Fig. 5 that this
setting has the best performance among five dictionary
sizes 64, 128, 256, 384 and 512.
Fig. 5. Precision obtained by varying the number of
atoms per dictionary
In the sparse coding step, the implementation with the number of nonzero coefficients set to at most 10 (L = 10 in Eq. (2)), as suggested in [10], was employed. The patch size for each test image and the number of dictionary training iterations are set to 16×16 and 100,
respectively. A smaller value of L leads to lower
computational complexity, but fewer employed atoms in
the dictionary, which may degrade the performance of
food recognition. On the contrary, larger L leads to
higher computational complexity, but the performance
improvement will be saturated when L exceeds a certain
number (about 10 in our experiments). Similar
characteristics are also valid for the parameter settings of
the dictionary size (number of atoms) and the number of
dictionary training iterations.
For the training dataset, we collected 50 samples for each category at a resolution of 640×480. For performance evaluation, the precision is evaluated from top-1 to top-5 rankings, since it is challenging to have the target recognized at exactly rank 1. In Fig. 6, we can observe that we achieve about 65% top-1 accuracy and more than 90% top-2 accuracy. In addition, the top-3 to top-5 accuracies all exceed 96%. The results show that the proposed approach is promising for real-world applications.
Fig. 6. Precision evaluation from top-1 to top-5 rankings
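The top-k precision used in this evaluation can be made concrete with a short sketch (the scores and labels below are toy values, not the paper's data): a sample counts as correct if its true class appears among the k highest-scoring classes.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true class is among the k highest-scoring
    classes; scores has shape (n_samples, n_classes)."""
    topk = np.argsort(scores, axis=1)[:, ::-1][:, :k]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

# Toy check: 4 samples, 3 classes.
scores = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.5, 0.4],
                   [0.3, 0.4, 0.3],
                   [0.2, 0.2, 0.6]])
labels = np.array([0, 2, 1, 0])
print(top_k_accuracy(scores, labels, 1))  # 0.5
print(top_k_accuracy(scores, labels, 2))  # 0.75
```

Top-k accuracy is non-decreasing in k, which is why the curve in Fig. 6 rises from the top-1 to the top-5 ranking.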
Some examples of misclassifications are demonstrated in Fig. 7. The noodles in Fig. 7(a) are classified as loofah since most of their patches are similar in local visual appearance. Despite some light green patches in Fig. 7(b), most of the loofah patches are white and similar to the noodle patches. Fig. 7(c) shows some pieces of sweet potato that are highly similar to the pumpkin in Fig. 7(d), with almost all patches in an orange-like color and even a similar piece shape. In Fig. 7(e), the steamed squash has a similar color, piece size and shape compared to the steamed potato shown in Fig. 7(f).
Regarding the elapsed time of the proposed approach, we conducted experiments with different test sample sizes, varying the resolution from 100×100 to 400×400. As shown in Table 1, on average 31 seconds are needed even for the lowest image resolution among the four settings. For a test sample size of 400×400, on average 510 seconds are needed to obtain the recognition result. It is clear that the proposed approach in its current form cannot achieve food recognition in real time, because the coefficient computation in sparse coding has high computational complexity. However, the experiment results show that our proposed approach can achieve over 90% accuracy when the top-2 rankings are considered.
(a) (b)
(c) (d)
(e) (f)
Fig. 7. Examples of misclassifications: (a) noodles recognized as (b) loofah; (c) sweet potato recognized as (d) pumpkin; (e) squash recognized as (f) potato
Table 1. Elapsed Time Evaluated for Different Test Sample Sizes

  Test Sample Size   | 100×100 | 200×200 | 300×300 | 400×400
  Elapsed Time (sec) |   31    |   205   |   345   |   510
4. CONCLUSION
In this work, instead of using a feature-based approach, patch-based visual appearance has been employed directly. Sparse coding has then been used for dictionary learning. Moreover, the atom distribution has been computed for classifier training, which was conducted by SVM. Experiment results have shown that recognition rates of about 65% and 90% are achieved when the target class is recognized at top-1 and among the top-2 rankings, respectively. This shows that the proposed approach is feasible for real-world applications.
REFERENCES
[1] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and
J. Yang, “Pfid: Pittsburgh Fast-food Image Dataset,” Proc.
IEEE International Conference on Image Processing, 2009.
[2] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar,
“Food Recognition Using Statistics of Pairwise Local
Features,” Proc. IEEE International Conference on Computer
Vision and Pattern Recognition, 2010.
[3] H. Hoashi, T. Joutou, and K. Yanai, “Image Recognition of
85 Food Categories by Feature Fusion,” Proc. IEEE
International Symposium on Multimedia, 2010.
[4] Y. Kawano, and K. Yanai, “Foodcam: A Real-time Food
Recognition System on A Smartphone,” Multimedia Tools and
Applications, Vol. 24, 2014.
[5] Y. Matsuda, H. Hoashi , and K. Yanai, “Recognition of
Multiple-food Images by Detecting Candidate Regions,” Proc.
IEEE International Conference on Multimedia and Expo, 2012.
[6] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar,
“Food Recognition Using Statistics of Pairwise Local
Features,” Proc. IEEE International Conference on Computer
Vision and Pattern Recognition, 2010.
[7] K. Kitamura, C. de Silva, T. Yamasaki, and K. Aizawa,
“Image processing based approach to food balance analysis for
personal food logging,” Proc. IEEE International Conference on
Multimedia, 2010.
[8] V. Bettadapura, E. Thomaz, A. Parnami, G. D. Abowd,
and I. Essa, “Leveraging Context to Support Automated Food
Recognition in Restaurants,” Proc. IEEE Winter Conference on
Applications of Computer Vision, 2015.
[9] M. Aharon, M. Elad, and A. M. Bruckstein, “The K-SVD:
an algorithm for designing of overcomplete dictionaries for
sparse representation,” IEEE Trans. Signal Process., vol. 54,
no. 11, pp. 4311–4322, Nov. 2006.
[10] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online
learning for matrix factorization and sparse coding,” J. Mach.
Learn. Res., vol. 11, pp. 19–60, 2010.
[11] D. L. Donoho, “Compressed sensing,” IEEE Trans. Info.
Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[12] A. M. Bruckstein, D. L. Donoho, and M. Elad, “From
sparse solutions of systems of equations to sparse modeling of
signals and images,” SIAM Rev., vol. 51, no. 1, pp. 34–81, Feb.
2009.