VISUAL APPEARANCE-BASED FOOD RECOGNITION
USING SPARSE CODING
Duan-Yu Chen (陳敦裕)¹, Hao-Syuan Wang (王皓玄)¹, Yue-Min Jiang (蔣岳珉)² and Szu-Han Tsao (曹思漢)²
¹ Dept. of Electrical Engineering, Yuan Ze University, Taiwan
² Industrial Technology Research Institute, SSTC, Taiwan, ROC
E-mail: dychen@saturn.yzu.edu.tw, s1000654@mail.yzu.edu.tw, jongfat@itri.org.tw, alfredtzao@itri.org.tw
ABSTRACT
In recent years, food recognition techniques have attracted
considerable attention because of emerging personal healthcare
applications. However, image-based food recognition is
challenging because the appearance of food varies widely, even
among images of the same food class. In this work, instead of a
feature-based approach, patch-based visual appearance is
employed directly. Sparse coding is then used for dictionary
learning, and the atom distribution is computed to train an SVM
(support vector machine) classifier. Experimental results show
that a recognition rate of about 90% is achieved when the target
class is counted as recognized if it appears among the top 2
ranked classes, which suggests that the proposed approach is
practical for real-world environments.
Keywords: Food recognition; Sparse coding; Support vector machine
1. INTRODUCTION
Vision-based food recognition is an emerging application of
object recognition technology because it helps estimate food
calories and analyze people's eating habits for personal
healthcare. Several approaches have therefore been developed
[1-8]. Research in the computer vision community has explored
the recognition of either a small subset of food types in
controlled laboratory environments [1-2] or food images obtained
from the web [3]. However, only a few implemented systems
address the challenge of recognizing food from images captured
in real-world environments [7]. Moreover, most of them employ a
feature-based approach, such as SIFT [8]. This kind of method
can work well in constrained environments, but its main
difficulty is finding an invariant feature that is robust to the
varied visual appearance resulting from different food
placements, and food placement in real-world environments is
essentially random. Therefore, to overcome this problem, this
work uses patch-based visual appearance directly, without
previously derived features. The rest of this paper is organized
as follows. Section 2 introduces the proposed patch-based food
recognition using sparse coding. Section 3 presents preliminary
experimental results, and Section 4 draws some concluding
remarks.
2. PATCH-BASED FOOD RECOGNITION USING
SPARSE CODING
Fig. 1. Overview of the proposed approach
The workflow of the proposed approach is shown in Fig. 1. Images
taken by a CCD camera are first transformed to the HSV color
space, and patches, whose size is empirically set to 16×16, are
extracted directly from the HSV color channels. Sparse coding is
then used to learn a dictionary for each food category. Finally,
the atom probability distribution of each training sample is
computed and used to train an SVM classifier.
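The patch extraction step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the paper does not give the patch stride, the channel ordering, or the value scaling, so non-overlapping 16×16 patches, H-then-S-then-V flattening, and values scaled to [0, 1] are assumptions made here. The function names are hypothetical.

```python
import colorsys

PATCH = 16  # patch side length stated in the paper


def rgb_image_to_hsv(image):
    """Convert an image given as rows of (r, g, b) tuples in 0..255 to HSV in [0, 1]."""
    return [[colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
             for (r, g, b) in row]
            for row in image]


def extract_patches(hsv, size=PATCH, stride=PATCH):
    """Return each size x size patch flattened to a 1-D list of length size*size*3.

    Assumption: stride == size (non-overlapping patches); the paper does not
    specify the stride.
    """
    height, width = len(hsv), len(hsv[0])
    patches = []
    for top in range(0, height - size + 1, stride):
        for left in range(0, width - size + 1, stride):
            vec = []
            for channel in range(3):  # assumed order: H, then S, then V
                for y in range(top, top + size):
                    for x in range(left, left + size):
                        vec.append(hsv[y][x][channel])
            patches.append(vec)
    return patches


# Synthetic 32x48 image of a single orange color, used only for illustration.
image = [[(255, 128, 0)] * 48 for _ in range(32)]
patches = extract_patches(rgb_image_to_hsv(image))
print(len(patches), len(patches[0]))
```

Each flattened vector corresponds to one column of the patch matrix that a dictionary learning routine would consume.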
2.1 Dictionary Learning Using Sparse Coding
Fig. 2. Examples of 25 food classes
In the current work, a total of 25 food categories that often
appear in meal boxes are selected, as shown in Fig. 2. To learn
the dictionary for the input image i, we apply dictionary
learning via sparse coding [9-10], using patches extracted from
the training samples themselves to learn the dictionary Df.
Sparse coding is the technique of finding a sparse
representation of a signal, that is, one with a small number of
nonzero or significant coefficients corresponding to the atoms
in a dictionary.
Here, we intend to construct a dictionary $D_f$ containing the
local structure of textures for sparsely representing each
patch. To obtain a visual appearance representation from color
images, we transform each color image patch from a
three-dimensional array into a one-dimensional vector. By
extracting a set of training patches $y^k$, k = 1, 2, …, p, from
the training samples, the dictionary $D_f$ can be learned by
solving the following optimization problem:

$$\min_{D_f,\,\theta^k} \frac{1}{p} \sum_{k=1}^{p} \left( \frac{1}{2} \left\| y^k - D_f \theta^k \right\|_2^2 + \lambda \left\| \theta^k \right\|_1 \right), \qquad (1)$$

where $\theta^k$ denotes the sparse coefficients of $y^k$ with
respect to $D_f$, and $\lambda$ is a regularization parameter.
In our method, the efficient online dictionary learning
algorithm proposed in [10] is used to solve Eq. (1). To limit
the computational complexity of dictionary learning, we use a
smaller value of the “mini-batch” parameter of the online
dictionary learning algorithm [10], which significantly reduces
the dictionary learning cost. The dictionaries learned for the
25 food categories are shown in Fig. 3.
Fig. 3. 25 dictionaries obtained using sparse coding
After obtaining the dictionaries, we formulate food recognition
as the following sparse coding problem:

$$\min_{\theta^k} \left\| b^k - D \theta^k \right\|_2^2 \quad \text{subject to} \quad \left\| \theta^k \right\|_0 \le L, \qquad (2)$$

where $b^k$ represents the k-th patch, $\theta^k$ are the sparse
coefficients of $b^k$ with respect to the learned dictionary
$D$, and L denotes the sparsity, i.e., the maximum number of
nonzero coefficients of $\theta^k$. Since l0-minimization is
hard to optimize, following [11-12], the l0-minimization problem
in Eq. (2) can be recast as the following l1-minimization
problem:

$$\hat{\theta}^k = \arg\min_{\theta} \left( \frac{1}{2} \left\| b^k - D \theta \right\|_2^2 + \lambda \left\| \theta \right\|_1 \right), \qquad (3)$$

where $\hat{\theta}^k$ denotes the solution minimizing Eq. (3)
and $\lambda$ is a regularization parameter. To solve Eq. (3),
we apply the efficient sparse coding implementation provided in
[10]. Each patch $b^k$ can then be reconstructed according to
the corresponding nonzero coefficients in $\hat{\theta}^k$.
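To make the l1-regularized sparse coding step concrete, the sketch below minimizes 0.5·‖b − Dθ‖² + λ‖θ‖₁ for a single patch b, which is assumed here to be the form of Eq. (3). It uses plain iterative soft-thresholding (ISTA), a simple substitute chosen for readability; it is not the solver from [10] that the paper relies on. The dictionary, the patch, and the function names are made up for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (proximal operator of the l1 norm)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def ista(D, b, lam, steps=2000):
    """Minimize 0.5*||b - D*theta||^2 + lam*||theta||_1 by ISTA.

    D is a list of n rows with m entries each, so atoms are the columns.
    """
    n, m = len(D), len(D[0])
    # The squared Frobenius norm bounds the largest eigenvalue of D^T D,
    # so 1/lipschitz is a safe (conservative) step size.
    lipschitz = sum(v * v for row in D for v in row)
    theta = [0.0] * m
    for _ in range(steps):
        residual = [sum(D[i][j] * theta[j] for j in range(m)) - b[i]
                    for i in range(n)]
        grad = [sum(D[i][j] * residual[i] for i in range(n)) for j in range(m)]
        theta = [soft_threshold(theta[j] - grad[j] / lipschitz, lam / lipschitz)
                 for j in range(m)]
    return theta


# Toy dictionary with three unit-norm atoms in R^2: (1,0), (0,1), (0.6,0.8).
D = [[1.0, 0.0, 0.6],
     [0.0, 1.0, 0.8]]
b = [1.2, 1.6]  # exactly twice the third atom
theta = ista(D, b, lam=0.15)
print([round(v, 3) for v in theta])
```

In this toy case the optimality conditions give the minimizer (0, 0, 1.85): only the third atom is used, and its coefficient is shrunk from 2 by λ, which shows how the l1 penalty yields a sparse code.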
2.2 Atom-Frequency based Food Category Feature
Extraction
After obtaining the sparse coefficients for each training
sample, its category could ideally be recognized directly from
its atom distribution. However, as shown in Fig. 4, patches
extracted from a single category can be distributed diversely
over the 25 categories. To overcome this problem, the atom
distribution of each training sample is treated as a feature
vector for training an SVM classifier.
Fig. 4. An example of the atom distribution obtained using a
testing image (horizontal axis from 0 to 3500, vertical axis
from 0 to 0.018)
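One way to turn the per-patch sparse coefficients into an atom-distribution feature vector is sketched below. The paper does not state whether atoms are counted or weighted by coefficient magnitude, so this sketch assumes a simple usage count per atom normalized to sum to one. The function name and the toy codes are hypothetical.

```python
def atom_distribution(codes, tol=1e-8):
    """Return the normalized frequency with which each atom is used.

    codes holds one sparse coefficient vector per patch, all of length m.
    Assumption: an atom is "used" by a patch when its coefficient is nonzero.
    """
    m = len(codes[0])
    counts = [0] * m
    for theta in codes:
        for j, value in enumerate(theta):
            if abs(value) > tol:
                counts[j] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * m
    return [c / total for c in counts]


# Three patches coded over a toy dictionary of four atoms.
codes = [
    [0.5, 0.0, 0.0, 0.0],
    [0.2, 0.0, -0.1, 0.0],
    [0.0, 0.0, 0.3, 0.0],
]
dist = atom_distribution(codes)
print(dist)
```

The resulting vector, one per image, is the kind of feature that would then be passed to an SVM for training; the classifier itself is not shown here.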
3. EXPERIMENT RESULTS
To evaluate the performance of the proposed patch-based food
image recognition, the method was implemented in MATLAB® on a
personal computer equipped with an Intel® Core™ i3-4130 CPU @
3.40 GHz and 4 GB of memory. The parameter settings are as
follows. In the dictionary learning step, we used the online
dictionary learning implementation provided in [10], with the
regularization parameter λ in Eq. (1) set to the suggested value
of 0.15. The dictionary size (number of atoms) is set to 128,
since Fig. 5 shows that this setting performs best among the
five dictionary sizes tested: 64, 128, 256, 384 and 512.
Fig. 5. Precision obtained by varying the number of
atoms per dictionary
In the sparse coding step, the number of nonzero coefficients is
set to at most 10 (L = 10 in Eq. (2)), following the suggested
setting. The patch size for each test image and the number of
dictionary training iterations are set to 16×16 and 100,
respectively. A smaller value of L lowers the computational
complexity but employs fewer atoms of the dictionary, which may
degrade recognition performance. Conversely, a larger L raises
the computational complexity, while the performance improvement
saturates once L exceeds a certain number (about 10 in our
experiments). Similar characteristics also hold for the
dictionary size (number of atoms) and the number of dictionary
training iterations.
For the training dataset, we collected 50 samples for each
category at a resolution of 640×480. For performance evaluation,
precision is measured from the top 1 to the top 5 rankings,
since it is challenging to have the target class ranked exactly
first. As shown in Fig. 6, we achieved about 65% top-1 accuracy
and more than 90% top-2 accuracy. In addition, the top-3 to
top-5 accuracy exceeds 96%. These results show that the proposed
approach is promising for real-world applications.
Fig. 6. Precision evaluation from top 1 to top 5 rankings
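The top-k measure used in this evaluation can be computed as in the following sketch: a test sample counts as correct when its true class is among the k highest-scoring classes. The scores and labels below are invented for illustration and are not the paper's data, and the function name is hypothetical.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k best-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Four toy samples scored over three classes.
scores = [
    [0.70, 0.20, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.30, 0.60],
    [0.40, 0.35, 0.25],
]
labels = [0, 2, 0, 1]

top1 = top_k_accuracy(scores, labels, 1)
top2 = top_k_accuracy(scores, labels, 2)
top3 = top_k_accuracy(scores, labels, 3)
print(top1, top2, top3)
```

In this toy set only the first sample is ranked correctly at top 1, two more are recovered at top 2, and the last at top 3, which mirrors how accuracy rises as k grows.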
Some examples of misclassification are shown in Fig. 7. The
noodles in Fig. 7(a) are classified as loofah because most of
their patches are similar in local visual appearance. Despite
some light green patches in Fig. 7(b), most loofah patches are
white and therefore similar to the noodle patches. Fig. 7(c)
shows pieces of sweet potato that closely resemble the pumpkin
in Fig. 7(d): almost all patches are orange-like in color, and
even the shape of the pieces is similar. In Fig. 7(e), the
steamed squash has a color, piece size and shape similar to
those of the steamed potato shown in Fig. 7(f).
To measure the elapsed time required by the proposed approach,
we conducted the experiment with test sample sizes varying from
100×100 to 400×400. As Table 1 shows, 31 seconds on average are
needed even for the lowest of the four image resolutions, and
for a test sample size of 400×400 the recognition result takes
510 seconds on average. Clearly, the proposed approach in its
current form cannot achieve real-time food recognition, because
computing the sparse coding coefficients has high computational
complexity. Nevertheless, the experimental results show that the
proposed approach achieves over 90% accuracy when the top-2
rankings are considered.
Fig. 7. Examples of misclassifications: (a) noodles
recognized as (b) loofah; (c) sweet potato recognized as
(d) pumpkin; (e) squash recognized as (f) potato
Table 1. Elapsed time for different test sample sizes

Test sample size (pixels)   100×100   200×200   300×300   400×400
Elapsed time (sec)          31        205       345       510
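As a rough reading of Table 1, the short calculation below converts the reported times into a cost per pixel and compares the growth in time with the growth in pixel count. It uses only the four numbers in the table; the per-pixel view is an interpretation added here, not an analysis from the paper.

```python
# Elapsed times reported in Table 1 (seconds), keyed by square image side length.
elapsed_sec = {100: 31, 200: 205, 300: 345, 400: 510}

base_side = 100
per_pixel_ms = {}
for side, seconds in elapsed_sec.items():
    pixels = side * side
    per_pixel_ms[side] = 1000.0 * seconds / pixels
    pixel_ratio = pixels / (base_side * base_side)
    time_ratio = seconds / elapsed_sec[base_side]
    print(f"{side}x{side}: {per_pixel_ms[side]:.2f} ms/pixel, "
          f"{pixel_ratio:.0f}x pixels, {time_ratio:.1f}x time")
```

The per-pixel cost stays between roughly 3 and 5 ms across the four sizes, and the 16-fold increase in pixels from 100×100 to 400×400 comes with a roughly 16-fold increase in time, which is consistent with the per-patch sparse coding dominating the runtime.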
4. CONCLUSION
In this work, instead of a feature-based approach, patch-based
visual appearance has been employed directly. Sparse coding has
been used for dictionary learning, and the atom distribution has
been computed to train an SVM classifier. Experimental results
have shown that recognition rates of about 65% and 90% are
achieved when the target class is recognized in the top 1 and
among the top 2 rankings, respectively, which indicates that the
proposed approach is feasible for real-world applications.
REFERENCES
[1] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and
J. Yang, “Pfid: Pittsburgh Fast-food Image Dataset,” Proc.
IEEE International Conference on Image Processing, 2009.
[2] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar,
“Food Recognition Using Statistics of Pairwise Local
Features,” Proc. IEEE International Conference on Computer
Vision and Pattern Recognition, 2010.
[3] H. Hoashi, T. Joutou, and K. Yanai, “Image Recognition of
85 Food Categories by Feature Fusion,” Proc. IEEE
International Symposium on Multimedia, 2010.
[4] Y. Kawano, and K. Yanai, “Foodcam: A Real-time Food
Recognition System on A Smartphone,” Multimedia Tools and
Applications, Vol. 24, 2014.
[5] Y. Matsuda, H. Hoashi, and K. Yanai, “Recognition of
Multiple-food Images by Detecting Candidate Regions,” Proc.
IEEE International Conference on Multimedia and Expo, 2012.
[6] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar,
“Food Recognition Using Statistics of Pairwise Local
Features,” Proc. IEEE International Conference on Computer
Vision and Pattern Recognition, 2010.
[7] K. Kitamura, C. de Silva, T. Yamasaki, and K. Aizawa,
“Image processing based approach to food balance analysis for
personal food logging,” Proc. IEEE International Conference on
Multimedia, 2010.
[8] V. Bettadapura, E. Thomaz, A. Parnami, G. D. Abowd,
and I. Essa, “Leveraging Context to Support Automated Food
Recognition in Restaurants,” Proc. IEEE Winter Conference on
Applications of Computer Vision, 2015.
[9] M. Aharon, M. Elad, and A. M. Bruckstein, “The K-SVD:
an algorithm for designing of overcomplete dictionaries for
sparse representation,” IEEE Trans. Signal Process., vol. 54,
no. 11, pp. 4311–4322, Nov. 2006.
[10] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online
learning for matrix factorization and sparse coding,” J. Mach.
Learn. Res., vol. 11, pp. 19–60, 2010.
[11] D. L. Donoho, “Compressed sensing,” IEEE Trans. Info.
Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[12] A. M. Bruckstein, D. L. Donoho, and M. Elad, “From
sparse solutions of systems of equations to sparse modeling of
signals and images,” SIAM Rev., vol. 51, no. 1, pp. 34–81, Feb.
2009.
CVGIP-2015

Contenu connexe

En vedette

The Key to Enhancing Educator Effectivness. Document
The Key to Enhancing Educator Effectivness. DocumentThe Key to Enhancing Educator Effectivness. Document
The Key to Enhancing Educator Effectivness. DocumentDoug Reznicek M.Ed.
 
Arquitectura romanaantonodeguglielmo
Arquitectura romanaantonodeguglielmoArquitectura romanaantonodeguglielmo
Arquitectura romanaantonodeguglielmoAntonio de Guglielmo
 
National Conference Report
National Conference ReportNational Conference Report
National Conference ReportSonia Mijarra
 
Absolut vodka (marketing project)
Absolut vodka (marketing project)Absolut vodka (marketing project)
Absolut vodka (marketing project)Kalpesh Ambre
 
Carlos Marx, teorías y conceptos sobre las clases sociales
Carlos Marx, teorías y conceptos sobre las clases socialesCarlos Marx, teorías y conceptos sobre las clases sociales
Carlos Marx, teorías y conceptos sobre las clases socialesSlideSCPyS
 
ιουστινιανός και το έργο του
ιουστινιανός και το έργο τουιουστινιανός και το έργο του
ιουστινιανός και το έργο τουsotirisgar
 

En vedette (11)

The Key to Enhancing Educator Effectivness. Document
The Key to Enhancing Educator Effectivness. DocumentThe Key to Enhancing Educator Effectivness. Document
The Key to Enhancing Educator Effectivness. Document
 
Mapa conceptual
Mapa conceptualMapa conceptual
Mapa conceptual
 
E12641
E12641E12641
E12641
 
Arquitectura romanaantonodeguglielmo
Arquitectura romanaantonodeguglielmoArquitectura romanaantonodeguglielmo
Arquitectura romanaantonodeguglielmo
 
National Conference Report
National Conference ReportNational Conference Report
National Conference Report
 
Kti komariah
Kti komariahKti komariah
Kti komariah
 
Jcb backhoe
Jcb backhoeJcb backhoe
Jcb backhoe
 
Startup Naming & Positioning
Startup Naming & PositioningStartup Naming & Positioning
Startup Naming & Positioning
 
Absolut vodka (marketing project)
Absolut vodka (marketing project)Absolut vodka (marketing project)
Absolut vodka (marketing project)
 
Carlos Marx, teorías y conceptos sobre las clases sociales
Carlos Marx, teorías y conceptos sobre las clases socialesCarlos Marx, teorías y conceptos sobre las clases sociales
Carlos Marx, teorías y conceptos sobre las clases sociales
 
ιουστινιανός και το έργο του
ιουστινιανός και το έργο τουιουστινιανός και το έργο του
ιουστινιανός και το έργο του
 

Similaire à CVGIP-2015

Image Reconstruction Using Sparse Approximation
Image Reconstruction Using Sparse ApproximationImage Reconstruction Using Sparse Approximation
Image Reconstruction Using Sparse ApproximationChristopher Neighbor
 
19 9742 the application paper id 0016(edit ty)
19 9742 the application paper id 0016(edit ty)19 9742 the application paper id 0016(edit ty)
19 9742 the application paper id 0016(edit ty)IAESIJEECS
 
Quality Prediction in Fingerprint Compression
Quality Prediction in Fingerprint CompressionQuality Prediction in Fingerprint Compression
Quality Prediction in Fingerprint CompressionIJTET Journal
 
3 d vision based dietary inspection for the central kitchen automation
3 d vision based dietary inspection for the central kitchen automation3 d vision based dietary inspection for the central kitchen automation
3 d vision based dietary inspection for the central kitchen automationcsandit
 
Hyperspectral unmixing using novel conversion model.ppt
Hyperspectral unmixing using novel conversion model.pptHyperspectral unmixing using novel conversion model.ppt
Hyperspectral unmixing using novel conversion model.pptgrssieee
 
Shape and Level Bottles Detection Using Local Standard Deviation and Hough Tr...
Shape and Level Bottles Detection Using Local Standard Deviation and Hough Tr...Shape and Level Bottles Detection Using Local Standard Deviation and Hough Tr...
Shape and Level Bottles Detection Using Local Standard Deviation and Hough Tr...IJECEIAES
 
Block coordinate descent__in_computer_vision
Block coordinate descent__in_computer_visionBlock coordinate descent__in_computer_vision
Block coordinate descent__in_computer_visionYoussefKitane
 
Report on "Food recognition system using BOF model"
Report on "Food recognition system using BOF model"Report on "Food recognition system using BOF model"
Report on "Food recognition system using BOF model"madhuripallod
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
BEHAVIOR STUDY OF ENTROPY IN A DIGITAL IMAGE THROUGH AN ITERATIVE ALGORITHM O...
BEHAVIOR STUDY OF ENTROPY IN A DIGITAL IMAGE THROUGH AN ITERATIVE ALGORITHM O...BEHAVIOR STUDY OF ENTROPY IN A DIGITAL IMAGE THROUGH AN ITERATIVE ALGORITHM O...
BEHAVIOR STUDY OF ENTROPY IN A DIGITAL IMAGE THROUGH AN ITERATIVE ALGORITHM O...ijscmcj
 
INFLUENCE OF QUANTITY OF PRINCIPAL COMPONENT IN DISCRIMINATIVE FILTERING
INFLUENCE OF QUANTITY OF PRINCIPAL COMPONENT IN DISCRIMINATIVE FILTERINGINFLUENCE OF QUANTITY OF PRINCIPAL COMPONENT IN DISCRIMINATIVE FILTERING
INFLUENCE OF QUANTITY OF PRINCIPAL COMPONENT IN DISCRIMINATIVE FILTERINGcsandit
 
An Efficient APOA Techniques For Generalized Residual Vector Quantization Bas...
An Efficient APOA Techniques For Generalized Residual Vector Quantization Bas...An Efficient APOA Techniques For Generalized Residual Vector Quantization Bas...
An Efficient APOA Techniques For Generalized Residual Vector Quantization Bas...IJCSIS Research Publications
 
Detection and Classification in Hyperspectral Images using Rate Distortion an...
Detection and Classification in Hyperspectral Images using Rate Distortion an...Detection and Classification in Hyperspectral Images using Rate Distortion an...
Detection and Classification in Hyperspectral Images using Rate Distortion an...Pioneer Natural Resources
 
Principle Component Analysis for Classification of the Quality of Aromatic Rice
Principle Component Analysis for Classification of the Quality of Aromatic RicePrinciple Component Analysis for Classification of the Quality of Aromatic Rice
Principle Component Analysis for Classification of the Quality of Aromatic RiceIJCSIS Research Publications
 

Similaire à CVGIP-2015 (20)

Image Reconstruction Using Sparse Approximation
Image Reconstruction Using Sparse ApproximationImage Reconstruction Using Sparse Approximation
Image Reconstruction Using Sparse Approximation
 
19 9742 the application paper id 0016(edit ty)
19 9742 the application paper id 0016(edit ty)19 9742 the application paper id 0016(edit ty)
19 9742 the application paper id 0016(edit ty)
 
Quality Prediction in Fingerprint Compression
Quality Prediction in Fingerprint CompressionQuality Prediction in Fingerprint Compression
Quality Prediction in Fingerprint Compression
 
3 d vision based dietary inspection for the central kitchen automation
3 d vision based dietary inspection for the central kitchen automation3 d vision based dietary inspection for the central kitchen automation
3 d vision based dietary inspection for the central kitchen automation
 
Hyperspectral unmixing using novel conversion model.ppt
Hyperspectral unmixing using novel conversion model.pptHyperspectral unmixing using novel conversion model.ppt
Hyperspectral unmixing using novel conversion model.ppt
 
Matlab abstract 2016
Matlab abstract 2016Matlab abstract 2016
Matlab abstract 2016
 
Shape and Level Bottles Detection Using Local Standard Deviation and Hough Tr...
Shape and Level Bottles Detection Using Local Standard Deviation and Hough Tr...Shape and Level Bottles Detection Using Local Standard Deviation and Hough Tr...
Shape and Level Bottles Detection Using Local Standard Deviation and Hough Tr...
 
336 1170-1-pb tocher anderson
336 1170-1-pb  tocher anderson336 1170-1-pb  tocher anderson
336 1170-1-pb tocher anderson
 
336 1170-1-pb tocher anderson
336 1170-1-pb  tocher anderson336 1170-1-pb  tocher anderson
336 1170-1-pb tocher anderson
 
336 1170-1-pb tocher anderson
336 1170-1-pb  tocher anderson336 1170-1-pb  tocher anderson
336 1170-1-pb tocher anderson
 
Block coordinate descent__in_computer_vision
Block coordinate descent__in_computer_visionBlock coordinate descent__in_computer_vision
Block coordinate descent__in_computer_vision
 
Report on "Food recognition system using BOF model"
Report on "Food recognition system using BOF model"Report on "Food recognition system using BOF model"
Report on "Food recognition system using BOF model"
 
E1803053238
E1803053238E1803053238
E1803053238
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
BEHAVIOR STUDY OF ENTROPY IN A DIGITAL IMAGE THROUGH AN ITERATIVE ALGORITHM O...
BEHAVIOR STUDY OF ENTROPY IN A DIGITAL IMAGE THROUGH AN ITERATIVE ALGORITHM O...BEHAVIOR STUDY OF ENTROPY IN A DIGITAL IMAGE THROUGH AN ITERATIVE ALGORITHM O...
BEHAVIOR STUDY OF ENTROPY IN A DIGITAL IMAGE THROUGH AN ITERATIVE ALGORITHM O...
 
INFLUENCE OF QUANTITY OF PRINCIPAL COMPONENT IN DISCRIMINATIVE FILTERING
INFLUENCE OF QUANTITY OF PRINCIPAL COMPONENT IN DISCRIMINATIVE FILTERINGINFLUENCE OF QUANTITY OF PRINCIPAL COMPONENT IN DISCRIMINATIVE FILTERING
INFLUENCE OF QUANTITY OF PRINCIPAL COMPONENT IN DISCRIMINATIVE FILTERING
 
An Efficient APOA Techniques For Generalized Residual Vector Quantization Bas...
An Efficient APOA Techniques For Generalized Residual Vector Quantization Bas...An Efficient APOA Techniques For Generalized Residual Vector Quantization Bas...
An Efficient APOA Techniques For Generalized Residual Vector Quantization Bas...
 
Detection and Classification in Hyperspectral Images using Rate Distortion an...
Detection and Classification in Hyperspectral Images using Rate Distortion an...Detection and Classification in Hyperspectral Images using Rate Distortion an...
Detection and Classification in Hyperspectral Images using Rate Distortion an...
 
F010224446
F010224446F010224446
F010224446
 
Principle Component Analysis for Classification of the Quality of Aromatic Rice
Principle Component Analysis for Classification of the Quality of Aromatic RicePrinciple Component Analysis for Classification of the Quality of Aromatic Rice
Principle Component Analysis for Classification of the Quality of Aromatic Rice
 

CVGIP-2015

  • 1. VISUAL APPEARANCE-BASED FOOD RECOGNITION USING SPARSE CODING 1 Duan-Yu Chen (陳敦裕), 1 Hao-Syuan Wang (王皓玄), 2 Yue-Min Jiang (蔣岳珉) and 2 Szu-Han Tsao (曹思漢) 1 Dept. of Electrical Engineering, Yuan Ze University, Taiwan 2 Industrial Technology Research Institute, SSTC, Taiwan, ROC E-mail: dychen@saturn.yzu.edu.tw, s1000654@mail.yzu.edu.tw, jongfat@itri.org.tw, alfredtzao@itri.org.tw ABSTRACT In recent years, food recognition techniques have attracted a lot of attention due to the emerging personal healthcare. However, image-based food recognition is a challenging task because of the variety of food’s appearance even though images captured from the same food class. In this work, instead of the use of feature- based approach, patch-based visual appearance is employed directly. Then sparse coding is used for dictionary learning. Moreover, the atom distribution is also computed for further classifier training that is conducted by SVM (support vector machine). Experiment results show that the recognition rate about 90% can be achieved when the target class is recognized among the top 2 ranking. That shows our proposed approach is practical for real world environment. Keywords Food recognition; Sparse Coding; Support Vector Machine 1. INTRODUCTION Visual-based food recognition is one of the emerging applications of object recognition technology, because it will help estimate food calories and analyze people's eating habits for personal healthcare. Therefore, several works have been developed so far [1-8]. Research in the computer vision community has explored the recognition of either a small sub-set of food types in controlled laboratory environments [1-2] or food images obtained from the web [3]. However, there have been only a few implemented systems that address the challenge of food recognition from images captured in real world environment [7]. Moreover, most of them employed feature-based approach, such as SIFT [8]. 
This kind of method could work well in constrained environment. The most difficult thing of this kind of method is to find an invariant feature that is robust to distinct kinds of visual appearance resulted fromdifferent food placement. Food placement in real-world environment is basically random. Therefore, in this work, to overcome this problem the patch-based visual appearance is used directly without the use of previous derived features. The rest of this work is organized as follows. In Section 2, the proposed patch-based food recognition using sparse coding is introduced. In Section 3, preliminary experiment results are presented and some concluding remarks are drawn in Section 4. 2. PATCH-BASED FOOD RECOGNITION USING SPARSECODING Fig. 1. Overview of the proposed approach In what follows, we shall describe the proposed approach. The workflow is shown in Fig. 1. Images taken from CCD camera are first transformed to HSV color space and patches empirically set to 1616 are directly
  • 2. extracted from HSV color channels. Sparse coding is then selected for dictionary learning for each food category. Consequently, the atom probability distribution for each training sample is computed for further classifier training that is conducted by SVM. 2.1 Dictionary Learning Using Sparse Coding Fig. 2. Examples of 25 food classes In the current work, totally 25 food categories that are often appear in meal boxes are selected as they are demonstrated in Fig. 2. To learn the dictionary for the input image i, we apply dictionary learning technique via sparse coding [9-10] with the training patches extracted from training samples themselves to learn dictionary Df. Sparse coding is the technique of finding a sparse representation for a signal with a small number of nonzero or significant coefficients corresponding to the atoms in a dictionary. Here, we intend to construct a dictionary Df containing the local structure of textures for sparsely representing each patch. To achieve visual appearance representation from color images, we transform each color image patch from three-dimension to one- dimension. By extracting a set of training patches , k = 1, 2, …, p, from training samples, learning of the dictionary Df can be achieved by solving the following optimization problem: , (1) where denotes the sparse coefficients of with respect to Df , and is a regularization parameter. in our method, an efficient online dictionary learning algorithm proposed in [10] is used to solve eq. (1). However, to consider the computational complexity of dictionary learning, we propose to use the smaller size of the “mini- batch” parameter used in the online dictionary learning algorithm 0[10] to significantly reduce the dictionary learning complexity. The dictionaries learned for 25 food categories are demonstrated in Fig. 3. Fig. 3. 
25 dictionaries obtained using sparse coding After obtaining the dictionaries, we formulate the problemof food recognition as a sparse coding problem as follows: ,(2) where represents the k-th patch. are the sparse coefficients of with respect to , , and l denotes the sparsity or maximum number of nonzero coefficients of . Since l0-minimization is hard to optimize, based on [11-12], solving the l0- minimization problem in eq. (2) can be cast to solve the following l1-minimization problem: ,(3) where denotes the solution minimizing eq. (3) and is a regularization parameter. To solve (3), we apply a very efficient implementation for sparse coding provided in [10]. Each patch can then be reconstructed and used to recover depending on the corresponding nonzero coefficients in . 2.2 Atom-Frequency based Food Category Feature Extraction After obtaining the sparse coefficients for each training sample, ideally its corresponding category can be recognized according to its atom distribution. However, as demonstrated in Fig. 4, patches extracted from a category could have diverse distribution over 25 categories. Therefore, in order to overcome this problem,
  • 3. the atom distribution of each training sample is considered as a feature vector for further classifier training using SVM. 0 500 1000 1500 2000 2500 3000 3500 0 0.002 0.004 0.006 0.008 0.01 0.012 0.014 0.016 0.018 Fig. 4. An example of atomdistribution obtained using a testing image 3. EXPERIMENT RESULTS To evaluate the performance of the proposed patch- based food image recognition, the proposed method was implemented in MATLAB® on a personal computer equipped with Intel® Core™ i3-4130 CPU @ 3.40GHz processor and 4 GB memory. The parameter settings of the proposed method are described as follows. In the dictionary learning step, we used the online dictionary learning implementation provided in [10] with the suggested regularization parameter λ used in Eq. (1) set to 0.15. In addition, the dictionary size (number of atoms) used in the online dictionary learning for our method is set to 128 since we can observe from Fig. 5 that this setting has the best performance among five dictionary sizes 64, 128, 256, 384 and 512. Fig. 5. Precision obtained by varying the number of atoms per dictionary In the sparse coding step, the implementation with the number of nonzero coefficients set to at most 10 (L = 10 in Eq. (2)) as suggested in 0 was employed. The patch size for each test image, and the number of dictionary training iterations are set to , and 100, respectively. A smaller value of L leads to lower computational complexity, but fewer employed atoms in the dictionary, which may degrade the performance of food recognition. On the contrary, larger L leads to higher computational complexity, but the performance improvement will be saturated when L exceeds a certain number (about 10 in our experiments). Similar characteristics are also valid for the parameter settings of the dictionary size (number of atoms) and the number of dictionary training iterations. For the training dataset, we collect 50 samples for each category with the resolution 640480. 
For performance evaluation, the precision is evaluated fromtop 1 to top 5 rankings since it is challenging to have the target recognized in exactly ranked top 1. In Fig. 6, we can observe that we achieved about 65% for top-1 accuracy and larger than 90% for top-2 accuracy. In addition, for top-3 to top-5 accuracy we achieved larger than 96%. The results show that the proposed approach is promising for real world applications. F ig. 6. Precision evaluation fromtop 1 to top 5 rankings For the miss-classifications, some examples are demonstrated in Fig. 7. Noodles in Fig. 7(a) are classified as loofah since most of their patches are similar in their local visual appearance. Despite of some light green patches in Fig. 7(b), most of loofah patches are shown with white patches those are similar to the noodles’ patches. Fig. 7(c) shows some pieces of sweet potato that are highly similar to the pumpkin in Fig. 7(d) with almost patches in orange-like color and even the shape of pieces is similar as well. In Fig. 7(e), the steamed squash has similar color, piece size and shape while comparing it to the steamed potato shown in Fig. 7(f). For the elapsed time needed for the proposed approach, we conduct the experiment using different test sample sizes with being varied from resolution 100100 to 400400. In Table 1, on average 31 seconds are still needed for the lowest image resolution among four settings. For the test sample size up to 400400, on average we need even 510 seconds to obtain the
  • 4. recognition result. It is clear that the proposed approach in its current form cannot achieve real-time food recognition, because computing the sparse coding coefficients has high computational complexity. However, the experiment results show that our proposed approach can achieve over 90% accuracy when the top-2 rankings are considered.

Fig. 7. Examples of misclassifications: (a) noodles recognized as (b) loofah; (c) sweet potato recognized as (d) pumpkin; (e) squash recognized as (f) potato

Table 1. Elapsed Time Evaluated from Different Test Sample Sizes
Test Sample Size      100×100   200×200   300×300   400×400
Elapsed Time (sec)    31        205       345       510

4. CONCLUSION
In this work, instead of a feature-based approach, patch-based visual appearance has been employed directly. Sparse coding has then been used for dictionary learning. Moreover, the atom distribution has been computed for further classifier training conducted by SVM. Experiment results have shown that recognition rates of about 65% and 90% are achieved when the target class is recognized at the top 1 and among the top 2 rankings, respectively. This shows that our proposed approach is feasible for real world applications.

REFERENCES
[1] M. Chen, K. Dhingra, W. Wu, L. Yang, R. Sukthankar, and J. Yang, “Pfid: Pittsburgh Fast-food Image Dataset,” Proc. IEEE International Conference on Image Processing, 2009.
[2] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, “Food Recognition Using Statistics of Pairwise Local Features,” Proc. IEEE International Conference on Computer Vision and Pattern Recognition, 2010.
[3] H. Hoashi, T. Joutou, and K. Yanai, “Image Recognition of 85 Food Categories by Feature Fusion,” Proc. IEEE International Symposium on Multimedia, 2010.
[4] Y. Kawano, and K. Yanai, “Foodcam: A Real-time Food Recognition System on A Smartphone,” Multimedia Tools and Applications, Vol. 24, 2014.
[5] Y. Matsuda, H. Hoashi, and K.
Yanai, “Recognition of Multiple-food Images by Detecting Candidate Regions,” Proc. IEEE International Conference on Multimedia and Expo, 2012.
[6] S. Yang, M. Chen, D. Pomerleau, and R. Sukthankar, “Food Recognition Using Statistics of Pairwise Local Features,” Proc. IEEE International Conference on Computer Vision and Pattern Recognition, 2010.
[7] K. Kitamura, C. de Silva, T. Yamasaki, and K. Aizawa, “Image processing based approach to food balance analysis for personal food logging,” Proc. IEEE International Conference on Multimedia, 2010.
[8] V. Bettadapura, E. Thomaz, A. Parnami, G. D. Abowd, and I. Essa, “Leveraging Context to Support Automated Food Recognition in Restaurants,” Proc. IEEE Winter Conference on Applications of Computer Vision, 2015.
[9] M. Aharon, M. Elad, and A. M. Bruckstein, “The K-SVD: an algorithm for designing of overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
[10] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse coding,” J. Mach. Learn. Res., vol. 11, pp. 19–60, 2010.
[11] D. L. Donoho, “Compressed sensing,” IEEE Trans. Info. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[12] A. M. Bruckstein, D. L. Donoho, and M. Elad, “From sparse solutions of systems of equations to sparse modeling of signals and images,” SIAM Rev., vol. 51, no. 1, pp. 34–81, Feb. 2009.