The 2014 International Conference on Image Processing, Computer Vision, & Pattern Recognition (IPCV 2014)
An Algorithm for Mobile Vision-Based Localization
of Skewed Nutrition Labels that Maximizes
Specificity
Vladimir Kulyukin
Department of Computer Science
Utah State University
Logan, UT, USA
vladimir.kulyukin@usu.edu
Christopher Blay
Department of Computer Science
Utah State University
Logan, UT, USA
chris.b.blay@gmail.com
Abstract—An algorithm is presented for mobile vision-based localization of skewed nutrition labels on grocery packages that maximizes specificity, i.e., the percentage of true negative matches out of all possible negative matches. The algorithm works on frames captured from the smartphone camera's video stream and localizes nutrition labels skewed up to 35-40 degrees in either direction from the vertical axis of the captured frame. The algorithm uses three image processing methods: edge detection, line detection, and corner detection. It targets medium- to high-end mobile devices with single- or quad-core ARM systems. Since cameras on these devices capture several frames per second, the algorithm is designed to minimize false positives rather than maximize true ones, because at such frequent frame capture rates it is far more important for the overall performance to minimize the processing time per frame. The algorithm is implemented on the Google Nexus 7 running Android 4.3. Evaluation was done on 378 frames, of which 266 contained NLs and 112 did not. The algorithm's performance, current limitations, and possible improvements are analyzed and discussed.
Keywords—computer vision; nutrition label localization; mobile
computing; text spotting; nutrition management
I. Introduction
Many nutritionists and dieticians consider proactive nutrition management to be a key factor in reducing and controlling cancer, diabetes, and other illnesses related to or caused by mismanaged or inadequate diets. According to the U.S. Department of Agriculture, U.S. residents have increased their caloric intake by 523 calories per day since 1970. Mismanaged diets are estimated to account for 30-35 percent of cancer cases [1]. A leading cause of mortality in men is prostate cancer; a leading cause of mortality in women is breast cancer. Approximately 47,000,000 U.S. residents have metabolic syndrome and diabetes. Diabetes in children appears to be closely related to increasing obesity levels. The current prevalence of diabetes in the world is estimated at 2.8 percent [2] and is expected to reach 4.4 percent by 2030. Some long-term complications of diabetes are blindness, kidney failure, and amputations. Nutrition labels (NLs) remain the main source of nutritional information on product packages [3, 4]. Therefore, enabling customers to use computer vision on their smartphones will likely result in greater consumer awareness of the caloric and nutritional content of purchased grocery products.
Figure 1. Skewed NL with vertical axis
In our previous research, we developed a vision-based localization algorithm for horizontally or vertically aligned NLs on smartphones [5]. The new algorithm, presented in this paper, improves on our previous algorithm in that it handles not only aligned NLs but also those that are skewed up to 35-40 degrees from the vertical axis of the captured frame. Figure 1 shows an example of such a skewed NL with the vertical axis of the captured frame denoted by a white line. Another improvement designed and implemented in the new algorithm is the rapid detection of the presence of an NL in each frame, which improves the run time, because the new algorithm fails fast and proceeds to the next frame from the video stream. The new algorithm targets medium- to high-end mobile devices with single- or quad-core ARM systems. Since cameras on these devices capture several frames per second, the algorithm is designed to minimize false positives rather than maximize true ones, because at such frequent frame capture rates it is far more important to minimize the processing time per frame.
The remainder of our paper is organized as follows. In Section II, we present our previous work on accessible shopping and nutrition management to give the reader a broader context of the research and development presented in this paper. In Section III, we outline the details of our algorithm. In Section IV, we present the experiments with our algorithm and discuss our results. In Section V, we present our conclusions and outline several directions for future work.
II. Previous Work
In 2006, our laboratory began work on ShopTalk, a wearable system for independent blind supermarket shopping [6]. In 2008-2009, ShopTalk was ported to the Nokia E70 smartphone connected to a Bluetooth barcode pencil scanner [7]. In 2010, we began our work on computer vision techniques for eyes-free barcode scanning [8]. In 2013, we published several algorithms for localizing skewed barcodes as well as horizontally or vertically aligned NLs [5, 9]. The algorithm presented in this paper improves on the previous NL localization algorithm by relaxing the NL alignment constraint by up to 35 to 40 degrees in either direction from the vertical orientation axis of the captured frame.
Modern nutrition management system designers and developers assume that users understand how to collect nutritional data and can be triggered into data collection with digital prompts (e.g., email or SMS). Such systems often underperform, because many users find it difficult to integrate nutrition data collection into their daily activities due to lack of time, motivation, or training. Eventually they turn off or ignore digital stimuli [10].
To overcome these challenges, in 2012 we began to develop a Persuasive NUTrition Management System (PNUTS) [5]. PNUTS seeks to shift current research and clinical practices in nutrition management toward persuasion, automated nutritional information extraction and processing, and context-sensitive nutrition decision support. PNUTS is based on a nutrition management approach inspired by the Fogg Behavior Model (FBM) [10], which states that motivation alone is insufficient to stimulate target behaviors. Even a motivated user must have both the ability to execute a behavior and a trigger to engage in that behavior at an appropriate place or time.
Another frequent assumption, which is not always accurate, is that consumers and patients are either more skilled than they actually are or can be quickly trained to obtain the required skills. Since training is difficult and time consuming, a more promising path is to make target behaviors easier and more intuitive to execute for the average smartphone user. Vision-based extraction of nutritional information from NLs on product packages is a fundamental step in making proactive nutrition management easier and more intuitive, because it improves the user's ability to engage in the target behavior of collecting and processing nutritional data.
III. Skewed NL Localization Algorithm
A. Detection of Edges, Lines, and Corners
Our NL detection algorithm uses three image processing methods: edge detection, line detection, and corner detection. Edge detection transforms images into bitmaps where every pixel is classified as belonging or not belonging to an edge. The algorithm uses the Canny edge detector (CED) [11]. After the edges are detected (see Fig. 2), the image is processed with the Hough Transform (HT) [12] to detect lines (see Fig. 3). The HT detects lines by mapping edge pixels into a polar parameter space, where a line is represented as ρ = x·cosθ + y·sinθ, and selecting the (ρ, θ) pairs voted for by many edge pixels.
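The following sketch shows how these two stages can be chained with OpenCV's Canny detector and standard Hough transform; it is a minimal illustration in Python, and the thresholds are our own illustrative assumptions, not the values used in the implementation:

import cv2
import numpy as np

def detect_edges_and_lines(frame_bgr):
    # Canny produces the binary edge bitmap; the Hough transform then
    # accumulates votes in (rho, theta) space and returns the peak lines.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)  # illustrative thresholds
    lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=120)
    return edges, [] if lines is None else [tuple(l[0]) for l in lines]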
Figure 2. Original NL (left); NL with edges (right)
Corner detection is done primarily for text spotting, because text segments tend to contain many distinct corners. Thus, image segments with higher concentrations of corners are likely to contain text. Corners are detected with the dilate-erode method [13] (see Fig. 4), applied in two stages with different 5x5 kernels. The first stage uses a 5x5 cross dilate kernel for horizontal and vertical expansion, followed by a 5x5 diamond erode kernel for diagonal shrinking. The resulting image is compared with the original, which yields the pixels that lie at the corners of axis-aligned rectangles.
Figure 3. NL with edges (left); NL with lines (right)
The second stage uses a 5x5 X-shape dilate kernel to expand in the two diagonal directions, followed by a 5x5 square erode kernel to shrink the image horizontally and vertically. The resulting image is compared with the original, which identifies the pixels that lie at 45-degree corners. The corners from both stages are combined into a final set of detected corners.
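A minimal sketch of this two-stage detector in Python with OpenCV follows; the kernel constructions and the threshold of 40 are our own assumptions about one way to realize the comparison step, not the exact implementation:

import cv2
import numpy as np

def diamond5():
    # 5x5 diamond: ones where the Manhattan distance to the center is <= 2
    d = np.zeros((5, 5), np.uint8)
    for i in range(5):
        for j in range(5):
            if abs(i - 2) + abs(j - 2) <= 2:
                d[i, j] = 1
    return d

def detect_corners(gray):
    cross = cv2.getStructuringElement(cv2.MORPH_CROSS, (5, 5))
    xshape = np.eye(5, dtype=np.uint8) | np.fliplr(np.eye(5, dtype=np.uint8))
    square = np.ones((5, 5), np.uint8)
    # Stage one: expand horizontally/vertically, then shrink diagonally.
    s1 = cv2.erode(cv2.dilate(gray, cross), diamond5())
    # Stage two: expand diagonally, then shrink horizontally/vertically.
    s2 = cv2.erode(cv2.dilate(gray, xshape), square)
    # Compare each stage's result with the original and combine the corners.
    _, b1 = cv2.threshold(cv2.absdiff(s1, gray), 40, 255, cv2.THRESH_BINARY)
    _, b2 = cv2.threshold(cv2.absdiff(s2, gray), 40, 255, cv2.THRESH_BINARY)
    return cv2.bitwise_or(b1, b2)  # final corner bitmap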
In Fig. 4, the top sequence of images corresponds to stage one, where the cross and diamond kernels are used to detect aligned corners. The bottom sequence corresponds to stage two, where the X-shape and square kernels are used to detect 45-degree corners. Step one shows the original input of each stage, step two the image after dilation, step three the image after erosion, and step four the difference between the original and eroded versions. The resulting corners are outlined in red in each step to illustrate how the dilate-erode operations modify the input.
Figure 4. Corner detection steps
Fig. 5 demonstrates the dilate-erode algorithm on an image segment that contains text. The dilation steps are substantially whiter than their inputs, because the appropriate kernel is used to expand white pixels. The erosion steps then partially reverse this whitening effect by expanding darker pixels. The detected corners are the pixels with the largest differences between the original image and the result image.
Our previous NL localization algorithm [5] was based on the assumption that the NL exists in the image and is horizontally or vertically aligned with the smartphone's camera. Unfortunately, these conditions sometimes do not hold in the real world due to shaking hands or failing eyesight. The exact problem that the new algorithm addresses is twofold: does a given input image contain a skewed NL, and, if so, within which axis-aligned rectangular area can the NL be localized? In this investigation, a skewed NL is one that has been rotated away from the vertical alignment axis by up to 35 to 40 degrees in either direction, i.e., left or right. An additional objective is to keep the processing time for each frame to about one second.
Figure 5. Corner detection for text spotting
B. Corner Detection and Analysis
Before NL localization proper begins, a rotation correction step is performed to align inputs that may be only nearly aligned. This correction takes advantage of the high number of horizontal lines found within NLs. All detected lines that are within 35 to 40 degrees of horizontal in either direction (i.e., up or down) are used to compute an average rotation, which is then used to perform the appropriate correcting rotation. Corner detection is executed after the rotation: the dilate-erode corner detector is applied to retrieve a two-dimensional bitmap where true (white) pixels correspond to detected corners and all other (false) pixels are black. Fig. 5 (right) shows the corners detected in the frame shown in Fig. 5 (left).
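A minimal sketch of this rotation-correction step, assuming the (rho, theta) lines returned by the Hough transform and OpenCV's counterclockwise-positive angle convention (the sign may need flipping under a different convention):

import cv2
import numpy as np

def correct_rotation(gray, lines, tol_deg=40.0):
    # For (rho, theta) lines, a horizontal line has theta near 90 degrees;
    # the deviation from 90 is the line's tilt away from horizontal.
    tilts = [np.degrees(theta) - 90.0 for _, theta in lines
             if abs(np.degrees(theta) - 90.0) <= tol_deg]
    if not tilts:
        return gray
    h, w = gray.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), float(np.mean(tilts)), 1.0)
    return cv2.warpAffine(gray, m, (w, h))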
The dilate-erode corner detector is used specifically because of its high sensitivity to contrasted text, which is why we assume that the region bounded by the detected corners contains a large amount of text. Areas of the input image that are not in focus do not produce many corner detections and tend not to lie within the needed projection boundaries.
Two projections are computed after the corners are detected. The projections are sums of the true pixels for each row and column: the row projection has an entry for each row in the image, while the column projection has an entry for each column. The purpose of these projections is to determine the top, bottom, left, and right boundaries of the region in which most corners lie. The projection values are averaged, and a projection threshold is set to twice the average. Once the threshold is selected, the first and last indexes of each projection greater than the threshold are selected as the boundaries of that projection.
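A minimal sketch of the projection computation, assuming the corner bitmap produced by the dilate-erode detector:

import numpy as np

def projection_bounds(corner_bitmap):
    binary = (corner_bitmap > 0).astype(np.int32)
    row_proj = binary.sum(axis=1)  # one entry per image row
    col_proj = binary.sum(axis=0)  # one entry per image column

    def first_last_above(proj):
        # Threshold at twice the average; take first/last indexes above it.
        above = np.flatnonzero(proj > 2.0 * proj.mean())
        return (int(above[0]), int(above[-1])) if above.size else (0, len(proj) - 1)

    top, bottom = first_last_above(row_proj)
    left, right = first_last_above(col_proj)
    return top, bottom, left, right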
Figure 5. NL (left); detected corners (right)
C. Selection of Boundary Lines
After the four corner projections have been computed, the
next step is to select the Hough lines that are closest to the
boundaries selected on the basis of the four corner projections.
In two images of Fig. 6 the four light blue lines are the lines
drawn on the basis of the four corner projection counts. The
dark blue lines show the lines detected by the Hough trans-
form. In Fig. 6 (left), the bottom light blue line is initially
chosen conservatively where the row corner projections drop
below a threshold. If there is evidence that there are some
corners present after the initially selected bottom lines, the
bottom line is moved as far below as possible, as shown in
Fig. 6 (right).
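A minimal sketch of the selection step, assuming (rho, theta) lines already filtered by orientation, so that rho approximates the row offset for near-horizontal lines and the column offset for near-vertical ones:

def nearest_line(boundary, candidate_lines):
    # Pick the detected line whose offset is closest to the projection boundary.
    if not candidate_lines:
        return None
    return min(candidate_lines, key=lambda line: abs(abs(line[0]) - boundary))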
Figure 6. Initial boundaries (left); Final boundaries (right)
The bounded area is not always perfectly rectangular, which makes integration with later analysis stages that expect a rectangular area less straightforward. To overcome this problem, a rectangle is placed around the selected Hough boundary lines. After the four intersection coordinates are computed, their components are compared and combined to find the smallest rectangle that fits around the bounded area. This rectangle is the final result of the NL localization algorithm. As was stated before, the four corners found by the algorithm can be passed to other algorithms such as row dividing, word splitting, and OCR; these are beyond the scope of this paper. Fig. 7 shows a skewed NL localized by our algorithm.
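A minimal sketch of the final combination step, assuming the four intersection points as (x, y) pairs:

def enclosing_rectangle(points):
    # Smallest axis-aligned rectangle around the four intersection points.
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)  # left, top, right, bottom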
Figure 7. Localized Skewed NL
IV. Experiments & Results
A. Experiment Design
We assembled 378 images captured with a Google Nexus 7 running Android 4.3 during a typical shopping session at a local supermarket. Of these images, 266 contained an NL and 112 did not. Our skewed NL localization algorithm was implemented and tested on the same platform with these images.
Figure 8. Complete (left) and partial (right) true positives
We manually categorized the results into five categories: complete true positives, partial true positives, true negatives, false positives, and false negatives. A complete true positive is an image where a complete NL was localized. A partial true positive is an image where only a part of the NL was localized by the algorithm. Fig. 8 shows examples of complete and partial true positives.
Fig. 9 shows another example of complete and partial true positives. The image on the left was classified as a complete true positive, because the part of the NL that was not detected is insignificant and will likely be recovered through simple padding in subsequent processing. The image on the right, on the other hand, was classified as a partial true positive. While the localized area does contain most of the NL, some essential text in the left part of the NL is excluded, which will likely cause failures in subsequent processing. In Fig. 10, the left image technically does not include the entire NL, because the list of ingredients is only partially included. However, we classified it as a complete true positive, since it includes the entire nutrition facts table. The right image of Fig. 10, on the other hand, is classified as a partial true positive, because some parts of the nutrition facts table are not included in the detected area.
Figure 9. Complete (left) and partial (right) true positives
Figure 10. Complete (left) and partial (right) true positives
B. Results
Of the 266 images that contained NLs, 83 were classified as complete true positives and 27 as partial true positives, which gives a total true positive rate of 42% and a false negative rate of 58%. All test images with no NLs were classified as true negatives. The remainder of our analysis was done in terms of precision, recall, specificity, and accuracy.
Precision is the percentage of complete true positive matches out of all true positive matches. Recall is the percentage of true positive matches out of all possible positive matches. Specificity is the percentage of true negative matches out of all possible negative matches. Accuracy is the percentage of true matches out of all possible matches.
Table I. NL Localization Results

PREC      TR       CR       PTR      SP      ACC
0.7632    0.422    0.3580   0.1475   1.0     0.5916

Table I gives the NL localization results, where PREC stands for "precision," TR for "total recall," CR for "complete recall," PTR for "partial recall," SP for "specificity," and ACC for "accuracy." While the total and complete recall numbers are somewhat low, this is a necessary trade-off of maximizing specificity. Recall from Section I that we designed our algorithm to maximize specificity; in other words, the algorithm is less likely to detect NLs in images where no NLs are present than in images where they are present. As we argued above, lower recall and precision may not matter much because of the fast rate at which input images are processed on target devices, but there is definitely room for improvement.
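The four measures can be computed from the five outcome counts as in the following sketch, where both complete and partial true positives count as positive detections; the variable names are ours:

def nl_metrics(complete_tp, partial_tp, tn, fp, fn):
    tp = complete_tp + partial_tp
    precision = complete_tp / tp        # complete TPs among all TPs
    total_recall = tp / (tp + fn)       # all TPs among images with NLs
    complete_recall = complete_tp / (tp + fn)
    partial_recall = partial_tp / (tp + fn)
    specificity = tn / (tn + fp)        # TNs among images without NLs
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return (precision, total_recall, complete_recall,
            partial_recall, specificity, accuracy)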
C. Limitations
The majority of false negative matches were caused by blurry images, which result from poor camera focus and instability. Both the Canny edge detector and the dilate-erode corner detector require rapid, contrasting changes to identify key points and lines of interest, which are meant to correspond directly to text and NL borders. These useful data cannot be retrieved from blurry images, which results in run-time detection failures. The only recourse for blurry inputs is improved camera focus and stability, both of which are outside the scope of this algorithm, because they are hardware problems; later smartphone models are likely to do better. The current implementation on the Android platform attempts to force focus at the image center, but this ability to request camera focus is not present in older Android versions. Over time, as device cameras improve and more devices run newer versions of Android, this limitation will have less impact on recall, but it will never be fixed entirely.
Bottles, bags, cans, and jars (see Fig. 11) account for a large share of the false negatives due to Hough line detection difficulties. One possibility to get around this limitation is a more rigorous line detection step in which a segmented Hough transform is performed and regions that contain connecting detected lines are grouped together. These grouped regions could be used to warp a curved image into a rectangular area for further analysis.
Smaller grocery packages (see Fig. 12) tend to have irregular NLs that place a large amount of information into tiny spaces. NLs with irregular layouts present an extremely difficult problem for analysis. Our algorithm better handles more traditional NL layouts with generally empty surrounding areas. As better analysis of corner projections and Hough lines is integrated into our algorithm, it will become possible to classify inputs as definitely traditional or more irregular. If this classification can work reliably, the method could switch to a much slower, generalized localization to produce better results for irregular layouts while still quickly returning results for more common ones.
Figure 11. NL with curved lines
Figure 12. Irregular NLs
V. Conclusions
We have made several interesting observations during our experiments. The row and column projections have two distinct patterns. The row projection tends to create evenly spaced short spikes for each line of text within the NL, while the column projection tends to contain one very large spike where the NL begins at the left, due to the sudden influx of detected text. We have not performed any in-depth analysis of these patterns. However, the projection data were collected for each processed image. We plan to investigate these patterns further, which will likely allow for run-time detection and correction of inputs at various rotations. For example, the column projections could be used for greater accuracy in determining the left and right bounds of the NL, while row projections could be used by later analysis steps such as row division. Certain projection profiles could eventually be used to select customized localization approaches at run time.
During our experiments with and iterative development of this algorithm, we took note of several possible improvements that could positively affect the algorithm's performance. First, since input images are generally not square, the HT returns more results for lines in the longer dimension, because they are more likely to pass the threshold. Consequently, specifying different thresholds for the two dimensions and combining them for various rotations may produce more consistent results.
Second, since only those Hough lines that are nearly vertical or horizontal are of use to this method, improvements can be made by allocating bins only for those θ and ρ combinations that are considered important. Fewer bins mean less memory to track all of them and fewer tests to determine which bins need to be incremented for a given input.
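One way to realize this in OpenCV (version 3 and later) is to restrict the accumulator to narrow angle windows with the min_theta and max_theta parameters of HoughLines; the 40-degree window below is an assumption matching the skew bound:

import cv2
import numpy as np

def restricted_hough(edges, window_deg=40.0, threshold=120):
    w = np.radians(window_deg)
    passes = [(0.0, w),                        # near-vertical, one tilt
              (np.pi - w, np.pi),              # near-vertical, other tilt
              (np.pi / 2 - w, np.pi / 2 + w)]  # near-horizontal
    out = []
    for lo, hi in passes:
        lines = cv2.HoughLines(edges, 1, np.pi / 180, threshold,
                               min_theta=lo, max_theta=hi)
        if lines is not None:
            out.extend(tuple(l[0]) for l in lines)
    return out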
Third, both row and column corner projections tend to produce distinct patterns that could be used to determine better boundaries. After collecting a large number of typical projections, further analysis can be performed to find generalizations resulting in a faster method to improve boundary selection.
Fourth, in principle, a much more intensive HT method can be developed that would divide the image into a grid of smaller segments and perform a separate HT within each segment (see the sketch after this paragraph). One advantage of this approach is the ability to look for skewed, curved, or even zigzagging lines across segments that could be connected into a longer line. While the performance penalty of this method could be quite high, it could allow for the detection and de-warping of oddly shaped NLs. Finally, a more careful analysis of the found Hough lines during the early rotation correction could allow us to detect and localize NLs at all possible rotations, not just skewed ones.
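A minimal sketch of the grid-segmented variant, using the probabilistic Hough transform per cell so that the resulting short segments can later be linked across neighboring cells; the grid size and parameters are illustrative assumptions:

import cv2
import numpy as np

def segmented_hough(edges, rows=4, cols=4, threshold=30):
    h, w = edges.shape
    segments = {}
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * h // rows, (r + 1) * h // rows
            x0, x1 = c * w // cols, (c + 1) * w // cols
            lines = cv2.HoughLinesP(edges[y0:y1, x0:x1], 1, np.pi / 180,
                                    threshold, minLineLength=10, maxLineGap=3)
            if lines is not None:
                # Shift endpoints back into whole-image coordinates.
                segments[(r, c)] = [(x0 + a, y0 + b, x0 + p, y0 + q)
                                    for a, b, p, q in (l[0] for l in lines)]
    return segments  # neighboring cells' segments can then be grouped/linked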
The U.S. Food and Drug Administration recently proposed
some changes to the design of NLs on product packages [14].
The new design is expected to change how serving sizes are
calculated and displayed. Percent daily values are expected to
shift to the left side of the NL, which allegedly will make them
easier to read. The new design will also require information
about added sugars as well as the counts for Vitamin D and
potassium. We would like to emphasize that this redesign,
which is expected to take at least two years, will not impact
the proposed algorithm, because the main tabular components
of the new NL design will remain the same. The nutritional
information in the new NLs will still be presented textually in
rows and columns. Therefore, the corner and line detection
and their projections will work as they work on the current NL
design.
References
[1] Anding, R. Nutrition Made Clear. The Great Courses, Chantilly, VA, 2009.
[2] Rubin, A. L. Diabetes for Dummies. 3rd Edition, Wiley Publishing, Inc., Hoboken, New Jersey, 2008.
[3] Nutrition Labeling and Education Act of 1990. http://en.wikipedia.org/wiki/Nutrition_Labeling_and_Education_Act_of_1990.
[4] Food Labelling to Advance Better Education for Life. Avail. at www.flabel.org/en.
[5] Kulyukin, V., Kutiyanawala, A., Zaman, T., and Clyde, S. "Vision-based localization and text chunking of nutrition fact tables on android smartphones." In Proc. International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV 2013), pp. 314-320, ISBN 1-60132-252-6, CSREA Press, Las Vegas, NV, USA, 2013.
[6] Nicholson, J. and Kulyukin, V. "ShopTalk: Independent Blind Shopping = Verbal Route Directions + Barcode Scans." In Proceedings of the 30th Annual Conference of the Rehabilitation Engineering and Assistive Technology Society of North America (RESNA 2007), June 2007, Phoenix, Arizona. Avail. on CD-ROM.
[7] Kulyukin, V. and Kutiyanawala, A. "Accessible Shopping Systems for Blind and Visually Impaired Individuals: Design Requirements and the State of the Art." The Open Rehabilitation Journal, ISSN 1874-9437, Volume 2, 2010, pp. 158-168, DOI: 10.2174/1874943701003010158.
[8] Kulyukin, V., Kutiyanawala, A., and Zaman, T. "Eyes-Free Barcode Detection on Smartphones with Niblack's Binarization and Support Vector Machines." In Proceedings of the 16th International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV 2012), Vol. I, pp. 284-290, CSREA Press, July 16-19, 2012, Las Vegas, Nevada, USA. ISBN: 1-60132-223-2, 1-60132-224-0.
[9] Kulyukin, V. and Zaman, T. "Vision-Based Localization of Skewed UPC Barcodes on Smartphones." In Proceedings of the International Conference on Image Processing, Computer Vision, & Pattern Recognition (IPCV 2013), pp. 344-350, ISBN 1-60132-252-6, CSREA Press, Las Vegas, NV, USA.
[10] Fogg, B. J. "A behavior model for persuasive design." In Proc. 4th International Conference on Persuasive Technology, Article 40, ACM, New York, USA, 2009.
[11] Canny, J. F. "A computational approach to edge detection." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, 1986, pp. 679-698.
[12] Duda, R. O. and Hart, P. E. "Use of the Hough transformation to detect lines and curves in pictures." Comm. ACM, Vol. 15, pp. 11-15, January 1972.
[13] Laganiere, R. OpenCV 2 Computer Vision Application Programming Cookbook. Packt Publishing Ltd, 2011.
[14] Tavernise, S. "New F.D.A. nutrition labels would make 'serving sizes' reflect actual servings." New York Times, Feb. 27, 2014.

 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxBerniceCayabyab1
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPirithiRaju
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationColumbia Weather Systems
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...Universidade Federal de Sergipe - UFS
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxJorenAcuavera1
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXDole Philippines School
 
Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Tamer Koksalan, PhD
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuinethapagita
 

Dernier (20)

User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)User Guide: Orion™ Weather Station (Columbia Weather Systems)
User Guide: Orion™ Weather Station (Columbia Weather Systems)
 
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCRCall Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
Call Girls In Nihal Vihar Delhi ❤️8860477959 Looking Escorts In 24/7 Delhi NCR
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
Speech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptxSpeech, hearing, noise, intelligibility.pptx
Speech, hearing, noise, intelligibility.pptx
 
Forensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptxForensic limnology of diatoms by Sanjai.pptx
Forensic limnology of diatoms by Sanjai.pptx
 
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptxLIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
LIGHT-PHENOMENA-BY-CABUALDIONALDOPANOGANCADIENTE-CONDEZA (1).pptx
 
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptxSTOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
STOPPED FLOW METHOD & APPLICATION MURUGAVENI B.pptx
 
Pests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdfPests of castor_Binomics_Identification_Dr.UPR.pdf
Pests of castor_Binomics_Identification_Dr.UPR.pdf
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 
User Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather StationUser Guide: Magellan MX™ Weather Station
User Guide: Magellan MX™ Weather Station
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTXALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
ALL ABOUT MIXTURES IN GRADE 7 CLASS PPTX
 
Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)Carbon Dioxide Capture and Storage (CSS)
Carbon Dioxide Capture and Storage (CSS)
 
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 GenuineCall Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
Call Girls in Majnu Ka Tilla Delhi 🔝9711014705🔝 Genuine
 

Mobile Vision Algorithm Localizes Skewed Nutrition Labels

  • 1. The 2014 International Conference on Image Processing, Computer Vision, & Pattern Recognition (IPCV 2014) An Algorithm for Mobile Vision-Based Localization of Skewed Nutrition Labels that Maximizes Specificity Vladimir Kulyukin Department of Computer Science Utah State University Logan, UT, USA vladimir.kulyukin@usu.edu Christopher Blay Department of Computer Science Utah State University Logan, UT, USA chris.b.blay@gmail.com Abstract—An algorithm is presented for mobile vision-based localization of skewed nutrition labels on grocery packages that maximizes specificity, i.e., the percentage of true negative match- es out of all possible negative matches. The algorithm works on frames captured from the smartphone camera’s video stream and localizes nutrition labels skewed up to 35-40 degrees in either direction from the vertical axis of the captured frame. The algo- rithm uses three image processing methods: edge detection, line detection, and corner detection. The algorithm targets medium- to high-end mobile devices with single or quad-core ARM sys- tems. Since cameras on these devices capture several frames per second, the algorithm is designed to minimize false positives ra- ther than maximize true ones, because, at such frequent frame capture rates, it is far more important for the overall perfor- mance to minimize the processing time per frame. The algorithm is implemented on the Google Nexus 7 Android 4.3 smartphone. Evaluation was done on 378 frames, of which 266 contained NLs and 112 did not. The algorithm’s performance, current limita- tions, and possible improvements are analyzed and discussed. Keywords—computer vision; nutrition label localization; mobile computing; text spotting; nutrition management I. Introduction Many nutritionists and dieticians consider proactive nutri- tion management to be a key factor in reducing and control- ling cancer, diabetes, and other illnesses related to or caused by mismanaged or inadequate diets. According to the U.S. Department of Agriculture, U.S. residents have increased their caloric intake by 523 calories per day since 1970. Misman- aged diets are estimated to account for 30-35 percent of cancer cases [1]. A leading cause of mortality in men is prostate can- cer. A leading cause of mortality in women is breast cancer. Approximately 47,000,000 U.S. residents have metabolic syndrome and diabetes. Diabetes in children appears to be closely related to increasing obesity levels. The current preva- lence of diabetes in the world is estimated to be at 2.8 percent [2]. It is expected that by 2030 the diabetes prevalence number will reach 4.4 percent. Some long-term complications of dia- betes are blindness, kidney failure, and amputations. Nutrition labels (NLs) remain the main source of nutritional information on product packages [3, 4]. Therefore, enabling customers to use computer vision on their smartphones will likely result in a greater consumer awareness of the caloric and nutritional content of purchased grocery products. Figure 1. Skewed NL with vertical axis In our previous research, we developed a vision-based lo- calization algorithm for horizontally or vertically aligned NLs on smartphones [5]. The new algorithm, presented in this paper, improves our previous algorithm in that it handles not only aligned NLs but also those that are skewed up to 35-40 degrees from the vertical axis of the captured frame. Figure 1 shows an example of such a skewed NL with the vertical axis of the captured frame denoted by a white line. 
provement designed and implemented in the new algorithm is the rapid detection of the presence of an NL in each frame, which improves the run time: the new algorithm fails fast and proceeds to the next frame from the video stream. The new algorithm targets medium- to high-end mobile devices with single or quad-core ARM systems. Since cameras on these devices capture several frames per second, the algorithm is designed to minimize false positives rather than maximize true ones, because, at such frequent frame capture rates, it is far more important to minimize the processing time per frame.
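To make the fail-fast design concrete, the sketch below shows one way such a per-frame loop can be organized in Python with OpenCV. It is a minimal illustration under our own assumptions: the evidence test (a minimum count of Hough lines) and all thresholds are ours, not the authors' implementation.

import numpy as np
import cv2

MIN_LINES = 10  # assumed evidence threshold, not taken from the paper

def has_nl_evidence(frame):
    """Cheap per-frame test: enough straight lines to suggest an NL grid."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 300)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 150)
    return lines is not None and len(lines) >= MIN_LINES

def run(capture):
    """Fail-fast loop: frames without NL evidence are dropped immediately."""
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if not has_nl_evidence(frame):
            continue  # fail fast; move on to the next frame
        # ... full localization (Section III) would run here ...

A caller would pass any cv2.VideoCapture source, e.g. run(cv2.VideoCapture(0)).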
The remainder of our paper is organized as follows. In Section II, we present our previous work on accessible shopping and nutrition management to give the reader a broader context of the research and development presented in this paper. In Section III, we outline the details of our algorithm. In Section IV, we present the experiments with our algorithm and discuss our results. In Section V, we present our conclusions and outline several directions for future work.

II. Previous Work

In 2006, our laboratory began to work on ShopTalk, a wearable system for independent blind supermarket shopping [6]. In 2008-2009, ShopTalk was ported to the Nokia E70 smartphone connected to a Bluetooth barcode pencil scanner [7]. In 2010, we began our work on computer vision techniques for eyes-free barcode scanning [8]. In 2013, we published several algorithms for localizing skewed barcodes as well as horizontally or vertically aligned NLs [5, 9]. The algorithm presented in this paper improves the previous NL localization algorithm by relaxing the NL alignment constraint by up to 35 to 40 degrees in either direction from the vertical orientation axis of the captured frame.

Modern nutrition management system designers and developers assume that users understand how to collect nutritional data and can be triggered into data collection with digital prompts (e.g., email or SMS). Such systems often underperform, because many users find it difficult to integrate nutrition data collection into their daily activities due to lack of time, motivation, or training. Eventually they turn off or ignore digital stimuli [10].

To overcome these challenges, in 2012 we began to develop a Persuasive NUTrition Management System (PNUTS) [5]. PNUTS seeks to shift current research and clinical practices in nutrition management toward persuasion, automated nutritional information extraction and processing, and context-sensitive nutrition decision support. PNUTS is based on a nutrition management approach inspired by the Fogg Behavior Model (FBM) [10], which states that motivation alone is insufficient to stimulate target behaviors. Even a motivated user must have both the ability to execute a behavior and a trigger to engage in that behavior at an appropriate place or time.

Another frequent assumption, which is not always accurate, is that consumers and patients are either more skilled than they actually are or that they can be quickly trained to obtain the required skills. Since training is difficult and time consuming, a more promising path is to make target behaviors easier and more intuitive to execute for the average smartphone user. Vision-based extraction of nutritional information from NLs on product packages is a fundamental step in making proactive nutrition management easier and more intuitive, because it improves the user's ability to engage in the target behavior of collecting and processing nutritional data.

III. Skewed NL Localization Algorithm

A. Detection of Edges, Lines, and Corners

Our NL detection algorithm uses three image processing methods: edge detection, line detection, and corner detection. Edge detection transforms images into bitmaps where every pixel is classified as belonging or not belonging to an edge. The algorithm uses the Canny edge detector (CED) [11].
After the edges are detected (see Fig. 2), the image is processed with the Hough Transform (HT) [12] to detect lines (see Fig. 3). The HT finds curves that can be parameterized in the polar coordinate space; for straight lines, every edge point (x, y) votes for the parameter pairs (ρ, θ) that satisfy ρ = x cos θ + y sin θ, and peaks in the vote accumulator correspond to detected lines.

Figure 2. Original NL (left); NL with edges (right)

Figure 3. NL with edges (left); NL with lines (right)
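As a minimal sketch of this edge- and line-detection front end, the OpenCV calls below mirror the two steps just described. The hysteresis and vote thresholds are illustrative assumptions rather than the parameters used in the paper.

import numpy as np
import cv2

def detect_edges_and_lines(gray):
    """Canny edge map, then standard Hough line detection in (rho, theta) form."""
    edges = cv2.Canny(gray, 100, 300)  # hysteresis thresholds: assumed values
    # Each returned line is (rho, theta) with rho = x*cos(theta) + y*sin(theta).
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 150)  # vote threshold: assumed
    return edges, [] if lines is None else [tuple(l[0]) for l in lines]

A frame would be prepared with, e.g., gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) before the call.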
Corner detection is done primarily for text spotting, because text segments tend to contain many distinct corners; image segments with higher concentrations of corners are therefore likely to contain text. Corners are detected with the dilate-erode method [13] (see Fig. 4). Two stages of the dilate-erode method with different 5x5 kernels are applied. The first stage uses a 5x5 cross dilate kernel for horizontal and vertical expansion and then a 5x5 diamond erode kernel for diagonal shrinking. The resulting image is compared with the original, and the pixels that lie in the corners of an aligned rectangle are found.

The second stage uses a 5x5 X-shape dilate kernel to expand in the two diagonal directions. A 5x5 square kernel is used next to erode the image and to shrink it horizontally and vertically. The resulting image is compared with the original, and the pixels that lie in a 45-degree corner are identified. The corners from both stages are combined into a final set of detected corners.

In Fig. 4, the top sequence of images corresponds to stage one, in which the cross and diamond kernels are used to detect aligned corners. The bottom sequence corresponds to stage two, in which the X-shape and square kernels are used to detect 45-degree corners. Step one shows the original input of each stage, step two is the image after dilation, step three is the image after erosion, and step four is the difference between the original and eroded versions. The resulting corners are outlined in red in each step to show how the dilate-erode operations modify the input.

Figure 4. Corner detection steps

Fig. 5 demonstrates the dilate-erode method on an image segment that contains text. The dilate steps are substantially whiter than their inputs, because the appropriate kernel is used to expand white pixels. The erode steps then partially reverse this whitening effect by expanding darker pixels. The detected corners are the pixels with the largest differences between the original image and the result image.

Figure 5. Corner detection for text spotting

Our previous NL localization algorithm [5] was based on the assumption that the NL exists in the image and is horizontally or vertically aligned with the smartphone's camera. Unfortunately, these conditions sometimes do not hold in the real world due to shaking hands or failing eyesight. The exact problem that the new algorithm addresses is twofold. Does a given input image contain a skewed NL? And, if so, within which aligned rectangular area can the NL be localized? In this investigation, a skewed NL is one that has been rotated away from the vertical alignment axis by up to 35 to 40 degrees in either direction, i.e., left or right. An additional objective is to decrease the processing time for each frame to about one second.

B. Corner Detection and Analysis

Before the proper NL localization begins, a rotation correction step is performed to align inputs that may be only nearly aligned. This correction takes advantage of the high number of horizontal lines found within NLs. All detected lines that are within 35 to 40 degrees of horizontal in either direction (i.e., up or down) are used to compute an average rotation, and the appropriate correcting rotation is then applied.

Corner detection is executed after the rotation. The dilate-erode corner detector is applied to retrieve a two-dimensional bitmap in which true (white) pixels correspond to detected corners and all other (false) pixels are black. Fig. 5 (right) shows the corners detected in the frame shown in Fig. 5 (left). The dilate-erode corner detector is used specifically because of its high sensitivity to contrasted text, which is why we assume that the region bounded by the detected corners contains a large amount of text.
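The two-stage detector can be sketched as follows. The cross, diamond, X-shape, and square kernels are the ones named in the text; the binarization threshold and the exact way the two stage differences are combined are our assumptions, since the paper does not specify them.

import numpy as np
import cv2

def corner_map(gray):
    """Two-stage dilate-erode corner detection with 5x5 structuring elements."""
    cross = cv2.getStructuringElement(cv2.MORPH_CROSS, (5, 5))
    diamond = np.array([[0, 0, 1, 0, 0],
                        [0, 1, 1, 1, 0],
                        [1, 1, 1, 1, 1],
                        [0, 1, 1, 1, 0],
                        [0, 0, 1, 0, 0]], dtype=np.uint8)
    x_shape = (np.eye(5) + np.fliplr(np.eye(5))).clip(0, 1).astype(np.uint8)
    square = np.ones((5, 5), dtype=np.uint8)

    # Stage 1: cross dilation, then diamond erosion -> aligned-rectangle corners.
    stage1 = cv2.erode(cv2.dilate(gray, cross), diamond)
    # Stage 2: X-shape dilation, then square erosion -> 45-degree corners.
    stage2 = cv2.erode(cv2.dilate(gray, x_shape), square)

    # Compare each stage with the original; strong differences mark corners.
    diff = cv2.max(cv2.absdiff(stage1, gray), cv2.absdiff(stage2, gray))
    _, corners = cv2.threshold(diff, 40, 255, cv2.THRESH_BINARY)  # assumed threshold
    return corners  # white pixels correspond to detected corners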
Areas of the input image that are not in focus do not produce many corner detections and tend not to lie within the computed projection boundaries.

Two projections are computed after the corners are detected: the sums of the true (corner) pixels for each row and for each column. The row projection has an entry for each image row, and the column projection has an entry for each image column. The purpose of these projections is to determine the top, bottom, left, and right boundaries of the region in which most corners lie. The projection values are averaged, and a projection threshold is set to twice the average. Once the threshold is selected, the first and last indexes of each projection that exceed the threshold are selected as the boundaries of that projection.
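In code, this projection-and-threshold rule reduces to a few NumPy reductions. The sketch below assumes the corner map is a binary image with white (255) corner pixels, as produced by a detector like the one above.

import numpy as np

def projection_bounds(corners):
    """Row/column corner projections; boundaries where projection > 2 * mean."""
    binary = (corners > 0).astype(np.int32)
    row_proj = binary.sum(axis=1)  # one entry per image row
    col_proj = binary.sum(axis=0)  # one entry per image column

    def first_last_above(proj):
        idx = np.flatnonzero(proj > 2 * proj.mean())  # threshold = twice the average
        return (None, None) if idx.size == 0 else (int(idx[0]), int(idx[-1]))

    top, bottom = first_last_above(row_proj)
    left, right = first_last_above(col_proj)
    return top, bottom, left, right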
Figure 5. NL (left); detected corners (right)

C. Selection of Boundary Lines

After the four projection boundaries have been computed, the next step is to select the Hough lines that are closest to those boundaries. In the two images of Fig. 6, the four light blue lines are drawn on the basis of the four corner projection counts, and the dark blue lines are the lines detected by the Hough transform. In Fig. 6 (left), the bottom light blue line is initially chosen conservatively where the row corner projection drops below the threshold. If there is evidence that some corners are present below the initially selected bottom line, the bottom line is moved as far down as possible, as shown in Fig. 6 (right).

Figure 6. Initial boundaries (left); Final boundaries (right)

The bounded area is not always perfectly rectangular, which makes integration with later analysis stages, where a rectangular area is expected, less straightforward. To overcome this problem, a rectangle is placed around the selected Hough boundary lines. After the four intersection coordinates are computed, their components are compared and combined to find the smallest rectangle that fits around the bounded area. This rectangle is the final result of the NL localization algorithm. As stated before, the four corners found by the algorithm can be passed to other algorithms such as row dividing, word splitting, and OCR; these are beyond the scope of this paper. Fig. 7 shows a skewed NL localized by our algorithm.
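This final bounding step can be sketched as follows: intersect pairs of boundary lines given in (ρ, θ) form, then take the axis-aligned extremes of the four intersection points. This is our reading of the step, not the authors' code.

import numpy as np

def intersect(l1, l2):
    """Intersection of two Hough lines given as (rho, theta) pairs."""
    (r1, t1), (r2, t2) = l1, l2
    a = np.array([[np.cos(t1), np.sin(t1)],
                  [np.cos(t2), np.sin(t2)]])
    b = np.array([r1, r2])
    x, y = np.linalg.solve(a, b)  # raises LinAlgError for parallel lines
    return x, y

def enclosing_rectangle(top, bottom, left, right):
    """Smallest axis-aligned rectangle around the four boundary-line corners."""
    pts = [intersect(top, left), intersect(top, right),
           intersect(bottom, left), intersect(bottom, right)]
    xs, ys = zip(*pts)
    return min(xs), min(ys), max(xs), max(ys)  # (x0, y0, x1, y1)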
Figure 7. Localized Skewed NL

IV. Experiments & Results

A. Experiment Design

We assembled 378 images captured with a Google Nexus 7 Android 4.3 smartphone during a typical shopping session at a local supermarket. Of these images, 266 contained an NL and 112 did not. Our skewed NL localization algorithm was implemented and tested on the same platform with these images.

Figure 8. Complete (left) and partial (right) true positives

We manually categorized the results into five categories: complete true positives, partial true positives, true negatives, false positives, and false negatives. A complete true positive is an image where a complete NL was localized. A partial true positive is an image where only a part of the NL was localized by the algorithm. Fig. 8 shows examples of complete and partial true positives.

Fig. 9 shows another example of complete and partial true positives. The image on the left was classified as a complete true positive, because the part of the NL that was not detected is insignificant and will likely be recovered through simple padding in subsequent processing. The image on the right, on the other hand, was classified as a partial true positive: while the localized area does contain most of the NL, some essential text in the left part of the NL is excluded, which will likely cause failure in subsequent processing. In Fig. 10, the left image technically does not include the entire NL, because the list of ingredients is only partially included; however, we classified it as a complete true positive, since it includes the entire nutrition facts table. The right image of Fig. 10, on the other hand, is classified as a partial true positive, because some parts of the nutrition facts table are not included in the detected area.

Figure 9. Complete (left) and partial (right) true positives

Figure 10. Complete (left) and partial (right) true positives

B. Results

Of the 266 images that contained NLs, 83 were classified as complete true positives and 27 were classified as partial true positives, which gives a total true positive rate of 42% and a false negative rate of 58%. All test images with no NLs were classified as true negatives. The remainder of our analysis was done via precision, recall, specificity, and accuracy. Precision is the percentage of complete true positive matches out of all true positive matches. Recall is the percentage of true positive matches out of all possible positive matches. Specificity is the percentage of true negative matches out of all possible negative matches. Accuracy is the percentage of true matches out of all possible matches.

Table I. NL Localization Results

  PREC     TR       CR       PTR      SP      ACC
  0.7632   0.422    0.3580   0.1475   1.0     0.5916

Table I gives the NL localization results, where PREC stands for precision, TR for total recall, CR for complete recall, PTR for partial recall, SP for specificity, and ACC for accuracy. While the total and complete recall numbers are somewhat low, this is a necessary trade-off of maximizing specificity. Recall from Section I that we designed our algorithm to maximize specificity: the algorithm would rather miss an NL that is present than report one in an image where none is present. As we argued above, lower recall and precision may not matter much because of the fast rate at which input images are processed on target devices, but there is definitely room for improvement.
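As a sanity check on the metric definitions above, here they are as code. The counters are generic; the way partial true positives fold into the totals follows our reading of the paper and should be treated as an illustrative assumption.

def metrics(complete_tp, partial_tp, tn, fp, fn):
    """Precision, total recall, specificity, and accuracy as defined in the text.
    Assumes nonzero denominators."""
    tp = complete_tp + partial_tp
    precision = complete_tp / tp                # complete TPs among all TPs
    total_recall = tp / (tp + fn)               # TPs among all actual positives
    specificity = tn / (tn + fp)                # TNs among all actual negatives
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # true matches among all matches
    return precision, total_recall, specificity, accuracy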
C. Limitations

The majority of false negative matches were caused by blurry images, which result from poor camera focus and instability. Both the Canny edge detector and the dilate-erode corner detector require rapid, contrasting changes to identify the key points and lines of interest, and these points and lines are meant to correspond directly to text and NL borders. Such data cannot be retrieved from blurry images, which results in run-time detection failures. The only recourse for blurry inputs is improved camera focus and stability, both of which are hardware problems outside the scope of this algorithm. The current implementation on the Android platform attempts to force focus at the image center, but the ability to request camera focus is not present in older Android versions. Over time, as device cameras improve and more devices run newer versions of Android, this limitation will have less impact on recall, but it will never disappear entirely.

Bottles, bags, cans, and jars (see Fig. 11) account for a large share of the false negatives due to Hough line detection difficulties. One possibility for getting around this limitation is a more rigorous line detection step in which a segmented Hough transform is performed and regions that contain connecting detected lines are grouped together. These grouped regions could then be used to warp a curved image into a rectangular area for further analysis.

Smaller grocery packages (see Fig. 12) tend to have irregular NLs that place a large amount of information into tiny spaces. NLs with irregular layouts present an extremely difficult problem for analysis. Our algorithm better handles more traditional NL layouts with generally empty surrounding areas. As better analysis of corner projections and Hough lines is integrated into our algorithm, it will become possible to classify inputs as traditional or irregular.
If this classification can work reliably, the method could switch to a much slower, generalized localization to produce better results for irregular layouts while still quickly returning results for the more common ones.

Figure 11. NL with curved lines

Figure 12. Irregular NLs

V. Conclusions

We have made several interesting observations during our experiments. The row and column projections have two distinct patterns. The row projection tends to produce evenly spaced short spikes, one for each line of text within the NL, while the column projection tends to contain one very large spike where the NL begins at the left, due to the sudden influx of detected text. We have not yet performed an in-depth analysis of these patterns; however, the projection data were collected for each processed image. We plan to investigate these patterns further, which will likely allow for run-time detection and correction of inputs at various rotations. For example, the column projections could be used for greater accuracy in determining the left and right bounds of the NL, while the row projections could be used by later analysis steps such as row division. Certain projection profiles could eventually be used to select customized localization approaches at run time.

During our experiments with and iterative development of this algorithm, we noted several possible improvements that could positively affect the algorithm's performance.

First, since input images are generally not square, the HT returns more results for lines in the longer dimension, because they are more likely to pass the threshold. Consequently, specifying different thresholds for the two dimensions and combining them for various rotations may produce more consistent results.

Second, since only those Hough lines that are nearly vertical or horizontal are of use to this method, improvements can be made by allocating accumulator bins only for those θ and ρ combinations that are considered important (see the sketch after this list). Fewer bins mean less memory to track them and fewer tests to determine which bins need to be incremented for a given input.

Third, both row and column corner projections tend to produce distinct patterns that could be used to determine better boundaries. After collecting a large number of typical projections, further analysis can be performed to find generalizations resulting in a faster method for boundary selection.

Fourth, in principle, a much more intensive HT method can be developed that divides the image into a grid of smaller segments and performs a separate HT within each segment. One advantage of this approach is the ability to look for skewed, curved, or even zigzagging lines across segments that could be connected into a longer line. While the performance penalty of this method could be quite high, it could allow for the detection and de-warping of oddly shaped NLs.

Finally, a more careful analysis of the Hough lines found during the early rotation correction could allow us to detect and localize NLs at all possible rotations, not just skewed ones.
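The second improvement, restricting the HT to near-vertical and near-horizontal angles, can be expressed with the min_theta/max_theta parameters that recent OpenCV versions expose for cv2.HoughLines. The 40-degree window below matches the paper's skew range; the vote threshold is an assumption.

import numpy as np
import cv2

def near_axis_lines(edges, window_deg=40, votes=150):
    """Hough lines restricted to near-vertical and near-horizontal angles."""
    w = np.deg2rad(window_deg)
    # OpenCV measures theta from the x-axis to the line's normal, so theta = 0
    # is a vertical line and theta = pi/2 is a horizontal line.
    vertical = cv2.HoughLines(edges, 1, np.pi / 180, votes,
                              min_theta=0.0, max_theta=w)
    horizontal = cv2.HoughLines(edges, 1, np.pi / 180, votes,
                                min_theta=np.pi / 2 - w,
                                max_theta=np.pi / 2 + w)
    # Note: the symmetric near-vertical band just below pi is omitted for brevity.
    out = []
    for group in (vertical, horizontal):
        if group is not None:
            out.extend(tuple(l[0]) for l in group)
    return out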
The U.S. Food and Drug Administration recently proposed changes to the design of NLs on product packages [14]. The new design is expected to change how serving sizes are calculated and displayed. Percent daily values are expected to shift to the left side of the NL, which allegedly will make them easier to read. The new design will also require information about added sugars as well as the counts for vitamin D and potassium. We would like to emphasize that this redesign, which is expected to take at least two years, will not impact the proposed algorithm, because the main tabular components of the new NL design will remain the same. The nutritional information in the new NLs will still be presented textually in rows and columns. Therefore, the corner and line detection and their projections will work as they do on the current NL design.

References

[1] Anding, R. Nutrition Made Clear. The Great Courses, Chantilly, VA, 2009.

[2] Rubin, A. L. Diabetes for Dummies. 3rd Edition, Wiley Publishing, Inc., Hoboken, New Jersey, 2008.

[3] Nutrition Labeling and Education Act of 1990. http://en.wikipedia.org/wiki/Nutrition_Labeling_and_Education_Act_of_1990.
[4] Food Labelling to Advance Better Education for Life. Avail. at www.flabel.org/en.

[5] Kulyukin, V., Kutiyanawala, A., Zaman, T., and Clyde, S. "Vision-based localization and text chunking of nutrition fact tables on Android smartphones." In Proc. International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV 2013), pp. 314-320, ISBN 1-60132-252-6, CSREA Press, Las Vegas, NV, USA, 2013.

[6] Nicholson, J. and Kulyukin, V. "ShopTalk: Independent Blind Shopping = Verbal Route Directions + Barcode Scans." In Proc. 30th Annual Conference of the Rehabilitation Engineering and Assistive Technology Society of North America (RESNA 2007), June 2007, Phoenix, Arizona. Avail. on CD-ROM.

[7] Kulyukin, V. and Kutiyanawala, A. "Accessible Shopping Systems for Blind and Visually Impaired Individuals: Design Requirements and the State of the Art." The Open Rehabilitation Journal, ISSN 1874-9437, vol. 2, 2010, pp. 158-168, DOI: 10.2174/1874943701003010158.

[8] Kulyukin, V., Kutiyanawala, A., and Zaman, T. "Eyes-Free Barcode Detection on Smartphones with Niblack's Binarization and Support Vector Machines." In Proc. 16th International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV 2012), vol. I, pp. 284-290, CSREA Press, July 16-19, 2012, Las Vegas, Nevada, USA. ISBN: 1-60132-223-2, 1-60132-224-0.

[9] Kulyukin, V. and Zaman, T. "Vision-Based Localization of Skewed UPC Barcodes on Smartphones." In Proc. International Conference on Image Processing, Computer Vision, and Pattern Recognition (IPCV 2013), pp. 344-350, ISBN 1-60132-252-6, CSREA Press, Las Vegas, NV, USA, 2013.

[10] Fogg, B. J. "A behavior model for persuasive design." In Proc. 4th International Conference on Persuasive Technology, Article 40, ACM, New York, USA, 2009.

[11] Canny, J. F. "A computational approach to edge detection." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, 1986, pp. 679-698.

[12] Duda, R. O. and Hart, P. E. "Use of the Hough transformation to detect lines and curves in pictures." Comm. ACM, vol. 15, pp. 11-15, January 1972.

[13] Laganiere, R. OpenCV 2 Computer Vision Application Programming Cookbook. Packt Publishing Ltd, 2011.

[14] Tavernise, S. "New F.D.A. nutrition labels would make 'serving sizes' reflect actual servings." New York Times, Feb. 27, 2014.