Multimodal Deep Convolutional Neural Networks for Non-destructive
Papaya Fruit Ripeness Classification Using Digital
and Hyperspectral Imaging Systems
Cinmayii A. Garillos-Manliguez (秦瑪怡)
Ph.D. Candidate
Prof. John Y. Chiang (蔣依吾 教授)
Advisor, Image Processing Laboratory
Department of Computer Science and Engineering
Outline
• Introduction
• Review of Literature
• Imaging Systems
• Hyperspectral Image
Classification
• Deep Learning
• Multimodality
• Fruit Ripeness Classification
• Materials and Methods
• Data Gathering
• Digital Image Acquisition
• Hyperspectral Data Acquisition
• Computing Environment
• Performance Evaluation Metrics
• Papaya Fruit Ripeness
Classification
• Multimodal Deep Learning
Framework
• Multimodality via Feature
Concatenation
• Multimodality via Late Fusion
• Multimodality via End-to-end
Deep Convolutional Neural
Network
• Benchmarking of Results
• Conclusion
INTRODUCTION
Chapter 1
Introduction
 Artificial intelligence, computer vision, and their
applications involve highly interdisciplinary research.
 The impact of these areas in computer science
continues to influence the technological advances in
our society today.
 Image classification is a challenging task, yet it plays
a valuable role in computer vision since it is the basis
of higher-order operations.
Introduction
Image Classification
 Image classification has gained much attention in
various applications including biomedical, medical
and agricultural applications.
 It performs a critical function in cancer detection and
disease diagnosis because the results are life-
changing.
 The classification result is also significant in the
agricultural sector because it determines the market
distribution of the produce which affects the
potential gain or loss in the local or national revenue.
Introduction
 Images used in these applications come from
sensing or imaging devices that acquire the
electromagnetic energy emitted by the target
object.
 PET images of bone scans from gamma ray imaging
 X-ray images from X-ray tomography
 UV light images used in fluorescence microscopy
 Spaceborne radar images from microwave bands
 Magnetic resonance imaging (MRI) images in medicine
 Images convey information through computer vision:
size, shape, patterns, and other salient features from
RGB or hyperspectral images.
Fig. 1.1. The electromagnetic spectrum and the visible light range (Lifted from [2]).
Introduction
 Image classification techniques are successful in
unimodal or single modality implementations.
 Modality in human-computer interaction is defined
as the categorization of a distinct individual channel
of sensory input or output between a computer and
a human.
 Multimodal learning, a recent development in AI,
involves multiple data types or input modalities to
perform a given task.
Introduction
Fruit Ripeness Classification
 Fruits have a vital role in human health since they are
abundant in antioxidants, nutrients, vitamins,
minerals, and fiber that the human body needs.
 Consumers prefer high quality, blemish-free, and
safe fresh fruits.
 Papaya, Carica papaya L., a major tropical fruit
worldwide, gives many health and economic
benefits.
Introduction
 Fruit ripeness/maturity – physiological development as
the fruit ripens even after harvest.
 Under-mature fruit – reduced taste quality
 Over-mature fruit – overripening during transport,
prone to mechanical damage, increased susceptibility
to decay and microbial growth
 Critical factor to preserve post-harvest life, reduce
waste, and increase fruit quality
 Food losses - more than 40% occur at postharvest and
processing levels in developing countries, while more
than 40% occur at retail and consumer levels in
industrialized countries
Introduction
Efficient Fruit Maturity Assessment and Grading
 Well-trained personnel with experience – human error
 Traditional laboratory processes – destructive and time-
consuming
 Non-invasive methods through computer vision (CV)
techniques, artificial intelligence (AI), sensors, and imaging
technologies have provided promising results while keeping
the fruit intact, which helps reduce post-harvest
losses
 Hyperspectral imaging has become an emerging
scientific instrument for non-destructive fruit and
vegetable quality assessment in recent years
Main Objective
This dissertation primarily aims to propose a high-
performance multimodal deep convolutional neural
network framework for non-destructive papaya fruit
ripeness classification that makes use of digital and
hyperspectral imaging technology.
Key Contributions
(1) An original dataset of hyperspectral data cubes and digital (RGB) photos of papaya fruits, which
are categorized into six ripeness stages, acquired within a laboratory environment;
(2) An analysis of multimodality in deep learning through implementing a feature concatenation
approach and late fusion of imaging-specific deep CNNs for an agricultural application, particularly
non-destructive papaya fruit ripeness classification, assessed in terms of F1-score, accuracy, top-
2 error rate, precision, recall, training time, and average prediction time; and
(3) A multimodal deep CNN architecture producing outstanding performance by discovering
abstract representations of two modalities, which are an RGB image and a hyperspectral
reflectance data cube of papaya fruit.
Summary and Organization
The remaining parts of this dissertation are organized as follows:
Section (2) presents the studies related to this work;
Section (3) provides a description of a multimodal deep
learning architecture with feature concatenation and
late fusion implementations for papaya fruit
classification using hyperspectral reflectance data set
and RGB images;
Section (4) discusses the experiments, results of the
experimental work and benchmarking; and
Section (5) concludes this work and suggests future
research directions.
Review of Literature
Chapter 2
Review of Literature
• Machine learning is a primary contributor to the
advancement of AI.
• Deep learning workflows are now highly streamlined,
despite the challenges in cost and infrastructure.
• For instance, its applications in agriculture include
detection and classification of fruits, vegetables,
crops, and weeds in the field or inside the laboratory
or factories [5], [17], [18], estimation of fruit maturity
or ripeness using transfer learning [19], and detection
of bruise or damages on fruits [20].
Review of Literature
Imaging Systems
Imaging technologies like hyperspectral imaging (HSI)
system and visible-light imaging have been widely
investigated as non-destructive options for food
production and agricultural applications [11][12] for:
• fruit maturity or ripeness estimation [11],
• automatic fruit grading [9], [36],
• disease detection [26], [27],
• disease classification [26], [28], and
• agro-food product inspection [21].
Review of Literature
Imaging Systems
Table 2.1. Summary of grading characteristics that can be identified and assessed using visible and hyperspectral imaging, based on the review papers [15], [21], [22].

Hyperspectral Imaging:
• Soluble solid content
• Firmness
• Dry matter
• Water content
• Total soluble solids
• Titratable acidity
• Defect (bruise)
• Decay (sour skin, canker)
• Contamination
• Insect damage
• Mealiness
• Color
• Maturity (pre-harvest time)
• Anthocyanins
• Total chlorophyll
• Lycopene
• Total phenols
• Total glucosinolates
• Polyphenol oxidase activity
• Ascorbic acid
• Carotenoid

Visible Imaging:
• Grading by color (color evaluation)
• Grading by external quality
• Sorting by external quality
• Color and size classification
• Shape grading
• Irregularity evaluation and sorting
• Length
• Width
• Thickness
• Texture
• Firmness
Review of Literature
Hyperspectral Image Classification (HIC)
• Hyperspectral imaging is a powerful tool in
acquisition of highly dimensional data with a
high level of precision
• HIC uses hyperspectral band segments (H) in
m × n × h three-dimensional images, where
m is the length and n is the width of the
spatial dimension and h is the number of
bands
• In remote sensing applications, each band is
composed of points in a 3D matrix (x, y, and z
positions): its coordinates in longitude and
latitude and an intensity value that relates
to its radiance.
Review of Literature
Single-image classification
• Contrary to remote sensing where one
image is composed of many labels, a
hypercube containing the entire object to be
classified may only contain one label or class.
• This is the case in this study: each HS data
cube is categorized into one maturity stage,
a very challenging task because of the data's
high dimensionality, which demands large
memory capacity and greater computing
resources.
Review of Literature
Single-image classification
• Damage detection on blueberries [20]:
• spectral data ranging from 328.81 nm up to
1113.54 nm
• reduce the spectral channels from 1002 to
151 by subsampling, and then reduce the
image size to 32x32 pixels to lower
computational cost
• Fruit classification [39]
• Fruit ripeness such as persimmon [40],
strawberry [41], [42], blueberry [43],
cherry [44], kiwi [45], banana [46], and
mango [47], [48].
Review of Literature
Related Works on Single-image Classification for Fruit Ripeness
Each entry lists the fruit and references, wavelength range (nm), methodology, and classification result.
• Persimmon [40]: 400 – 1000 nm; Linear Discriminant Analysis (LDA) on three feature wavelengths (4 ripeness stages); 95.3%
• Strawberry [41], [42]: 380 – 1030 and 874 – 1734 nm [41], 370 – 1015 nm [42]; SVMs on wavelengths selected by PCA (3 ripeness stages) [41], AlexNet on wavelengths chosen by PCA (2 ripeness stages) [42]; 85% [41], 98.6% [42]
• Cherry [44]: 874 – 1734 nm; genetic algorithm (GA) and multiple linear regression (MLR); 96.4%
• Mango [47], [48]: 411.3 – 867.0 nm [48]; Unmanned Ground Vehicle (UGV) with GA and SVM [47], PLS-DA and CNN [48]; 0.69 [47], 0.97 [48]
• Papaya (this study): 470 – 900 nm; Multimodal Deep Convolutional Neural Networks
Review of Literature
Deep Learning
• Deep learning is the prime mover that has
brought breakthroughs in processing single
modalities whether these are sequential like
text or speech/audio or discrete such as
images and videos.
• The convolutional layers in a deep
convolutional neural network (deep CNN)
enable it to learn the patterns that
identify an image with minimal human
feature engineering.
Deep learning
AlexNet
• Consists of eight layers; relatively fast to train and well suited to real-world
applications
• ReLU nonlinearity, overlapping pooling, dropouts
VGG16 and VGG19
• State-of-the-art methods in large-scale image recognition tasks
• Very small 3×3 convolution filters
• Increased depth, more representations, improved network accuracy
ResNet
• Implements residual learning to resolve the degradation problem
• Highly modularized networks
• Stack of blocks (layer group)
ResNeXt
• Homogeneous, multi-branch architecture
• Assembly of repeated building blocks with the same configuration
• Introduced a new dimension: cardinality
MobileNet
• Mobile and embedded applications
• Depth-wise separable convolutions (DSC)
MobileNetV2
• Memory-efficient implementation
• Inverted residual structure
• Thin bottleneck layers have shortcuts, and lightweight DSCs are used
Key features of the
deep convolutional
neural networks
used in this study
Review of Literature
Multimodality
• Deep learning in multimodality is
characterized by its ability to automatically
extract specialized features from an input
instance.
• Feature concatenation (FC) and ensemble
method (EM) are two general approaches to
implement multimodal learning schemes.
Review of Literature
Multimodality
• Feature concatenation (FC): A single
feature vector is generated by
integrating data from several different
modalities at an early phase
• Ensemble method (EM): a late-fusion
method that trains and learns the
characteristics of each modality before
integrating the results.
Review of Literature
Multimodality
• Multimodality overcomes the disadvantage of RGB
images in providing 3D indicators by integrating infrared
and depth images to obtain 2D local features and a 3D
sparse point cloud for geometric verification (F1-score
of 0.77) [3].
• Multimodality is used for detection and classification of
fruits in real-time [5]. The system employs two
mechanisms, early fusion (0.799) and late fusion (0.838),
to use RGB and NIR images.
• In this work, the author implemented both feature
concatenation and late fusion learning strategies on
deep CNNs to enable multimodality.
Methodology
Chapter 3
Methodology
• Papaya fruit (Carica papaya L.)
samples, a total of 253 pieces, were
obtained from Kaohsiung Market in
Gushan District, Taiwan, and were
ripened in a controlled environment.
• Data were collected on Day 1, Day 3, Day
5, and Day 7 to track ripening
according to [59].
• Digital RGB images were captured from at
least four sides (front, back, left, and right)
and in two orientations per side:
horizontal and diagonal.
Methodology
• The Philippine National Standard
(PNS/BAFPS 33:2005) is the basis of
this classification standard for the
ground truth data [59].
• Sample images of the papaya fruits
classified based on the said maturity
stages are shown in Figure 3.2.
• Descriptions of these stages are
summarized in Table 3.1.
Digital Image Acquisition
The camera specifications used in this
setting are:
• Canon EOS 100D
• 18 megapixel resolution
• 5184x3456 pixels image size
• DIGIC 5 image processor
• Tethering cable and utility software
for remote capturing of images
Light sources are added in the laboratory
setup to illuminate the object and
eliminate shadows and other noise.
Digital Image Acquisition
• Fig. 3.4. Example RGB images of
sampled papaya fruits.
• Each row shows 10 RGB images
from different sides and orientation
of papaya fruits that are classified
into one of the six ripeness stages:
MS1 fruits in the first row, MS2 in the
second row and so on until MS6 or
overripe fruits in the last row.
• Based on [20] and due to
computational resource constraints,
the image size is reduced to 32 × 32
pixels (a brief resizing sketch is shown below).
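A minimal Python sketch of this resizing step is shown below. It is illustrative only, not the exact preprocessing script of this study, and the folder names are hypothetical.

```python
# Minimal sketch (assumed file layout): downsample the full-resolution RGB
# captures to the 32 x 32 working resolution used throughout this study.
from pathlib import Path
from PIL import Image

SRC = Path("rgb_raw")      # hypothetical folder of 5184 x 3456 captures
DST = Path("rgb_32x32")    # hypothetical output folder
DST.mkdir(exist_ok=True)

for img_path in SRC.glob("*.jpg"):
    with Image.open(img_path) as img:
        small = img.resize((32, 32))   # downsample to 32 x 32 pixels
        small.save(DST / img_path.name)
```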
Hyperspectral Image
Acquisition
• A visible/near-infrared (VNIR)
hyperspectral imaging system
with 150 bands in the 470 nm
to 900 nm wavelength range
is used in this study.
• It is composed of an imaging
unit, active cooling system,
halogen light sources, a
platform, and a computer.
• It uses linescan sensor
technology to acquire HS data
through a push-broom
scanning approach.
• Figure 3.5 is shown here.
Hyperspectral Image
Acquisition
Figure 3.6. Hyperspectral data acquisition
and preprocessing.
(a) A 3D visualization of the raw HS data,
called a hyperspectral data cube, of a
papaya sample.
(b) A 2D lateral view of a HS image with
markers on the locations of the nine 32 ×
32 × 150 patches that were extracted
from the raw data cube [20], [59], [62] .
(Figure labels indicate the apex and peduncle ends of the fruit.)
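Below is an illustrative numpy sketch of this patch-extraction step. The patch centers are hypothetical placeholders; the actual locations follow the markers in Fig. 3.6(b).

```python
# Illustrative sketch only: cut nine 32 x 32 x 150 patches out of a raw
# hyperspectral data cube, as described for Fig. 3.6(b).
import numpy as np

def extract_patches(cube, centers, size=32):
    """cube: (H, W, 150) reflectance array; centers: list of (row, col)."""
    half = size // 2
    patches = [cube[r - half:r + half, c - half:c + half, :] for r, c in centers]
    return np.stack(patches)           # shape: (n_patches, 32, 32, 150)

cube = np.random.rand(400, 600, 150)    # stand-in for one scanned papaya cube
centers = [(100, 150), (100, 300), (100, 450),   # hypothetical marker positions
           (200, 150), (200, 300), (200, 450),
           (300, 150), (300, 300), (300, 450)]
patches = extract_patches(cube, centers)
print(patches.shape)                    # (9, 32, 32, 150)
```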
Hyperspectral Image
Preprocessing
Clipping
• Outliers or extremely large
values are present in some
images due to the scene and
camera geometry [61]
• To remove the noise and to
keep the values within range
[0, 1]
Fig. 3.7. Histograms from a sample band of a HS image before (a) and after
clipping the extreme values (b).
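A minimal sketch of the clipping step, assuming the reflectance cube is already a floating-point numpy array:

```python
# Minimal sketch: cap outlier reflectance values so every band stays within
# [0, 1], as illustrated by the before/after histograms in Fig. 3.7.
import numpy as np

def clip_reflectance(cube, low=0.0, high=1.0):
    """cube: (H, W, bands) float array; values outside [low, high] are clipped."""
    return np.clip(cube, low, high)

noisy_cube = np.random.rand(32, 32, 150) * 1.3   # stand-in cube with outliers > 1
clean_cube = clip_reflectance(noisy_cube)
print(clean_cube.min(), clean_cube.max())        # bounded to [0, 1]
```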
Hyperspectral Image
Preprocessing
• The composition per ripeness stage is
576 samples labeled MS1 for both the HSd
and RGB data sets, 666 labeled MS2,
792 labeled MS3, 720 labeled MS4,
909 labeled MS5, and 945 labeled MS6,
as shown in Figure 3.8.
• The database comprises 4,608
RGB images with three channels
and 4,608 HSd with 150 bands
within the 470 nm to 900 nm
wavelength range. In total, this study
produced 9,216 data entries of HSd
and RGB with 32 × 32 pixel
dimensions for each channel.
Hyperspectral Image
Preprocessing
• A further visualization on the
spectral patterns and RGB image
set can be observed in Figures 3.9
and 3.10.
• Ripeness levels are distinct in
the 550 nm to 700 nm range.
Hyperspectral Image
Preprocessing
• A further visualization on the
spectral patterns and RGB image
set can be observed in Figures 3.9
and 3.10.
• The line graphs of the mean and
median across the bands of each
patch coincide most of the time,
which means that the data are highly
symmetric and not skewed.
Fig. 3.10. Mean and median of the reflectance values of the nine HSd
patches from a sample papaya fruit with MS1 label.
Computing Environment
• Python programming language, Anaconda, TensorFlow-GPU, Keras, and other
packages for deep learning model development
• Intel Core i5-9300H 2.40 GHz CPU (8 CPUs), NVIDIA GeForce GTX 1660 Ti 6GB
GPU, and 8 GB memory space running on Windows 10 Home 64-bit (10.0, Build
18362)
Performance Evaluation Metrics
• The multimodal models implemented in this dissertation are assessed in terms of
solution quality: precision, recall, top-2 error rate, accuracy, and F1 score, and
solution time: training time and prediction time.
• Accuracy refers to the percentage of samples that are correctly labeled among all
samples examined and is calculated as:
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = (𝑇𝑃 + 𝑇𝑁)/(𝑇𝑃 + 𝐹𝑃 + 𝐹𝑁 + 𝑇𝑁)
• Precision (P) measures how exact the classifier's predictions are, i.e., the
proportion of samples predicted as a class that truly belong to that class, while
recall (R) measures the proportion of actual samples of a class that the classifier
correctly identifies.
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛 = 𝑇𝑃/(𝑇𝑃 + 𝐹𝑃)
𝑅𝑒𝑐𝑎𝑙𝑙 = 𝑇𝑃/(𝑇𝑃 + 𝐹𝑁)
Performance Evaluation Metrics
• F1-score, a robust metric frequently used in deep learning, measures the
harmonic mean of precision and recall by:
𝐹1 (𝑅, 𝑃) = 2𝑅𝑃/(𝑅 + 𝑃)
• Top-2 error rate measures the fraction of test samples in which the true (actual)
label is not among the top two classes predicted by the model. This is analogous to
the top-1 and top-5 error rates used in the ImageNet ILSVRC.
• In traditional sorting, farmers or sorting personnel may categorize a fruit into the
ripeness stage one class lower or higher than its correct label, because there is
only a minimal difference in appearance between adjacent maturity or ripeness
stages; a short metric-computation sketch is shown below.
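An illustrative sketch of how these solution-quality metrics can be computed in Python; the labels and probabilities below are toy values, not results from this study.

```python
# Toy example of the metrics used in this dissertation: accuracy, macro
# precision/recall/F1, and the top-2 error rate over six ripeness classes.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([0, 1, 2, 3, 4, 5, 2, 1])      # ground-truth ripeness stages (toy)
probs = np.random.rand(8, 6)                      # toy class probabilities
probs /= probs.sum(axis=1, keepdims=True)
y_pred = probs.argmax(axis=1)

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

# Top-2 error rate: fraction of samples whose true label is NOT among the
# two highest-probability predicted classes.
top2 = np.argsort(probs, axis=1)[:, -2:]
top2_error = np.mean([y not in row for y, row in zip(y_true, top2)])
print(acc, prec, rec, f1, top2_error)
```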
Papaya Fruit Ripeness
Classification
Chapter 4
Multimodal Deep Learning Framework
Fig. 4.1. Multimodal deep learning framework for papaya fruit ripeness classification.
Multimodality via Feature
Concatenation (MDL-FC)
In this subsection, the study
(1) systematically implemented
and evaluated a multimodal deep
learning architecture, specifically
deep CNNs, with respect to
precision, recall, accuracy, top-2
error rate, F1-score, depth, and
number of parameters; and
(2) suggested a multimodal deep
CNN that acquires attributes
from two sensing systems,
specifically, an RGB image and a
hyperspectral reflectance data
cube.

Figure 4.2. Multimodal deep learning via feature concatenation for non-destructive papaya fruit ripeness estimation.
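The sketch below illustrates the feature-concatenation idea in Keras with two convolutional branches, one per modality, joined before the classifier. The layer sizes are assumptions for illustration and do not reproduce the exact MD-AlexNet or MD-VGG16 configurations.

```python
# Minimal two-branch feature-concatenation sketch: one branch for the
# 32 x 32 x 3 RGB input, one for the 32 x 32 x 150 HS cube; their flattened
# features are concatenated before the six-class softmax classifier.
from tensorflow.keras import layers, models, Input

def branch(inp):
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D()(x)
    return layers.Flatten()(x)

rgb_in = Input(shape=(32, 32, 3), name="rgb")
hs_in = Input(shape=(32, 32, 150), name="hyperspectral")

features = layers.concatenate([branch(rgb_in), branch(hs_in)])  # feature concatenation
x = layers.Dense(256, activation="relu")(features)
out = layers.Dense(6, activation="softmax")(x)                  # six ripeness stages

model = models.Model(inputs=[rgb_in, hs_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```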
Multimodality via Feature
Concatenation (MDL-FC)
A multimodal deep learning architecture of AlexNet (MD-AlexNet)
Multimodality via Feature
Concatenation (MDL-FC)
Multimodality via Feature Concatenation (MDL-FC)
Performance Results
Figure 4.5. Performance results of the preliminary training of a 2L CNN using (a) RGB dataset only and
(b) HS images dataset only.
Multimodality via Feature Concatenation (MDL-FC)
Performance Results
Figure 4.6. Performance results of the preliminary training of a 2L CNN with RGB+HS dataset for
(a) 100 and (b) 300 epochs.
Multimodality via Feature Concatenation (MDL-FC)
Performance Results
Multimodality via Feature Concatenation (MDL-FC)
Performance Results
Multimodality via Feature
Concatenation (MDL-FC)
Performance Results
Multimodality via Late
Fusion (MDL-LF)
In this subsection,
(1) a multimodal deep learning via
late fusion (MDL-LF)
framework for non-destructive
classification of papaya fruit
maturity using HS and RGB
images will be discussed,
(2) batch size (B), learning rate (lr),
and number of epochs (E)
parameter setup experiment
will be inspected, and
(3) the training and prediction
performance of deep CNNs
when integrated into the MDL-
LF architecture will be
examined.
Fig. 4.8. Multimodal deep learning via late fusion for non-destructive
papaya fruit ripeness estimation.
Multimodality via Late
Fusion (MDL-LF)
Multinomial logistic regression (MLR) as
meta-learner:
• can produce probabilities of multiple
independent variables belonging to
multiple categories
• makes no assumptions on normality
and linearity, and
• can select features that can
significantly improve classification
performance even when using
hyperspectral data [63].
Multimodality via Late
Fusion (MDL-LF)
Multinomial logistic regression (MLR)
learns a set of K weight vectors
and biases:
$p(C_k \mid \theta) = y_k(\theta) = \dfrac{\exp(a_k)}{\sum_j \exp(a_j)}$

where $a_k = w_k^T \theta + b_k$ is the activation function for $k = 1, \ldots, K$ classes,
$w_k = (w_{k1}, w_{k2}, \ldots, w_{kM})^T$ for $M$ input variables,
$\theta = (\theta_1, \theta_2, \ldots, \theta_M)^T$, and
$C_k$ represents the 1-of-K scheme for each class $k$.
Fig. 4.9. Multinomial logistic regression implementation structure as
meta-learner of the MDL-LF framework.
$a_k = w_k^T x_r + w_k^T x_h + b_k$
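A minimal sketch of how such a meta-learner could be fitted on the base learners' probability outputs; the arrays below are toy stand-ins for the RGB and HS CNN outputs, not the actual predictions of this study.

```python
# Late-fusion meta-learner sketch: concatenate the class-probability vectors
# of the two unimodal base CNNs (x_r from the RGB network, x_h from the HS
# network) and fit a multinomial (softmax) logistic regression over them.
import numpy as np
from sklearn.linear_model import LogisticRegression

n_samples, n_classes = 200, 6
x_r = np.random.dirichlet(np.ones(n_classes), size=n_samples)  # toy RGB CNN softmax outputs
x_h = np.random.dirichlet(np.ones(n_classes), size=n_samples)  # toy HS CNN softmax outputs
y = np.random.randint(0, n_classes, size=n_samples)            # toy ripeness labels

meta_features = np.hstack([x_r, x_h])       # theta = [x_r ; x_h]
meta = LogisticRegression(max_iter=1000)     # multinomial logistic regression
meta.fit(meta_features, y)
fused_pred = meta.predict(meta_features)
```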
Multimodality via Late
Fusion (MDL-LF)
Parameter setting experiment
Multimodality via Late
Fusion (MDL-LF)
Parameter Setting Experiment
Results
• Batch size
• Number of Epochs
• Learning Rate
Unimodal base learners’ performance
Multimodal Deep Learning via Late Fusion (MDL-LF)
Multimodality via End-to-End
Deep CNN
In this subsection,
(1) novel deep CNNs based on
MDL-LF framework using
two modalities, which are HS
and RGB images, will be
explained, and
(2) the training and prediction
performance of these novel
deep CNNs for non-
destructive classification of
papaya fruit ripeness will be
examined.

Fig. 4.13. End-to-end deep CNN implementation of MDL-LF (50 bands).
Multimodality via End-to-End
Deep CNN
• Seven E2E deep CNNs were
built for this experiment.
• The blocking characteristic of
the VGG architecture and its
use of 3x3 kernels are
adopted in these models; 2x2
kernels are also explored to
allow increasing the network's
depth (see the block-construction sketch after this list).
• Example graphical plot on the
right.
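An illustrative Keras sketch of the VGG-style block construction described above; the number of blocks, filter counts, and fusion point are assumptions and do not match the exact 3B5L or 6B5L MD-CNN configurations.

```python
# Sketch of a VGG-style "block": two 3 x 3 (or 2 x 2) convolutions followed by
# pooling, stacked per modality and fused before the six-class classifier.
from tensorflow.keras import layers, models, Input

def conv_block(x, filters, kernel=3):
    x = layers.Conv2D(filters, kernel, activation="relu", padding="same")(x)
    x = layers.Conv2D(filters, kernel, activation="relu", padding="same")(x)
    return layers.MaxPooling2D()(x)

rgb_in = Input(shape=(32, 32, 3), name="rgb")
hs_in = Input(shape=(32, 32, 50), name="hyperspectral")  # 50 selected bands

streams = []
for inp in (rgb_in, hs_in):
    x = conv_block(inp, 32)
    x = conv_block(x, 64)
    streams.append(layers.Flatten()(x))

x = layers.concatenate(streams)
x = layers.Dense(128, activation="relu")(x)
out = layers.Dense(6, activation="softmax")(x)

model = models.Model([rgb_in, hs_in], out)
```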
Multimodality via End-to-End
Deep CNN
• Seven E2E deep CNNs were
built for this experiment.
Multimodality via End-to-End
Deep CNN
• Seven E2E deep CNNs were
built for this experiment.
• Performance graphs of the
top-performing MD-CNNs:
(Fig. 4.14.b) 3B5L MD-CNN
(top), and (Fig. 4.14.d) 6B5L
MD-CNN (2x2 kernel)
(bottom).
• Consistent training and
validation loss patterns
Multimodality via End-to-End
Deep CNN
• Seven E2E deep CNNs were
built for this experiment.
• Confusion matrices of the top-
performing MD-CNNs: (Fig.
4.14.b) 3B5L MD-CNN (left),
and (Fig. 4.14.d) 6B5L MD-
CNN (2x2 kernel) (right).
• Relatively high top-2 accuracy
Multimodality via End-to-End Deep CNN
Performance Results of the E2E Deep CNN
Benchmarking of Results
Conclusion
Chapter 5
Conclusion
• Deep learning and machine learning are being applied in a
variety of fields, including but not limited to precision and
smart agriculture, nutrition, food safety, and quality
assurance.
• Estimating a fruit's maturity or ripeness level can help assess
the vitamins, minerals, and other nutrients that change
during ripening.
• However, accurately classifying a papaya fruit based on the
six ripeness stages standard remains a challenge since most
changes happen internally rather than to the external
characteristics.
• Using internal properties in classification would require
destructive and time-consuming laboratory tests.
Conclusion
• With the emergence of deep learning and imaging
technologies, high-dimensional data that correlate with the
internal and external characteristics of an object, such as the
data produced by hyperspectral cameras, can be processed
to perform a high-level intelligent classification task without
impairing the fruit.
• For fruit maturity classification, specifically of papaya fruit,
we have introduced two main multimodal deep learning
frameworks, using feature concatenation and late fusion,
that were implemented in this study.
• A hyperspectral imaging and a digital imaging system were
used to acquire hyperspectral data cubes and RGB images of
papaya fruit samples, respectively, at six ripeness stages
from the unripe stage to the overripe stage. This established a new
multimodal raw data set—a major contribution of this study.
Conclusion
• Both the feature concatenation and late fusion approaches
demonstrated outstanding results in papaya fruit
ripeness classification.
• Feature Concatenation:
• MD-VGG16 achieved F1-score = 0.90
• Late Fusion:
• VGG16-AlexNet-MLR obtained F1-score = 0.97
• 8L End-to-End MD-CNN produced F1-score = 0.97
Conclusion
• The proposed methods are quite promising for this type of
application, according to the results.
• However, there are still significant issues that the
hyperspectral classification research community must
address. Because of the high dimensionality of the data
generated by hyperspectral imaging, the number of
experiments that can be performed is limited by the
available computational resources.
• The cost of the imaging equipment is also a major issue,
particularly when adopting the technology for agricultural
purposes; low-cost hyperspectral imaging is now gaining
attention in research and development.
Conclusion
• In general, multimodality through imaging systems
combined with advanced deep CNN models can non-
destructively classify fruit ripeness even at finer levels such as
six ripeness stages.
• The findings in this research demonstrated the potential of
multimodal imaging systems and multimodal deep learning
in real-time classification of fruit ripeness in the production
site.
Thank you.
Stay safe and healthy.
God bless us.

Contenu connexe

Similaire à 口試簡報.pptx

Stem Removal of Citrus Fruit by Finest StemcompleteRm Thresholding Method
Stem Removal of Citrus Fruit by Finest StemcompleteRm Thresholding MethodStem Removal of Citrus Fruit by Finest StemcompleteRm Thresholding Method
Stem Removal of Citrus Fruit by Finest StemcompleteRm Thresholding MethodIJCSIS Research Publications
 
LEAF DISEASE IDENTIFICATION AND REMEDY RECOMMENDATION SYSTEM USINGCNN
LEAF DISEASE IDENTIFICATION AND REMEDY RECOMMENDATION SYSTEM USINGCNNLEAF DISEASE IDENTIFICATION AND REMEDY RECOMMENDATION SYSTEM USINGCNN
LEAF DISEASE IDENTIFICATION AND REMEDY RECOMMENDATION SYSTEM USINGCNNIRJET Journal
 
Leaf Disease Detection Using Image Processing and ML
Leaf Disease Detection Using Image Processing and MLLeaf Disease Detection Using Image Processing and ML
Leaf Disease Detection Using Image Processing and MLIRJET Journal
 
Plant Monitoring using Image Processing, Raspberry PI & IOT
 	  Plant Monitoring using Image Processing, Raspberry PI & IOT 	  Plant Monitoring using Image Processing, Raspberry PI & IOT
Plant Monitoring using Image Processing, Raspberry PI & IOTIRJET Journal
 
Plant disease detection system using image processing
Plant disease detection system using image processingPlant disease detection system using image processing
Plant disease detection system using image processingIRJET Journal
 
Techniques of deep learning and image processing in plant leaf disease detect...
Techniques of deep learning and image processing in plant leaf disease detect...Techniques of deep learning and image processing in plant leaf disease detect...
Techniques of deep learning and image processing in plant leaf disease detect...IJECEIAES
 
Mini Pro PPT.pptx
Mini Pro PPT.pptxMini Pro PPT.pptx
Mini Pro PPT.pptxToMuCh
 
REAL FRUIT DEFECTIVE DETECTION BASED ON IMAGE PROCESSING TECHNIQUES USING OPENCV
REAL FRUIT DEFECTIVE DETECTION BASED ON IMAGE PROCESSING TECHNIQUES USING OPENCVREAL FRUIT DEFECTIVE DETECTION BASED ON IMAGE PROCESSING TECHNIQUES USING OPENCV
REAL FRUIT DEFECTIVE DETECTION BASED ON IMAGE PROCESSING TECHNIQUES USING OPENCVIRJET Journal
 
Paper id 42201614
Paper id 42201614Paper id 42201614
Paper id 42201614IJRAT
 
AUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEM
AUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEMAUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEM
AUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEMijcsa
 
AUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEM
AUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEMAUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEM
AUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEMijcsa
 
Improved vision-based diagnosis of multi-plant disease using an ensemble of d...
Improved vision-based diagnosis of multi-plant disease using an ensemble of d...Improved vision-based diagnosis of multi-plant disease using an ensemble of d...
Improved vision-based diagnosis of multi-plant disease using an ensemble of d...IJECEIAES
 
Plant Leaf Disease Detection and Classification Using Image Processing
Plant Leaf Disease Detection and Classification Using Image ProcessingPlant Leaf Disease Detection and Classification Using Image Processing
Plant Leaf Disease Detection and Classification Using Image ProcessingIRJET Journal
 
FRUIT DISEASE DETECTION AND CLASSIFICATION USING ARTIFICIAL INTELLIGENCE
FRUIT DISEASE DETECTION AND CLASSIFICATION USING ARTIFICIAL INTELLIGENCEFRUIT DISEASE DETECTION AND CLASSIFICATION USING ARTIFICIAL INTELLIGENCE
FRUIT DISEASE DETECTION AND CLASSIFICATION USING ARTIFICIAL INTELLIGENCEIRJET Journal
 
Android application for detection of leaf disease (Using Image processing and...
Android application for detection of leaf disease (Using Image processing and...Android application for detection of leaf disease (Using Image processing and...
Android application for detection of leaf disease (Using Image processing and...IRJET Journal
 
AN IMAGE PROCESSING APPROACHES ON FRUIT DEFECT DETECTION USING OPENCV
AN IMAGE PROCESSING APPROACHES ON FRUIT DEFECT DETECTION USING OPENCVAN IMAGE PROCESSING APPROACHES ON FRUIT DEFECT DETECTION USING OPENCV
AN IMAGE PROCESSING APPROACHES ON FRUIT DEFECT DETECTION USING OPENCVIRJET Journal
 
7743-Article Text-13981-1-10-20210530 (1).pdf
7743-Article Text-13981-1-10-20210530 (1).pdf7743-Article Text-13981-1-10-20210530 (1).pdf
7743-Article Text-13981-1-10-20210530 (1).pdfNehaBhati30
 
IRJET-Android Based Plant Disease Identification System using Feature Extract...
IRJET-Android Based Plant Disease Identification System using Feature Extract...IRJET-Android Based Plant Disease Identification System using Feature Extract...
IRJET-Android Based Plant Disease Identification System using Feature Extract...IRJET Journal
 
Foliage Measurement Using Image Processing Techniques
Foliage Measurement Using Image Processing TechniquesFoliage Measurement Using Image Processing Techniques
Foliage Measurement Using Image Processing TechniquesIJTET Journal
 

Similaire à 口試簡報.pptx (20)

Stem Removal of Citrus Fruit by Finest StemcompleteRm Thresholding Method
Stem Removal of Citrus Fruit by Finest StemcompleteRm Thresholding MethodStem Removal of Citrus Fruit by Finest StemcompleteRm Thresholding Method
Stem Removal of Citrus Fruit by Finest StemcompleteRm Thresholding Method
 
LEAF DISEASE IDENTIFICATION AND REMEDY RECOMMENDATION SYSTEM USINGCNN
LEAF DISEASE IDENTIFICATION AND REMEDY RECOMMENDATION SYSTEM USINGCNNLEAF DISEASE IDENTIFICATION AND REMEDY RECOMMENDATION SYSTEM USINGCNN
LEAF DISEASE IDENTIFICATION AND REMEDY RECOMMENDATION SYSTEM USINGCNN
 
Leaf Disease Detection Using Image Processing and ML
Leaf Disease Detection Using Image Processing and MLLeaf Disease Detection Using Image Processing and ML
Leaf Disease Detection Using Image Processing and ML
 
Plant Monitoring using Image Processing, Raspberry PI & IOT
 	  Plant Monitoring using Image Processing, Raspberry PI & IOT 	  Plant Monitoring using Image Processing, Raspberry PI & IOT
Plant Monitoring using Image Processing, Raspberry PI & IOT
 
Weed Detection Using Convolutional Neural Network
Weed Detection Using Convolutional Neural NetworkWeed Detection Using Convolutional Neural Network
Weed Detection Using Convolutional Neural Network
 
Plant disease detection system using image processing
Plant disease detection system using image processingPlant disease detection system using image processing
Plant disease detection system using image processing
 
Techniques of deep learning and image processing in plant leaf disease detect...
Techniques of deep learning and image processing in plant leaf disease detect...Techniques of deep learning and image processing in plant leaf disease detect...
Techniques of deep learning and image processing in plant leaf disease detect...
 
Mini Pro PPT.pptx
Mini Pro PPT.pptxMini Pro PPT.pptx
Mini Pro PPT.pptx
 
REAL FRUIT DEFECTIVE DETECTION BASED ON IMAGE PROCESSING TECHNIQUES USING OPENCV
REAL FRUIT DEFECTIVE DETECTION BASED ON IMAGE PROCESSING TECHNIQUES USING OPENCVREAL FRUIT DEFECTIVE DETECTION BASED ON IMAGE PROCESSING TECHNIQUES USING OPENCV
REAL FRUIT DEFECTIVE DETECTION BASED ON IMAGE PROCESSING TECHNIQUES USING OPENCV
 
Paper id 42201614
Paper id 42201614Paper id 42201614
Paper id 42201614
 
AUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEM
AUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEMAUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEM
AUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEM
 
AUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEM
AUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEMAUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEM
AUTOMATIC FRUIT RECOGNITION BASED ON DCNN FOR COMMERCIAL SOURCE TRACE SYSTEM
 
Improved vision-based diagnosis of multi-plant disease using an ensemble of d...
Improved vision-based diagnosis of multi-plant disease using an ensemble of d...Improved vision-based diagnosis of multi-plant disease using an ensemble of d...
Improved vision-based diagnosis of multi-plant disease using an ensemble of d...
 
Plant Leaf Disease Detection and Classification Using Image Processing
Plant Leaf Disease Detection and Classification Using Image ProcessingPlant Leaf Disease Detection and Classification Using Image Processing
Plant Leaf Disease Detection and Classification Using Image Processing
 
FRUIT DISEASE DETECTION AND CLASSIFICATION USING ARTIFICIAL INTELLIGENCE
FRUIT DISEASE DETECTION AND CLASSIFICATION USING ARTIFICIAL INTELLIGENCEFRUIT DISEASE DETECTION AND CLASSIFICATION USING ARTIFICIAL INTELLIGENCE
FRUIT DISEASE DETECTION AND CLASSIFICATION USING ARTIFICIAL INTELLIGENCE
 
Android application for detection of leaf disease (Using Image processing and...
Android application for detection of leaf disease (Using Image processing and...Android application for detection of leaf disease (Using Image processing and...
Android application for detection of leaf disease (Using Image processing and...
 
AN IMAGE PROCESSING APPROACHES ON FRUIT DEFECT DETECTION USING OPENCV
AN IMAGE PROCESSING APPROACHES ON FRUIT DEFECT DETECTION USING OPENCVAN IMAGE PROCESSING APPROACHES ON FRUIT DEFECT DETECTION USING OPENCV
AN IMAGE PROCESSING APPROACHES ON FRUIT DEFECT DETECTION USING OPENCV
 
7743-Article Text-13981-1-10-20210530 (1).pdf
7743-Article Text-13981-1-10-20210530 (1).pdf7743-Article Text-13981-1-10-20210530 (1).pdf
7743-Article Text-13981-1-10-20210530 (1).pdf
 
IRJET-Android Based Plant Disease Identification System using Feature Extract...
IRJET-Android Based Plant Disease Identification System using Feature Extract...IRJET-Android Based Plant Disease Identification System using Feature Extract...
IRJET-Android Based Plant Disease Identification System using Feature Extract...
 
Foliage Measurement Using Image Processing Techniques
Foliage Measurement Using Image Processing TechniquesFoliage Measurement Using Image Processing Techniques
Foliage Measurement Using Image Processing Techniques
 

Dernier

UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSSIVASHANKAR N
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...ranjana rawat
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINESIVASHANKAR N
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Call Girls in Nagpur High Profile
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxhumanexperienceaaa
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...ranjana rawat
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 

Dernier (20)

UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLSMANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
MANUFACTURING PROCESS-II UNIT-5 NC MACHINE TOOLS
 
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
(TARA) Talegaon Dabhade Call Girls Just Call 7001035870 [ Cash on Delivery ] ...
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINEMANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
MANUFACTURING PROCESS-II UNIT-2 LATHE MACHINE
 
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...Top Rated  Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
Top Rated Pune Call Girls Budhwar Peth ⟟ 6297143586 ⟟ Call Me For Genuine Se...
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
9953056974 Call Girls In South Ex, Escorts (Delhi) NCR.pdf
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptxthe ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
the ladakh protest in leh ladakh 2024 sonam wangchuk.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 

口試簡報.pptx

  • 1. Multimodal Deep Convolutional Neural Networks for Non-destructive Papaya Fruit Ripeness Classification Using Digital and Hyperspectral Imaging Systems Cinmayii A. Garillos-Manliguez (秦瑪怡) Ph.D. Candidate Prof. John Y. Chiang (蔣依吾 教授) Advisor, Image Processing Laboratory Department of Computer Science and Engineering
  • 2. Outline • Introduction • Review of Literature • Imaging Systems • Hyperspectral Image Classification • Deep Learning • Multimodality • Fruit Ripeness Classification • Materials and Methods • Data Gathering • Digital Image Acquisition • Hyperspectral DataAcquisition • Computing Environment • Performance Evaluation Metrics • Papaya Fruit Ripeness Classification • Multimodal Deep Learning Framework • Multimodality via Feature Concatenation • Multimodality via Late Fusion • Multimodality via End-to-end Deep Convolutional Neural Network • Benchmarking of Results • Conclusion
  • 4. Introduction  Artificial intelligence, computer vision, and its applications involve highly interdisciplinary research.  The impact of these areas in computer science continues to influence the technological advances in our society today.  Image classification is a challenging task yet it plays a valuable role in computer vision since it is the basis of higher-order operations
  • 5. Introduction ImageClassification  Image classification has gained much attention in various applications including biomedical, medical and agricultural applications.  It performs a critical function in cancer detection and disease diagnosis because the results are life- changing.  The classification result is also significant in the agricultural sector because it determines the market distribution of the produce which affects the potential gain or loss in the local or national revenue.
  • 6. Introduction  Images used in these applications come from sensing or imaging devices that acquire the electromagnetic energy emitted by the target object.  PET images of bone scans from gamma ray imaging  X-ray images from X-ray tomography  UV light images used in fluorescence microscopy  Spaceborne radar images from microwave bands  Magnetic resonance imaging (MRI) images in medicine  Images convey information thru computer vision: size, shape, patterns, and other salient features from RGB or hyperspectral images Fig. 1.1. The electromagnetic spectrum and the visible light range (Lifted from [2]).
  • 7. Introduction  Image classification techniques are successful in unimodal or single modality implementations.  Modality in human-computer interaction is defined as the categorization of a distinct individual channel of sensory input or output between a computer and a human.  Multimodal learning, a recent development inAI, involves multiple data types or input modalities to perform a given task.
  • 8. Introduction Fruit RipenessClassification  Fruits have vital role in human health since these are abundant in antioxidants, nutrients, vitamins, minerals, and fiber that the human body needs.  Consumers prefer high quality, blemish-free, and safe fresh fruits.  Papaya, Carica papaya L., a major tropical fruit worldwide, gives many health and economic benefits.
  • 9. Introduction  Fruit ripeness/maturity – physiological development as the fruit ripens even after harvest.  Under-mature fruit – reduced taste quality  Over-mature fruit – overripening during transport, prone to mechanical damages, increase susceptibility to decay and microbial growth  Critical factor to preserve post-harvest life, reduce waste, and increase fruit quality  Food losses - more than 40% occur at postharvest and processing levels in developing countries, while more than 40% occur at retail and consumer levels in industrialized countries
  • 10. Introduction Efficient Fruit MaturityAssessment andGrading  Well-trained personnel with experience – human error  Traditional laboratory processes – destructive and time- consuming  Non-invasive methods through computer vision (CV) techniques, artificial intelligence (AI), sensors and imaging technologies provided promising results while keeping the fruit in good shape, which help reduce post-harvest losses  Hyperspectral imaging has become an emerging scientific instrument for non-destructive fruit and vegetable quality assessment in recent years]
  • 11. Main Objective This dissertation primarily aims to propose a high- performance multimodal framework of deep convolutional neural network for non-destructive papaya fruit ripeness classification that make use of digital and hyperspectral imaging technology.
  • 12. Key Contributions (1) An original dataset of hyperspectral data cubes and digital (RGB) photos of papaya fruits, which are categorized into six ripeness stages, acquired within a laboratory environment; (2) An analysis of multimodality in deep learning through implementing feature concatenation approach and late fusion of imaging-specific deep CNNs to an agricultural application, particularly to non-destructive papaya fruit ripeness classification assessed in terms of F1-score, accuracy, top- 2 error rate, precision, recall, training time, and average prediction time; and (3) A multimodal deep CNN architecture producing outstanding performance by discovering abstract representations of two modalities, which are an RGB image and a hyperspectral reflectance data cube of papaya fruit.
  • 13. Summary and Organization The remaining parts of this paper is organized as: Section (2) presents the studies related to this paper Section (3) provides a description of a multimodal deep learning architecture with feature concatenation and late fusion implementations for papaya fruit classification using hyperspectral reflectance data set and RGB images; Section (4) discusses the experiments, results of the experimental work and benchmarking; and Section (5) concludes this work and suggests future research directions.
  • 15. Review of Literature • Machine learning is a primary contributor in the advancement of AI • Deep learning is very streamlined nowadays despite the challenges in cost and infrastructure. • For instance, its applications in agriculture include detection and classification of fruits, vegetables, crops, and weeds in the field or inside the laboratory or factories [5], [17], [18], estimation of fruit maturity or ripeness using transfer learning [19], and detection of bruise or damages on fruits [20].
  • 16. Review of Literature Imaging Systems Imaging technologies like hyperspectral imaging (HSI) system and visible-light imaging have been widely investigated as non-destructive options for food production and agricultural applications [11][12] for: • fruit maturity or ripeness estimation [11], • automatic fruit grading [9], [36], • disease detection [26], [27], • disease classification [26], [28], and • agro-food product inspection [21].
  • 17. Review of Literature Imaging Systems Table 2.1. Summary of grading characteristics that can be identified and assessed using visible and hyperspectral imaging based on these review papers [15], [21], [22]. Hyperspectral Imaging Visible Imaging Soluble solid content Firmness Dry matter Water content Total soluble solids Titratable acidity Defect (Bruise) Decay (Sour skin, canker) Contamination Insect damage Mealiness Color Maturity (pre-harvest time) Anthocyanins Total chlorophyll Lycopene Total phenols Total glucosilonates Polyphenol oxidase activity Ascorbic acid Carotenoid Grading by color (color evaluation) Grading by external quality Sorting by external quality Color and size classification Shape grading Irregularity evaluation and sorting Length Width Thickness Texture Firmness
  • 18. Review of Literature Hyperspectral Image Classification (HIC) • Hyperspectral imaging is a powerful tool in acquisition of highly dimensional data with a high level of precision • HIC uses hyperspectral band segments (H) in m × n × h three-dimensional images, where m is the length and n is the width of the spatial dimension and h is the number of bands • Remote sensing applications. Each band is composed of points in a 3D matrix (x, y, and z positions), its coordinates in longitude and latitude and the intensity value which relates to its radiance.
  • 19. Review of Literature Single-image classification • Contrary to remote sensing where one image is composed of many labels, a hypercube containing the entire object to be classified may only contain one label or class. • This is the case in this study: each HS data cube is categorized into one maturity stage, which is a very challenging task because of its high dimensionality and requires a large memory capacity and higher computing resources.
  • 20. Review of Literature Single-image classification • Damage detection on blueberries [20]: • spectral data ranging from 328.81 nm up to 1113.54 nm • reduce the spatial channels from 1002 to 151 by subsampling, and then, reducing the image size to 32x32 pixels to reduce computational cost • Fruit classification [39] • Fruit ripeness such as persimmon [40], strawberry [41], [42], blueberry [43], cherry [44], kiwi [45], banana [46], and mango [47], [48].
  • 21. Review of Literature RelatedWorks on Single-imageClassification for Fruit Ripeness Fruit Ripeness Classification Wavelength range (nm) Methodology Classification Result Persimmon [40] 400 – 1000 Linear Discriminant Analysis (LDA) on three feature wavelengths (4 ripeness stages) 95.3% Strawberry[41],[42] 380 – 1030 and 874 – 1734 [41] 370 – 1015 [42] SVMs on wavelengths selected by PCA (3 ripeness stages) [41] AlexNet on wavelengths chosen by PCA (2 ripeness stages) [42] 85% 98.6% Cherry [44] 874 – 1734 Genetic algorithm (GA) and multiple linear regression (MLR) 96.4% Mango [47], [48] 411.3 - 867.0 [48] Unmanned Ground Vehicle (UGV) with GA and SVM PLS-DA and CNN [48] 0.69 0.97 Papaya (This study) 470 – 900 Multimodal Deep Convolutional Neural Networks
  • 22. Review of Literature Deep Learning • Deep learning is the prime mover behind breakthroughs in processing single modalities, whether sequential, like text or speech/audio, or discrete, such as images and videos. • The convolutional layers in a deep convolutional neural network (deep CNN) enable it to learn the patterns that characterize an image while requiring minimal human feature engineering.
  • 23. Deep Learning Key features of the deep convolutional neural networks used in this study:
  AlexNet • Consists of eight layers; faster to train and suited to real-world applications • ReLU nonlinearity, overlapping pooling, dropout
  VGG16 and VGG19 • State-of-the-art methods in large-scale image recognition tasks • Very small 3×3 convolution filters • Increased depth, richer representations, improved network accuracy
  ResNet • Implements residual learning to resolve the degradation problem • Highly modularized networks • Stack of blocks (layer groups)
  ResNeXt • Homogeneous, multi-branch architecture • Repeated building blocks with the same configuration • Introduced a new dimension: cardinality
  MobileNet • Mobile and embedded applications • Depth-wise separable convolutions (DSC)
  MobileNetV2 • Memory-efficient implementation • Inverted residual structure • Thin bottleneck layers have shortcuts, and lightweight DSCs are used (see the depth-wise separable convolution sketch below)
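  For reference, the short sketch below shows the factorization behind a depth-wise separable convolution as popularized by MobileNet: a per-channel spatial filter followed by a 1×1 pointwise convolution. It is written with Keras layers; the input shape and filter count are illustrative only.

```python
from tensorflow.keras import layers

# Depth-wise separable convolution = depth-wise spatial filtering + 1x1 pointwise mixing.
# Input shape and filter count below are illustrative.
inp = layers.Input(shape=(32, 32, 3))
x = layers.DepthwiseConv2D(kernel_size=3, padding='same', activation='relu')(inp)
x = layers.Conv2D(64, kernel_size=1, activation='relu')(x)  # pointwise 1x1 convolution
```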
  • 24. Review of Literature Multimodality • Deep learning in multimodality is characterized by its ability to automatically extract specialized features from an input instance. • Feature concatenation (FC) and ensemble method (EM) are two general approaches to implement multimodal learning schemes.
  • 25. Review of Literature Multimodality • Feature concatenation (FC): a single feature vector is generated by integrating data from several different modalities at an early phase. • Ensemble method (EM): a late-fusion approach that trains on and learns the characteristics of each modality separately before integrating the results.
  • 26. Review of Literature Multimodality • Multimodality overcomes the limitation of RGB images in providing 3D indicators by integrating infrared and depth images to obtain 2D local features and a 3D sparse point cloud for geometric verification (F1-score of 0.77) [3]. • Multimodality is also used for detection and classification of fruits in real time [5]. That system employs two mechanisms, early fusion (0.799) and late fusion (0.838), to combine RGB and NIR images. • In this work, the author implemented both feature concatenation and late fusion learning strategies on deep CNNs to enable multimodality.
  • 29. Methodology • Papaya fruit (Carica papaya L.) samples, 253 pieces in total, were obtained from Kaohsiung Market in Gushan District, Taiwan and were ripened in a controlled environment. • Data were collected on Day 1, Day 3, Day 5, and Day 7 to track ripening according to [59]. • Digital RGB images were taken from at least four sides (front, back, left, and right) and two orientations per side (horizontal and diagonal).
  • 30. Methodology • The Philippine National Standard (PNS/BAFPS 33:2005) is the basis of the ripeness classification used for the ground truth data [59]. • Sample images of the papaya fruits classified into these maturity stages are shown in Figure 3.2. • The descriptions of these stages are summarized in Table 3.1.
  • 31. Digital Image Acquisition The camera specifications used in this setting are: • Canon EOS 100D • approximately 18-megapixel resolution • 5184 × 3456 pixels image size • DIGIC 5 image processor • Tethering cable and utility software for remote capture of images Light sources were added in the laboratory setup to illuminate the object and eliminate shadows and other noise.
  • 32. Digital Image Acquisition • Fig. 3.4. Example RGB images of sampled papaya fruits. • Each row shows 10 RGB images from different sides and orientations of papaya fruits classified into one of the six ripeness stages: MS1 fruits in the first row, MS2 in the second row, and so on until MS6, or overripe fruits, in the last row. • Based on [20] and due to computational resource constraints, the image size is reduced to 32 × 32 pixels (a resizing sketch follows below).
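  A minimal resizing sketch is shown below using Pillow; the file name and resampling filter are illustrative assumptions rather than the exact pipeline used in this study.

```python
from PIL import Image

# Downscale one RGB photo to the 32 x 32 input size used for training.
# The file name and resampling filter here are illustrative choices.
img = Image.open('papaya_sample.jpg')
small = img.resize((32, 32), Image.BILINEAR)
small.save('papaya_sample_32.png')
```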
  • 33. Hyperspectral Image Acquisition • A visible/near-infrared (VNIR) hyperspectral imaging system with 150 bands covering the 470 nm to 900 nm wavelength range is used in this study. • It is composed of an imaging unit, an active cooling system, halogen light sources, a platform, and a computer. • It uses line-scan sensor technology to acquire HS data through a push-broom scanning approach. • The setup is shown in Figure 3.5.
  • 34. Hyperspectral Image Acquisition Figure 3.6. Hyperspectral data acquisition and preprocessing. (a) A 3D visualization of the raw HS data, called a hyperspectral data cube, of a papaya sample. (b) A 2D lateral view of a HS image with markers on the locations of the nine 32 × 32 × 150 patches that were extracted from the raw data cube [20], [59], [62]. (Figure annotations: Apex, Peduncle.)
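  A sketch of how such fixed-size patches could be cut from a data cube is given below; the patch centers are laid out on an arbitrary 3 × 3 grid for illustration, whereas in Figure 3.6 the nine locations are marked on each sample.

```python
import numpy as np

def extract_patches(cube, centers, size=32):
    """Cut size x size x bands patches centered on the given (row, col) positions."""
    half = size // 2
    return np.stack([cube[r - half:r + half, c - half:c + half, :] for r, c in centers])

# Illustrative 3 x 3 grid of patch centers over a synthetic cube
# (the actual locations are marked per sample, as in Fig. 3.6).
cube = np.random.rand(300, 200, 150).astype(np.float32)
centers = [(r, c) for r in (60, 150, 240) for c in (50, 100, 150)]
patches = extract_patches(cube, centers)
print(patches.shape)  # (9, 32, 32, 150)
```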
  • 35. Hyperspectral Image Preprocessing Clipping • Outliers or extremely large values are present in some images due to the scene and camera geometry [61]. • Clipping removes this noise and keeps the values within the range [0, 1]. Fig. 3.7. Histograms from a sample band of a HS image before (a) and after (b) clipping the extreme values.
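  A minimal clipping sketch is shown below, assuming the reflectance values are already scaled so that valid values lie in [0, 1] and only the outliers need to be clamped.

```python
import numpy as np

def clip_reflectance(band, low=0.0, high=1.0):
    """Clamp outliers caused by scene/camera geometry into the [0, 1] range."""
    return np.clip(band, low, high)

band = np.array([[-0.02, 0.35], [1.80, 0.97]])
print(clip_reflectance(band))  # [[0.   0.35] [1.   0.97]]
```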
  • 36. Hyperspectral Image Preprocessing • Per ripeness stage, both the HSd and RGB data sets contain 576 samples labeled MS1, 666 labeled MS2, 792 labeled MS3, 720 labeled MS4, 909 labeled MS5, and 945 labeled MS6, as shown in Figure 3.8. • The database comprises 4,608 three-channel RGB images and 4,608 HS data cubes (HSd) with 150 bands within the 470 nm to 900 nm wavelength range. In total, this study produced 9,216 HSd and RGB data entries with 32 × 32 pixel dimensions for each channel.
  • 37. Hyperspectral Image Preprocessing • Further visualization of the spectral patterns and the RGB image set can be observed in Figures 3.9 and 3.10. • Ripeness levels are distinct in the 550 nm to 700 nm range.
  • 38. Hyperspectral Image Preprocessing • Further visualization of the spectral patterns and the RGB image set can be observed in Figures 3.9 and 3.10. • The line graphs of the mean and median across the bands of each patch coincide most of the time, indicating that the data are largely symmetric and non-skewed. Fig. 3.10. Mean and median of the reflectance values of the nine HSd patches from a sample papaya fruit with MS1 label.
  • 39. Computing Environment • Python programming language, Anaconda, TensorFlow-GPU, Keras, and other packages for deep learning model development • Intel Core i5-9300H 2.40 GHz CPU (8 CPUs), NVIDIA GeForce GTX 1660 Ti 6GB GPU, and 8 GB memory space running on Windows 10 Home 64-bit (10.0, Build 18362)
  • 40. Performance Evaluation Metrics • The multimodal models implemented in this dissertation are assessed in terms of solution quality (precision, recall, top-2 error rate, accuracy, and F1-score) and solution time (training time and prediction time). • Accuracy refers to the percentage of samples that are correctly labeled among all samples examined and is calculated as: Accuracy = (TP + TN) / (TP + FP + FN + TN) • Precision (P) measures how many of the samples predicted as a given class actually belong to that class, while recall (R) measures the proportion of samples of a class that the classifier correctly identifies: Precision = TP / (TP + FP), Recall = TP / (TP + FN)
  • 41. Performance Evaluation Metrics • F1-score, a robust metric frequently used in deep learning, is the harmonic mean of precision and recall: F1(R, P) = 2RP / (R + P) • Top-2 error rate measures the fraction of test samples for which the true (actual) label is not among the top two classes predicted by the model. This is analogous to the top-1 and top-5 error rates used in the ImageNet LSVRC. • In traditional sorting, farmers or sorting personnel may categorize a fruit into the ripeness stage one class lower or higher than its correct label, because there is only a minimal difference in appearance between adjacent maturity or ripeness stages.
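  The sketch below illustrates how these metrics, including the top-2 error rate, can be computed; it uses scikit-learn for the standard scores and a small NumPy helper for top-k, with toy labels and probabilities that are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def top_k_error(y_true, y_prob, k=2):
    """Fraction of samples whose true class is NOT among the k most probable classes."""
    topk = np.argsort(y_prob, axis=1)[:, -k:]
    hits = np.any(topk == np.asarray(y_true)[:, None], axis=1)
    return 1.0 - hits.mean()

# Toy predictions over the six ripeness stages (values are illustrative only)
y_true = np.array([0, 2, 5, 3])
y_prob = np.random.dirichlet(np.ones(6), size=4)   # per-sample class probabilities
y_pred = y_prob.argmax(axis=1)

p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='macro', zero_division=0)
print(accuracy_score(y_true, y_pred), p, r, f1, top_k_error(y_true, y_prob, k=2))
```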
  • 43. Multimodal Deep Learning Framework Fig. 4.1. Multimodal deep learning framework for papaya fruit ripeness classification.
  • 44. Multimodality via Feature Concatenation (MDL-FC) In this subsection, the study (1) systematically implemented and evaluated a multimodal deep learning architecture, specifically deep CNNs, with respect to precision, recall, accuracy, top-2 error rate, F1-score, depth, and number of parameters; and (2) proposed a multimodal deep CNN that acquires attributes from two sensing systems, namely an RGB image and a hyperspectral reflectance data cube (a minimal two-branch sketch follows below). Figure 4.2. Multimodal deep learning via feature concatenation for non-destructive papaya fruit ripeness estimation.
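  The sketch below captures the feature-concatenation idea with the Keras functional API: one convolutional branch per modality, flattened features concatenated before a shared classifier head. The layer sizes and branch depths are illustrative and do not reproduce the exact MD-CNN configurations evaluated in this study.

```python
from tensorflow.keras import layers, models

def branch(inp, filters):
    """Small convolutional feature extractor for one modality."""
    x = layers.Conv2D(filters, 3, activation='relu', padding='same')(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(filters * 2, 3, activation='relu', padding='same')(x)
    x = layers.MaxPooling2D()(x)
    return layers.Flatten()(x)

rgb_in = layers.Input(shape=(32, 32, 3), name='rgb')
hs_in = layers.Input(shape=(32, 32, 150), name='hs')

# Early fusion: concatenate per-modality feature vectors before the classifier head
merged = layers.concatenate([branch(rgb_in, 16), branch(hs_in, 32)])
hidden = layers.Dense(128, activation='relu')(merged)
out = layers.Dense(6, activation='softmax')(hidden)   # six ripeness stages

model = models.Model([rgb_in, hs_in], out)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
```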
  • 46. A multimodal deep learning architecture of AlexNet (MD-AlexNet)
  • 48. Multimodality via Feature Concatenation (MDL-FC) Performance Results Figure 4.5. Performance results of the preliminary training of a 2L CNN using (a) RGB dataset only and (b) HS images dataset only.
  • 49. Multimodality via Feature Concatenation (MDL-FC) Performance Results Figure 4.6. Performance results of the preliminary training of a 2L CNN with RGB+HS dataset for (a) 100 and (b) 300 epochs.
  • 50. Multimodality via Feature Concatenation (MDL-FC) Performance Results
  • 51. Multimodality via Feature Concatenation (MDL-FC) Performance Results
  • 52. Multimodality via Feature Concatenation (MDL-FC) Performance Results
  • 53. Multimodality via Late Fusion (MDL-LF) In this subsection, (1) a multimodal deep learning via late fusion (MDL-LF) framework for non-destructive classification of papaya fruit maturity using HS and RGB images will be discussed, (2) the batch size (B), learning rate (lr), and number of epochs (E) parameter setup experiment will be inspected, and (3) the training and prediction performance of deep CNNs when integrated into the MDL-LF architecture will be examined. Fig. 4.8. Multimodal deep learning via late fusion for non-destructive papaya fruit ripeness estimation.
  • 54. Multimodality via Late Fusion (MDL-LF) Multinomial logistic regression (MLR) as meta-learner: • can produce, from multiple independent variables, the probabilities of belonging to multiple categories, • makes no assumptions of normality or linearity, and • can select features that significantly improve classification performance even when using hyperspectral data [63].
  • 55. Multimodality via Late Fusion (MDL-LF) Multinomial logistic regression (MLR) learns a set of K weight vectors and biases: p(C_k | θ) = y_k(θ) = exp(a_k) / Σ_j exp(a_j), where a_k = w_k^T θ + b_k is the activation for class k = 1, ..., K, w_k = (w_k1, w_k2, ..., w_kM)^T for M input variables, θ = (θ_1, θ_2, ..., θ_M)^T, and C_k represents the 1-of-K scheme for each class k. In the MDL-LF framework, the meta-learner activation combines the two modality streams: a_k = w_k^T x_r + w_k^T x_h + b_k. Fig. 4.9. Multinomial logistic regression implementation structure as meta-learner of the MDL-LF framework.
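  A sketch of this meta-learner is given below using scikit-learn's multinomial logistic regression over the concatenated base-learner outputs; the base-learner probabilities are simulated here, so the data and sizes are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated base-learner outputs: per-sample class probabilities from the
# RGB model (x_r) and the HS model (x_h) over the six ripeness stages.
n_samples, n_classes = 200, 6
x_r = np.random.dirichlet(np.ones(n_classes), size=n_samples)
x_h = np.random.dirichlet(np.ones(n_classes), size=n_samples)
y = np.random.randint(0, n_classes, size=n_samples)

# Softmax (multinomial logistic) regression over the concatenated streams,
# mirroring a_k = w_k^T x_r + w_k^T x_h + b_k from the slide above.
meta = LogisticRegression(max_iter=1000)
meta.fit(np.hstack([x_r, x_h]), y)
print(meta.predict_proba(np.hstack([x_r[:3], x_h[:3]])))
```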
  • 56. Multimodality via Late Fusion (MDL-LF) Parameter setting experiment
  • 57. Multimodality via Late Fusion (MDL-LF) Parameter Setting Experiment Results • Batch size • Number of Epochs • Learning Rate
  • 59. Multimodal Deep Learning via Late Fusion (MDL-LF)
  • 60. Multimodality via End-to-End Deep CNN In this subsection, (1) novel deep CNNs based on the MDL-LF framework using two modalities, HS and RGB images, will be explained, and (2) the training and prediction performance of these novel deep CNNs for non-destructive classification of papaya fruit ripeness will be examined. Fig 4.13. End-to-end deep CNN implementation of MDL-LF. (50 bands)
  • 61. Multimodality via End-to-End Deep CNN • Seven E2E deep CNNs were built for this experiment. • The blocking characteristic of the VGG architecture and its use of 3×3 kernels are adopted in these models; 2×2 kernels are also explored to allow increasing the network's depth (see the block sketch below). • An example graphical plot is shown on the right.
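  The sketch below shows how such VGG-style blocks of stacked small-kernel convolutions can be composed with either 3×3 or 2×2 kernels; the filter counts and block count are illustrative and are not the exact 3B5L or 6B5L configurations.

```python
from tensorflow.keras import layers

def conv_block(x, filters, n_convs=2, kernel=3):
    """VGG-style block: n_convs stacked small-kernel convolutions followed by pooling."""
    for _ in range(n_convs):
        x = layers.Conv2D(filters, kernel, activation='relu', padding='same')(x)
    return layers.MaxPooling2D()(x)

# Illustrative stack mixing 3x3 and 2x2 kernels on a hyperspectral input
inp = layers.Input(shape=(32, 32, 150))
x = conv_block(inp, 32, kernel=3)
x = conv_block(x, 64, kernel=2)
x = conv_block(x, 128, kernel=2)
```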
  • 62. Multimodality via End-to-End Deep CNN • Seven E2E deep CNNs were built for this experiment.
  • 63. Multimodality via End-to-End Deep CNN • Seven E2E deep CNNs were built for this experiment. • Performance graphs of the top-performing MD-CNNs: (Fig. 4.14.b) 3B5L MD-CNN (top) and (Fig. 4.14.d) 6B5L MD-CNN (2×2 kernel) (bottom). • Consistent training and validation loss patterns
  • 64. Multimodality via End-to-End Deep CNN • Seven E2E deep CNNs were built for this experiment. • Confusion matrices of the top-performing MD-CNNs: (Fig. 4.14.b) 3B5L MD-CNN (left) and (Fig. 4.14.d) 6B5L MD-CNN (2×2 kernel) (right). • Relatively high top-2 accuracy
  • 65. Multimodality via End-to-End Deep CNN Performance Results of the E2E Deep CNN
  • 68. Conclusion • Deep learning and machine learning are being applied in a variety of fields, including but not limited to precision and smart agriculture, nutrition, food safety, and quality assurance. • Estimating a fruit's maturity or ripeness level can help assess the vitamins, minerals, and other nutrients that change during ripening. • However, accurately classifying a papaya fruit according to the six-ripeness-stage standard remains a challenge, since most changes happen internally rather than in the external characteristics. • Using internal properties for classification would require destructive and time-consuming laboratory tests.
  • 69. Conclusion • With the emergence of deep learning and imaging technologies, high-dimensional data that correlate with the internal and external characteristics of an object, such as those produced by hyperspectral cameras, can be processed to perform a high-level intelligent classification task without impairing the fruit. • For papaya fruit maturity classification, this study introduced two main multimodal deep learning frameworks, implemented using feature concatenation and late fusion algorithms. • A hyperspectral imaging system and a digital imaging system were used to acquire hyperspectral data cubes and RGB images of papaya fruit samples, respectively, at six ripeness stages from unripe to overripe. This established a new multimodal raw data set, a major contribution of this study.
  • 70. Conclusion • Both the feature concatenation and late fusion approaches demonstrated outstanding results in papaya fruit ripeness classification. • Feature Concatenation: • MD-VGG16 achieved F1-score = 0.90 • Late Fusion: • VGG16-AlexNet-MLR obtained F1-score = 0.97 • 8L End-to-End MD-CNN produced F1-score = 0.97
  • 71. Conclusion • According to the results, the proposed methods are promising for this type of application. • However, there are still significant issues that the hyperspectral classification research community must address. Because of the high dimensionality of the data generated by hyperspectral imaging, the number of experiments that can be performed is limited by the available computational resources. • The cost of the imaging equipment is also a major issue, particularly when adopting the technology for agricultural purposes; low-cost hyperspectral imaging is now gaining attention in research and development.
  • 72. Conclusion • In general, multimodality through imaging systems combined with advanced deep CNN models can non-destructively classify fruit ripeness even at finer levels such as six ripeness stages. • The findings in this research demonstrated the potential of multimodal imaging systems and multimodal deep learning in real-time classification of fruit ripeness at the production site.
  • 73. Thank you. Stay safe and healthy. God bless us.

Editor's notes

  1. National Sun Yat-Sen University (國立中山大學)
  2. Cardinality – number of paths in a network
  3. This is the key advantage of deep CNN in multimodality implementation of this study.
  9. In traditional sorting, farmers or sorting personnel categorize the fruit to the ripeness stage that is either one class lower or higher than its correct ripeness label, because there is only a minimal difference in appearances between adjacent maturity or ripeness stages. Moreover, since this study involved trial and investigation, observing the models’ behavior by computing and recording the top-2 error rate should be implemented. This will lead to a greater understanding of the performance of multimodal deep learning algorithms in the agricultural sector.
  10. This architecture consists of 8 layers in total, 5 of which are convolutional layers, namely Conv1, Conv2, …, Conv5. Conv1 and Conv2 (the first two layers) are connected to max-pooling layers to extract the maximum number of features; the same applies to Conv4, while Conv5 is connected to the fully-connected layers. All outputs are passed through the ReLU non-linear activation function (a max function that sets negative values to 0). The final layer is a Softmax activation layer, which maps each output into probabilities that sum to 1.