SlideShare a Scribd company logo
1 of 121
Download to read offline
Deep Learning for Image/Video
Processing
Yu Huang
Sunnyvale, California
yu.huang07@gmail.com
Outline
• Image denoising
• Denoiser prior
• Image deconvolution
• Image/depth superesolution
• Image restoration
• DehazeNet
• Artifact reduction
• Image enhancement
• Edge aware filters
• Joint image processing
• DeepContour
• DeepEdge
• Holistically-nested edge
detection
• Boundary detection
• Inpainting
• Colorization
• Appendix: deep learning
Image Denoising by Conv. Nets
• Image denoising is a learning problem to training Conv. Net;
– Parameter estimation to minimize the reconstruction error.
• Online learning (rather than batch learning): stochastic gradient
– Gradient update from 6x6 patches sampled from 6 different training images
• Run like greedy layer-wise training for each layer.
Image Denoising by MLP
• Denoising as learning: map noisy patches to noise-free ones;
– Patch size 17x17;
• Training with different noise types and levels:
– Sigma=25; noise as Gaussian, stripe, salt-and-pepper, coding artifact;
• Feed-forward NN: MLP;
– input layer 289-d, four hidden layers (2047-d), output layer 289-d.
– input layer 169-d, four hidden layers (511-d), output layer 169-d.
• 40 million training images from LabelMe and Berkeley segmentation!
• 1000 testing images: Mcgill, Pascal VOC 2007;
• GPU: slower than BM3D, much faster than KSVD.
• Deep learning can help: unsupervised learning from unlabelled data.
Image Restoration by CNN
• Collect a dataset of clean/corrupted image pairs which are then used to train a
specialized form of convolutional neural network.
• Given a noisy image x, predict a clean image y close to the clean image y*
– the input kernels p1 = 16, the output kernel pL = 8.
– 2 hidden layers (i.e. L = 3), each with 512 units, the middle layer kernel p2 = 1.
– W1 512 kernels of size 16x16x3, W2 512 kernels of size 1x1x512, and W3 size 8x8x512.
• This learns how to map corrupted image patches to clean ones, implicitly
capturing the characteristic appearance of noise in natural images.
– Train the weights Wl and biases bl by minimizing the mean squared error
– Minimize with SGD
• Regarded as: first patchifying the input, applying a fully-connected neural network to each
patch, and averaging the resulting output patches.
Image Restoration by CNN
• Comparison.
Image Deconvolution with Deep CNN
– Establish the connection between traditional optimization-based schemes
and a CNN architecture;
– A separable structure is used as a reliable support for robust deconvolution
against artifacts;
– The deconvolution task can be approximated by a convolutional network
by nature, based on the kernel separability theorem;
– Kernel separability is achieved via SVD;
• An inverse kernel with length 100 is enough for plausible deconv. results;
– Image deconvolution convolutional neural network (DCNN);
• Two hidden layers: h1 is 38 large-scale 1-d kernels of size 121×1, and h2 is
381x121 convolution kernels to each in h1, output is 1×1×38 kernel;
• Random-weight initialization or from the separable kernel inversion;
– Concatenation of deconvolution CNN module with denoising CNN;
• called “Outlier-rejection Deconvolution CNN (ODCNN)”;
– 2 million sharp patches together with their blurred versions in training.
Image Deconvolution with Deep CNN
Learning Deep CNN Denoiser Prior for
Image Restoration
• With the aid of variable splitting techniques, denoiser prior can be plugged
in as a modular part of model-based optimization methods to solve other
inverse problems (e.g., deblurring).
• Such an integration induces considerable advantage when the denoiser is
obtained via discriminative learning.
• Train a set of fast and effective CNN denoisers and integrate them into
model-based optimization method to solve other inverse problems.
• Use Dilated Filter to enlarge Receptive Field.
• Use Batch Normaliz. and Residual Learning to accelerate training.
• Use training samples with small size to Help avoid boundary Artifacts.
• Learning specific denoiser model with small interval noise levels.
Learning Deep CNN Denoiser Prior for
Image Restoration
• It consists of 7 layers with 3 blocks, i.e., “Dilated Convolution + ReLU” block
in the 1st layer, 5 “Dilated Convolution + Batch Normalization + ReLU” blocks
in the middle layers, and “Dilated Convolution” block in the last layer.
• The dilation factors of (3×3) dilated convolutions from 1st layer to the last
layer are set to 1, 2, 3, 4, 3, 2 and 1, respectively.
The architecture of the CNN denoiser network
Pixel Recurrent Neural Networks
• A deep neural network that sequentially predicts the pixels in an image
along the two spatial dimensions.
• It models the discrete probability of the raw pixel values and encodes
the complete set of dependencies in the image.
• Fast 2-d recurrent layers and an effective use of residual connections in
deep recurrent networks.
Pixel Recurrent Neural Networks
input-to-state and state-to-state mappings
 Row LSTM is a unidirectional layer that processes the
image row by row from top to bottom computing
features for a whole row at once; the computation is
performed with a one-dimensional convolution.
 Diagonal BiLSTM is designed to both parallelize the
computation and to capture the entire available context
for any image size. Each of the two directions of the
layer scans the image in a diagonal fashion starting from
a corner at the top and reaching the opposite corner at
the bottom. Each step in the computation computes at
once the LSTM state along a diagonal in the image.
 PixelCNN uses multiple convolutional layers that
preserve the spatial resolution; pooling layers are not
used.
 Multi-Scale PixelRNN is composed of an unconditional
PixelRNN and one or more conditional PixelRNNs.
DehazeNet by CNN for Dehaze
DehazeNet conceptually consists of four sequential operations (feature
extraction, multi-scale mapping, local extremum and non-linear regression),
which is constructed by 3 convolution layers, a max-pooling, a Maxout unit
and a BReLU activation function.
Removing rain from single images via
a deep detail network
• Removing rain streaks from individual images based on
deep CNN.
• ResNet simplifies the learning process by changing the
mapping form, so directly reduce the mapping range from
input to output, which makes the learning process easier.
• A priori image domain knowledge by focusing on high
frequency detail during training, which removes BG
interference and focuses the model on the structure of rain
in images.
• It not only has benefits for high-level vision tasks but also
can be used to solve low level imaging problems.
Removing rain from single images via
a deep detail network
Removing rain from single images via
a deep detail network
The five network architectures for the rain
removal problem: Direct network, Neg-
mapping, ResNet, ResNet+Neg-mapping and
the deep detail network (from left to right).
Note: SSIM of (b)–(g) are 0.774,
0.490, 0.926, 0.936, 0.938 and
0.940, respectively.
Image Restoration Using Convolutional Auto-
encoders with Symmetric Skip Connections
• Image restoration, including image denoising, super resolution, inpainting, and so on;
• A deep fully convolutional auto-encoder network for image restoration, which is a
encoding-decoding framework with symmetric convolutional-deconvolutional layers.
• The convol. layers capture abstraction of images while eliminating corruptions.
• Deconvol. layers have the ability to upsample feature maps, recover image details.
• Symmetrically link convolutional and deconvolutional layers with skip-layer
connections, with which the training converges faster and attains better results.
• These skip connections allow the signal to be back-propagated to bottom layers
directly, and thus tackles the problem of gradient vanishing, making training deep
networks easier and achieving restoration performance gains consequently.
• They pass image details from convolutional layers to deconvolutional layers, which is
beneficial in recovering the clean image.
• Using the same framework, to train models on tasks of image denoising, super
resolution removing JPEG compression artifacts, non-blind image deblurring and
image inpainting.
Image Restoration Using Convolutional Auto-
encoders with Symmetric Skip Connections
The network contains layers of symmetric convolution (encoder) and deconvolution
(decoder). Skip shortcuts are connected every a few (for instance, two) layers from
convolutional feature maps to their mirrored deconvolutional feature maps. The
response from a convolutional layer is directly propagated to the corresponding
mirrored deconvolutional layer, both forwardly and backwardly.
FormResNet: Formatted Residual
Learning for Image Restoration
• A deep CNN to tackle the image restoration
problem by learning the structured residual.
• Image restoration by learning structured details
and recovering latent clean image, from the
shared info. btw corrupted and latent images.
• A residual formatting layer to format residual to
structured info., which allows to converge faster
and boosts the performance.
• A cross-level loss net to ensure both pixel-level
accuracy and semantic-level visual quality.
FormResNet: Formatted Residual
Learning for Image Restoration
(a) FormResNet: orange block represents the formatting layer; (b) cross-level loss net: incorporate pixel-wise
L2 norm, gradient consistency, and semantic high-level features, to better describe the similarity between
network inference and ground truth label; (c) RecursiveFormResNet: takes convol. layers as the formatting
layer in (a). It can be performed in a recursive fashion. ⊕ denotes pixel-wise subtraction/summation.
Deep Convolution Networks for
Compression Artifacts Reduction
• A compact and efficient network for seamless attenuation of
different compression artifacts.
• Accelerate the model by layer decomposition and joint use of large-
stride convolutional and deconvolutional layers.
• A more general CNN framework that has a close relationship with
the conventional Multi-Layer Perceptron (MLP).
• A deeper model can be effectively trained with features learned in a
shallow network.
• Transfer learning in low-level vision problems.
Deep Convolution Networks for
Compression Artifacts Reduction
There are two main modifications based on the AR-CNN. First, the layer
decomposition splits the original “feature enhancement” layer into a “shrinking”
layer and an “enhancement” layer. Then the large-stride convolutional and
deconvolutional layers significantly decrease the spatial size of the feature maps
of the middle layers. The overall shape of the framework is like an hourglass,
which is thick at the ends and thin in the middle.
Automatic Photo Adjustment Using Deep Learning
• Explore the use of deep learning in the context of photo editing;
• Introduce an image descriptor (pixel, context and global) that accounts for the local semantics
of an image.
Middle (from top to bottom):
input image, semantic label
map and the ground truth for
the Local Xpro effect;
Left and right: color mapping
scatter plots for four semantic
regions.
Automatic Photo Adjustment Using Deep Learning
The architecture of the DNN
Multi-scale spatial pooling schema
Pipeline for constructing the semantic label map
Automatic Photo Adjustment Using Deep Learning
Three Stylistic Local Effects:
1. Local Xpro,
2. Foreground Pop-Out,
3. Watercolor.
Deep Bilateral Learning for Real-Time
Image Enhancement
• Inspired by bilateral grid processing and local affine color transforms.
• Using pairs of input/output images, train a CNN to predict the coefficients
of a locally-affine model in bilateral space.
• Learn to make content-dependent decisions to approximate the desired
image transformation.
• The NN consumes a LR version of the input image, produces a set of affine
transformations in bilateral space, upsamples those transformations in an
edge-preserving fashion using a new slicing node, and then applies those
upsampled transformations to the full-resolution image.
Deep Bilateral Learning for Real-Time
Image Enhancement
Perform as much computation as possible at a low resolution, while still capturing high-frequency effects at full image resolution.
It consists of two distinct streams operating at different resolutions. The LR stream processes a downsampled version of the
input I through several conv. layers so as to estimate a bilateral grid of affine coefficients. This LR stream is further split in two
paths to learn both local and global features, which are fused before making the final prediction. The global and local paths share
a common set of low-level features. In turn, the HR stream performs a minimal yet critical amount of work: it learns a grayscale
guidance map used by our new slicing node to upsample the grid of affine coefficients back to full-resolution. These per-pixel
local affine transformations are then applied to the full-resolution input, which yields the final output.
Deep Edge aware filters
• To learn a big important family of edge-aware operators from
data.
• Based on a deep CNN with a gradient domain training
procedure, to approximate various filters without knowing the
original models.
• Enable fast approximation for complex edge-aware filters and
achieves up to 200x acceleration.
• Using spatially varying filter or filter combination.
FW(I) - a unified feed-forward process, I - input image,
F - network architecture, W - network parameters.
edge-aware filtering operators - L(I)
Deep Edge aware filters
A unified learning pipeline for various edge-aware filtering techniques.
Deep Edge aware filters
Deep Depth Super-Resolution : Learning
Depth Super-Resolution using Deep CNN
• Learn the mapping from a low resolution depth image to a high
resolution one in an end-to-end style.
• To better regularize the learned depth map, exploit the depth field
statistics and the local correlation btw depth image and color image.
• These priors are integrated in an energy minimization formulation,
where the deep NN learns the unary term, the depth field statistics
works as global model constraint and the color depth correlation is
utilized to enforce the local structure in depth images.
P extracts the gradients along X and Y directions.
The color modulated smoothness term
The total variation
Energy minimization
Deep Depth Super-Resolution : Learning
Depth Super-Resolution using Deep CNN
CNN gradually learns the high frequency components in depth images
Depth Map Super-Resolution
by Deep Multi-Scale Guidance
• Depth map super resolution in which a HR depth map is inferred from a
LR depth map and an additional HR intensity image of the same scene.
• Multi-Scale Guided convolutional network (MSG-Net) for depth map SR.
• MSG-Net complements LR depth features with HR intensity features
using a multi-scale fusion strategy.
• Such a multi-scale guidance allows the network to better adapt for
upsampling of both fine- and large-scale structures.
• Specifically, the rich hierarchical HR intensity features at different levels
progressively resolve ambiguity in depth map upsampling.
• A high-frequency domain training method to not only reduce training
time but also facilitate the fusion of depth and intensity features.
Depth Map Super-Resolution
by Deep Multi-Scale Guidance
Depth Map Super-Resolution
by Deep Multi-Scale Guidance
Depth Map Super-Resolution
by Deep Multi-Scale Guidance
The network architecture of MS-Net for single-image super resolution.
Depth Map Super-Resolution
by Deep Multi-Scale Guidance
(a) Color image. (b) Ground truth. (c) LR by 8. (d) SRCNN (e) MSG-Net
Accelerating the Super-Resolution
Convolutional Neural Network
• A compact hourglass-shape CNN structure for faster and better Super-
Resolution Convolutional Neural Network (SRCNN).
• Introduce a deconvolution layer, then the mapping is learned directly from
the original LR image (without interpolation) to the HR one.
• Reformulate the mapping layer by shrinking the input feature dimension
before mapping and expanding back afterwards.
• Adopt smaller filter sizes but more mapping layers.
Accelerating the Super-Resolution
Convolutional Neural Network
The FSRCNN consists of convolution layers and a deconvolution layer.
The convolution layers can be shared for different upscaling factors. A
specific deconvolution layer is trained for different upscaling factors.
Deep Laplacian Pyramid Networks for Fast
and Accurate Super-Resolution
• Laplacian Pyramid Super-Resolution Network (LapSRN) to
progressively reconstruct sub-band residuals of HR images.
• At each pyramid level, the model takes coarse-resolution
feature maps as input, predicts the HF residuals, and uses
transposed convolutions for upsampling to the finer level.
• Not require bicubic interpolation as the pre-processing step.
• Train the LapSRN with deep supervision using a Charbonnier
loss function and achieve high-quality reconstruction.
• The network generates multi-scale predictions in one feed-
forward pass through the progressive reconstruction,
thereby facilitates resource-aware applications.
Deep Laplacian Pyramid Networks for Fast
and Accurate Super-Resolution
Photo-Realistic Single Image Super-Resolution Using a
Generative Adversarial Network
• SRGAN, a generative adversarial network (GAN) for image superresolution (SR).
• Capable of inferring photo-realistic natural images for 4 upscaling factors.
• A perceptual loss function which consists of an adversarial loss and a content loss.
– The adversarial loss pushes the solution to the natural image manifold using a
discriminator network that is trained to differentiate between the super-resolved images
and original photo-realistic images.
– A content loss motivated by perceptual similarity instead of similarity in pixel space.
• The deep residual network is able to recover photo-realistic textures from heavily
downsampled images on public benchmarks.
Photo-Realistic Single Image Super-Resolution Using a
Generative Adversarial Network
Photo-Realistic Single Image Super-Resolution Using a
Generative Adversarial Network
Deep Joint Image Filtering
• Learning-based to construct a joint filter based on CNN.
• In contrast to considering only the guidance image, it selectively
transfers salient structures that are consistent in both guidance
and target images.
– The sub-networks CNNT and CNNG aim to extract informative feature
responses from the target and guidance images, respectively.
– These responses are concatenated together as input for network CNNF.
– Finally, model CNNF reconstructs the desired output by selectively
transferring main structures while suppressing inconsistent structures.
Deep Joint Image Filtering
The model consists of three major components
Given M training image samples
minimizing the summed squared loss
Deep Joint Image Filtering
Joint depth upsampling (8×) using different network architectures f1-f2-...
where fi is the filter size of the i-th layer. (a) GT depth map (inset: Guidance).
(b) Bicubic upsampling. (c)-(e) using CNNF. (f) using CNNT + CNNG + CNNF.
• Integration from multiple scales and semantic levels via multi-streams of
interlinked, layered, non-linear “deep” processing;
– Deep belief net with a variant of the mean-and-covariance RBM;
• Unsupervised feature learning;
– Supervised boundary prediction by feed forward NN.
Deep Neural Prediction Network for Visual Boundary
Deep Neural Prediction Network for Visual Boundary
DeepContour: A Deep Convolutional Feature Learned
by Positive-sharing Loss for Contour Detection
CNN structure: explicitly visualizing the dimensions of each network layers.
• Contour detection accuracy can be improved by instead making the use of the
deep features learned from CNNs.
• Customize the training strategy by partitioning contour (positive) data into
subclasses and fitting each subclass by different model parameters.
• A new loss function, named positive-sharing loss, in which each subclass shares
the loss for the whole positive class to learn the parameters
• It introduces an extra regularizer to emphasizes the losses for the positive and
negative classes, which facilitates to explore more discriminative features.
DeepEdge: A Multi-Scale Bifurcated Deep Network for
Top-Down Contour Detection
• Run the Canny edge detector to get candidate contour points.
• Around each candidate point, extract patches at four different scales and
simultaneously run them through the five convolutional layers of the KNet.
• Connect these convolutional layers to two separately-trained network branches.
• The first branch is trained for classification, the second is trained as a regressor.
• Outputs from these two sub-networks are averaged to produce the final score.
Holistically-Nested Edge Detection
• An edge detection algorithm that addresses two important issues:
(1) holistic image training and prediction; and (2) multi-scale and
multi-level feature learning.
• Holistically-nested edge detection (HED), performs image-to-image
prediction by means of a deep learning model that leverages fully
CNNs and deeply-supervised nets.
• HED automatically learns rich hierarchical representations to
resolve the challenging ambiguity in edge/boundary detection.
(a) multi-stream architecture; (b) skip-layer net architecture; (c) single model on multi-scale inputs; (d)
separate training of networks; (e) holistically-nested architectures.
Holistically-Nested Edge Detection
The receptive field and stride size in
VGGNet used in HED.
Deep supervision with side output layers to produce
multi-scale dense predictions. Left: the side outputs
become progressively coarser and more “global”, while
critical object boundaries are preserved. Right: the
predictions tends to lack any discernible order (e.g. in
layers 1 and 2), and many boundaries are lost in later
stages.
BOUNDARY DETECTION USING DEEP LEARNING
A image is processed at 3 different scales in order to obtain multi-scale information. The 3 scales are
fused and sent as input to the NCuts, that delivers eigenvectors and the resulting ‘Spectral
Boundaries’. The latter are fused with the original boundary map, non-maximum suppressed, and
optionally thresholded.
BOUNDARY DETECTION USING DEEP LEARNING
Network architecture for multi-resolution HED training: 3 differently scaled versions of the input image are
provided as inputs to 3 FCNN networks that share weights - their multi-resolution outputs are fused in a late
fusion stage, extending DSN to multi-resolution training.
• A deep learning algorithm for contour detection with a fully
convolutional encoder-decoder network;
• Different from previous low-level edge detection, focuses on detecting
higher-level object contours.
• Trained e2e with refined ground truth from inaccurate polygon
annotations, yielding much higher precision in object contour
detection;
• Learned model generalizes well to unseen object classes from the
same super-categories on MS COCO and can match state-of-the-art
edge detection on BSDS500 with fine-tuning.
• By combining with the multiscale combinatorial grouping algorithm,
generate high-quality segmented object proposals, which significantly
advance the state-of-the-art with a relatively small amount of
candidates.
Object Contour Detection with a Full Conv.
Encoder-Decoder Network
Object Contour Detection with a Full Conv.
Encoder-Decoder Network
Architecture of the fully convolutional encoder-decoder network
Context Encoders: Feature Learning by Inpainting
• Unsupervised feature learning driven by context-based pixel prediction.
• By analogy with auto-encoders, Context Encoders – a convolutional neural
network trained to generate the contents of an arbitrary image region
conditioned on its surroundings.
• Context encoders need to both understand the content of the entire
image, as well as produce a plausible hypothesis for the missing part(s).
• When training context encoders, both a standard pixel-wise
reconstruction loss, as well as a reconstruction plus an adversarial loss.
• A context encoder learns a representation that captures not just
appearance but also the semantics of visual structures.
Context Encoders: Feature Learning by Inpainting
(a) Context encoder trained with joint reconstruction and adversarial loss for semantic inpainting.
Context Encoders: Feature Learning by Inpainting
(b) Context encoder trained with reconstruction loss for feature
learning by filling in arbitrary region dropouts in the input.
High-Resolution Image Inpainting using
Multi-Scale Neural Patch Synthesis
• A multi-scale neural patch synthesis approach based on joint optimization
of image content and texture constraints, which not only preserves
contextual structures but also produces high-frequency details by
matching and adapting patches with the most similar mid-layer feature
correlations of a deep classification network.
Solve for an unknown image x using two loss functions, the holistic
content loss (Ec) and the local texture loss (Et).
High-Resolution Image Inpainting using
Multi-Scale Neural Patch Synthesis
The network architecture for structured content prediction. Unlike the L2 loss
architecture, replace all ReLU/ReLU leaky layers with the ELU layer and adopted
fully-connected layers instead of channel-wise fully-connected layers. The ELU unit
makes the regression network training more stable than the ReLU leaky layers as it
can handle large negative responses during the training process.
Semantic Image Inpainting with Deep
Generative Models
• It generates the missing content by conditioning on the
available data.
• Given a trained generative model, search for the closest
encoding of the corrupted image in the latent image
manifold using context and prior losses.
• This encoding is then passed through the generative model
to infer the missing content.
• Inference is possible irrespective of how the missing
content is structured, while the SoA learning based method
requires specific info. about the holes in the training phase.
• It successfully predicts info. in large missing regions and
achieves pixel-level photorealism.
Semantic Image Inpainting with Deep
Generative Models
Framework for inpainting. (a) Given a GAN model trained on real images, iteratively
update z to find the closest mapping on the latent image manifold, based on the
designed loss functions. (b) Manifold traversing when iteratively updating z using BP. z
(0) is random initialed; z (k) denotes the result in k-th iteration; and zˆ the final solution.
Semantic Image Inpainting with Deep
Generative Models
CE: Contextual Encoder, GAN: Generative Adversarial Network
Globally and Locally Consistent Image
Completion
• With a FCN, complete images of arbitrary resolutions by filling-
in missing regions of any shape.
• To train this image completion network to be consistent, use
global and local context discriminators that are trained to
distinguish real images from completed ones.
• The global discriminator looks at the entire image to assess if it
is coherent as a whole, while the local discriminator looks only
at a small area centered at the completed region to ensure the
local consistency of the generated patches.
• The network is trained to fool both context discriminator
networks, which requires it to generate images that are
indistinguishable from real ones with regard to overall
consistency as well as in details.
Globally and Locally Consistent Image
Completion
Overview of learning image completion. It consists of a completion network and two auxiliary
context discriminator networks that are used only for training the completion network. The global
discriminator network takes the entire image as input, while the local discriminator network takes
only a small region around the completed area as input. Both discriminator networks are trained to
determine if an image is real or completed by the completion network, while the completion
network is trained to fool both discriminator networks.
Globally and Locally Consistent Image
Completion
Architecture of the image
completion network.
Architectures of the discriminators used in the model.
Globally and Locally Consistent Image
Completion
Generative Face Completion
• Face completion using a deep generative model, a more challenging problem;
• To generate semantically new pixels for the missing key components (e.g., eyes and
mouths) that contain large appearance variations.
• Directly generates contents for missing regions based on a neural network.
• The model is trained with a combination of a reconstruction loss, two adversarial
losses and a semantic parsing loss, which ensures pixel faithfulness and local-global
contents consistency.
Generative Face Completion
Network architecture. It consists of one generator, two discriminators and a parsing
network. The generator takes the masked image as input and outputs the generated
image. Two discriminators are learned to distinguish the synthesize contents in the mask
and whole generated image as real and fake. The parsing network, which is a pretrained
model and remains fixed, is to further ensure the new generated contents more photo-
realistic and encourage consistency between new and old pixels. Note that only the
generator is needed during the testing.
Convolutional Neural Pyramid for
Image Processing
• A principled convolutional neural pyramid (CNP)
framework for general low-level vision and image
processing tasks.
• The pyramid structure can greatly enlarge the field
while not sacrificing computation efficiency.
• Adaptive network depth and progressive upsampling
for quasi-real-time testing on VGA-size input.
• A broad set of applications, i.e. depth/RGB image
restoration, completion, noise/artifact removal, edge
refinement, image filtering, enhancement and
colorization.
Convolutional Neural Pyramid for
Image Processing
Illustration of convol. neural pyramid. (a) shows the convol. pyramid structure. (b) and (c)
are the feature extraction and mapping components respectively. Conv(x, y) denotes the
convolution operation, where x is the kernel size and y is the number of output.
Convolutional Neural Pyramid for
Image Processing
Learning Recursive Filters for Low-Level
Vision via a Hybrid Neural Network
• Low-level vision problems (e.g., edge-preserving filtering and
denoising) as recursive image filtering via a hybrid neural network.
• The network contains several spatially variant RNNs as equivalents
of a group of distinct recursive filters for each pixel, and a deep CNN
that learns the weights of RNNs.
• The deep CNN can learn regulations of recurrent propagation for
various tasks and effectively guides recurrent propagation over an
entire image.
• The model does not need a large number of convolutional channels
nor big kernels to learn features for low-level vision filters.
• It is significantly smaller and faster in comparison with a deep CNN
based image filter.
Learning Recursive Filters for Low-Level
Vision via a Hybrid Neural Network
An illustrative example of the Hybrid NN for edge-preserving image smoothing with a single RNN.
Learning Recursive Filters for Low-Level
Vision via a Hybrid Neural Network
The hybrid network that contains a group of RNNs to filter/restore an image and a
deep CNN to learn to propagate the RNNs. The process of filtering/restoration is
carried out via RNNs with two inputs and one output result. Both parts are trained
jointly in an end-to-end fashion.
Colorful Image Colorization
• A fully automatic approach that produces vibrant and realistic colorizations.
• A classification task and use class-rebalancing at training time to increase
the diversity of colors in the result.
• Colorization can be a powerful pretext task for self-supervised feature
learning, acting as a cross-channel encoder.
The network architecture. Each conv layer refers to a block of 2 or 3 repeated conv and
ReLU layers, followed by a BatchNorm layer. The net has no pool layers. All changes in
resolution are achieved through spatial downsampling or upsampling btw conv blocks.
Colorful Image Colorization
Classification loss with rebalancing produces more accurate and vibrant results
than a regression loss or a classification loss without rebalancing.
Appendix
Deep Learning
Deep Learning
• Representation learning attempts to automatically learn good features or
representations;
• Deep learning algorithms attempt to learn multiple levels of representation
of increasing complexity/abstraction (intermediate and high level features);
• Become effective via unsupervised pre-training + supervised fine tuning;
– Deep networks trained with back propagation (without unsupervised pre-
training) perform worse than shallow networks.
• Deal with the curse of dimensionality (smoothing & sparsity) and over-
fitting (unsupervised, regularizer);
• Semi-supervised: structure of manifold assumption;
– labeled data is scarce and unlabeled data is abundant.
Why Deep Learning?
• Supervised training of deep models (e.g. many-layered Nets) is too hard
(optimization problem);
• Learn prior from unlabeled data;
• Shallow models are not for learning high-level abstractions;
• Ensembles or forests do not learn features first;
• Graphical models could be deep net, but mostly not.
• Unsupervised learning could be “local-learning”;
• Resemble boosting with each layer being like a weak learner
• Learning is weak in directed graphical models with many hidden variables;
• Sparsity and regularizer.
• Traditional unsupervised learning methods aren’t easy to learn multiple
levels of representation.
• Layer-wised unsupervised learning is the solution.
• Multi-task learning (transfer learning and self taught learning);
• Other issues: scalability & parallelism with the burden from big data.
Multi Layer Neural Network
• A neural network = running several logistic regressions at the same time;
– Neuron=logistic regression or…
• Calculate error derivatives (gradients) to refine: back propagate the error
derivative through model (the chain rule)
– Online learning: stochastic/incremental gradient descent;
– Batch learning: conjugate gradient descent.
Convolutional Neural Networks (CNN)
• CNN is a special kind of multi-layer NNs applied to 2-d arrays (usually
images), based on spatially localized neural input;
– local receptive fields(shifted window), shared weights (weight averaging) across
the hidden units, and often, spatial or temporal sub-sampling;
– Related to generative MRF/discriminative CRF:
• CNN=Field of Experts MRF=ML inference in CRF;
– Generate ‘patterns of patterns’ for pattern recognition.
• Each layer combines (merge, smooth) patches from previous layers
– Pooling /Sampling (e.g., max or average) filter: compress and smooth the data.
– Convolution filters: (translation invariance) unsupervised;
– Local contrast normalization: increase sparsity, improve optimization/invariance.
C layers
convolutions,
S layers pool/sample
Convolutional Neural Networks (CNN)
• Convolutional Networks are trainable multistage architectures composed of multiple stages;
• Input and output of each stage are sets of arrays called feature maps;
• At output, each feature map represents a particular feature extracted at all locations on input;
• Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer;
• A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module;
– A fully connected layer: softmax transfer function for posterior distribution.
• Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature
map;
• Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function;
– In rectified function, gi is a trainable gain parameter, might be followed a contrast normalization N;
• Feature pooling: treats each feature map separately -> a reduced-resolution output feature
map;
• Supervised training is performed using a form of SGD to minimize the prediction error;
– Gradients are computed with the back-propagation method.
• Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-
tuning.
* is discrete convolution operator
LeNet (LeNet-5)
• A layered model composed of convolution and subsampling operations
followed by a holistic representation and ultimately a classifier for
handwritten digits;
• Local receptive fields (5x5) with local connections;
• Output via a RBF function, one for each class, with 84 inputs each;
• Learning by Graph Transformer Networks (GTN);
AlexNet
• A layered model composed of convol., subsample., followed
by a holistic representation and all-in-all a landmark classifier;
• Consists of 5 convolutional layers, some of which followed by
max-pooling layers, 3 fully-connected layers with a final 1000-
way softmax;
• Fully-connected “FULL” layers: linear classifiers/matrix
multiplications;
• ReLU are rectified-linear nonlinearities on layer output, can be
trained several times faster;
• Local normalization scheme aids generalization;
• Overlapping pooling slightly less prone to overfitting;
• Data augmentation: artificially enlarge the dataset using label-
preserving transformations;
• Dropout: setting to zero output of each hidden neuron with
prob. 0.5;
• Trained by SGD with batch # 128, momentum 0.9, weight
decay 0.0005.
The network’s input is 150,528-dimensional, and the number of neurons in the network’s
remaining layers is given by 253,440–186,624–64,896–64,896–43,264-4096–4096–1000.
AlexNet
MattNet
• Matthew Zeiler from the startup company “Clarifai”, winner of ImageNet Classification in 2013;
• Preprocessing: subtracting a per-pixel mean;
• Data augmentation: downsampled to 256 pixels and a random 224 pixel crop is taken out of
the image and randomly flipped horizontally to provide more views of each example;
• SGD with min-batch # 128, learning rate annealing, momentum 0.9 and dropout to prevent
overfitting;
• 65M parameters trained for 12 days on a single Nvidia GPU;
• Visualization by layered DeconvNets: project the feature activations back to the input pixel
space;
– Reveal input stimuli exciting individual feature maps at any layer;
– Observe evolution of features during training;
– Sensitivity analysis of the classifier output by occluding portions to reveal which parts of scenes are
important;
• DeconvNet attached to each of ConvNet layer, unpooling uses locations of maxima to preserve
structure;
• Multiple such models were averaged together to further boost performance;
• Supervised pre-training with AlexNet, then modify it to get better performance (error rate
14.8%).
Architecture of an eight layer ConvNet model. Input: 224 by 224 crop of an image (with 3
color planes). # 1-5 layers Convolution: 96 filters, 7x7, stride of 2 in both x and y. Feature
maps: (i) via a rectified linear function, (ii) 3x3 max pooled (stride 2), (iii) contrast normalized
55x55 feature maps. # 6-7 layers: fully connected, input in vector form (6x6x256 = 9216
dimensions). The final layer: a C-way softmax function, C - number of classes.
MattNet
Top: A deconvnet layer (left) attached to
a convnet layer (right). The deconvnet
will reconstruct approximate version of
convnet features from the layer
beneath.
Bottom: Unpooling operation in the
deconvnet, using switches which record
the location of the local max in each
pooling region (colored zones) during
pooling in the convnet.
MattNet
Deep Belief Networks
• A hybrid model: can be trained as generative or
discriminative model;
• Deep architecture: multiple layers (learn features
layer by layer);
• Multi layer learning is difficult in sigmoid belief
networks.
• Top two layers are undirected connections,
Restricted Boltzmann Machine (RBM);
• Lower layers get top down directed
connections from layers above;
• Unsupervised or self-taught pre-learning provides
a good initialization;
• Greedy layer-wise unsupervised training for
RBM;
• Supervised fine-tuning
• Generative: wake-sleep algorithm (Up-down);
• Discriminative: back propagation (bottom-up);
Belief net is directed acyclic graph
composed of stochastic variables.
Deep Boltzmann Machine
• Boltzmann machine is a stochastic recurrent model, and RBM is its special case (one
hidden layer);
• Learning internal representations that become increasingly complex;
• High-level representations built from a large supply of unlabeled inputs;
• Pre-training: learning a stack of modified RBMs, which are composed to create a deep
Boltzmann machine (undirected graph);
• Generative fine-tuning: different from DBN
• Positive and negative phase
• Discriminative fine-tuning: the same to DBN
• Back propagation.
Stacked Denoising Auto-Encoder
• Denoising Auto-Encoder: Multilayer NNs with target output=input;
• Auto-encoder learns the salient variation like a nonlinear PCA;
• Stack many (may be sparse) auto-encoders in succession and train them using greedy
layer-wise unsupervised learning
• Drop the decode layer each time
• Performs better than stacking RBMs;
• Supervised training on the last layer using final features;
• (option) Supervised training on the entire network to fine- tune all weights of the
neural net;
• Empirically not quite as accurate as DBNs.
Stochastic Gradient Descent (SGD)
• The general class of estimators that arise as minimizers of sums are
called M-estimators;
• Where are stationary points of the likelihood function (or zeroes of its
derivative, the score function)?
• Online gradient descent samples a subset of summand functions at every
step;
• The true gradient of is approximated by a gradient at a single example;
• Shuffling of training set at each pass.
• There is a compromise between two forms, often called "mini-batches",
where the true gradient is approximated by a sum over a small number of
training examples.
• STD converges almost surely to a global minimum when the objective
function is convex or pseudo-convex, and otherwise converges almost
surely to a local minimum.
Back Propagation
E (f(x0,w),y0) = -log (f(x0,w)- y0).
Loss function
• Euclidean loss is used for regressing to real-valued lables [-inf,inf];
• Sigmoid cross-entropy loss is used for predicting K independent probability
values in [0,1];
• Softmax (normalized exponential) loss is predicting a single class of K mutually
exclusive classes;
– Generalization of the logistic function that "squashes" a K-dimensional vector
of arbitrary real values z to a K-dimensional vector of real values σ(z) in the
range (0, 1).
– The predicted probability for the j'th class given a sample vector x is
• Sigmoidal or Softmax normalization is a way of reducing the influence of
extreme values or outliers in the data without removing them from the
dataset.
Variable Learning Rate
• Too large learning rate
– cause oscillation in searching for the minimal point
• Too slow learning rate
– too slow convergence to the minimal point
• Adaptive learning rate
– At the beginning, the learning rate can be large when the current point is
far from the optimal point;
– Gradually, the learning rate will decay as time goes by.
• Should not be too large or too small:
– annealing rate 𝛼(𝑡)=𝛼(0)/(1+𝑡/𝑇)
– 𝛼(𝑡) will eventually go to zero, but at the beginning it is almost a
constant.
Variable Momentum
AdaGrad/AdaDelta
Data Augmentation for Overfitting
• The easiest and most common method to reduce
overfitting on image data is to artificially enlarge the
dataset using label-preserving transformations;
• Perturbing an image I by transformations that leave
the underlying class unchanged (e.g. cropping and
flipping) in order to generate additional examples of
the class;
• Two distinct forms of data augmentation:
– image translation
– horizontal reflections
– changing RGB intensities
Weight Decay for Overfitting
• Weight decay or L2 regularization adds a penalty term to the error function, a term
called the regularization term: the negative log prior in Bayesian justification,
– Weight decay works as rescaling weights in the learning rule, but bias learning still the
same;
– Prefer to learn small weights, and large weights allowed if improving the original cost
function;
– A way of compromising btw finding small weights and minimizing the original cost
function;
• In a linear model, weight decay is equivalent to ridge (Tikhonov) regression;
• L1 regularization: the weights not really useful shrink by a constant amount
toward zero;
– Act like a form of feature selection;
– Make the input filters cleaner and easier to interpret;
• L2 regularization penalizes large values strongly while L1 regularization ;
• Markov Chain Monte Carlo (MCMC): simulating a Markov chain whose equilibrium
distr. is the posterior distribution for weights & hyper-parameters;
• Hybrid Monte Carlo: gradient and sampling.
Early Stopping for Overfitting
• Steps in early stopping:
– Divide the available data into training and validation sets.
– Use a large number of hidden units.
– Use very small random initial values.
– Use a slow learning rate.
– Compute the validation error rate periodically during training.
– Stop training when the validation error rate "starts to go up".
• Early stopping has several advantages:
– It is fast.
– It can be applied successfully to networks in which the number of weights far exceeds
the sample size.
– It requires only one major decision by the user: what proportion of validation cases to
use.
• Practical issues in early stopping:
– How many cases do you assign to the training and validation sets?
– Do you split the data into training and validation sets randomly or by some systematic
algorithm?
– How do you tell when the validation error rate "starts to go up"?
Dropout and Maxout for Overfitting
• Dropout: set the output of each hidden neuron to zero w.p. 0.5.
– Motivation: Combining many different models that share parameters
succeeds in reducing test errors by approximately averaging together the
predictions, which resembles the bagging.
– The units which are “dropped out” in this way do not contribute to the
forward pass and do not participate in back propagation.
– So every time an input is presented, the NN samples a different architecture,
but all these architectures share weights.
– This technique reduces complex co-adaptations of units, since a neuron
cannot rely on the presence of particular other units.
– It is, therefore, forced to learn more robust features that are useful in
conjunction with many different random subsets of the other units.
– Without dropout, the network exhibits substantial overfitting.
– Dropout roughly doubles the number of iterations required to converge.
• Maxout takes the maximum across multiple feature maps;
MCMC Sampling for Optimization
• Markov Chain: a stochastic process in which future states are independent of
past states but the present state.
– Markov chain will typically converge to a stable distribution.
• Monte Carlo Markov Chain: sampling using ‘local’ information
– Devise a Markov chain whose stationary distribution is the target.
• Ergodic MC must be aperiodic, irreducible, and positive recurrent.
– Monte Carlo Integration to get quantities of interest.
• Metropolis-Hastings method: sampling from a target distribution
– Create a Markov chain whose transition matrix does not depend on the normalization term.
– Make sure the chain has a stationary distr. and it is equal to the target distr. (accept ratio).
– After sufficient number of iterations, the chain will converge the stationary distribution.
• Gibbs sampling is a special case of M-H Sampling.
– The Hammersley-Clifford Theorem: get the joint distr. from the complete conditional distr.
• Hybrid Monte Carlo: gradient sub step for each Markov chain.
Mean Field for Optimization
• Variational approximation modifies the optimization problem to
be tractable, at the price of approximate solution;
• Mean Field replaces M with a (simple) subset M(F), on which A*
(μ) is a closed form (Note: F is disconnected graph);
– Density becomes factorized product distribution in this sub-family.
– Objective: K-L divergence.
• Mean field is a structured variation approximation approach:
– Coordinate ascent (deterministic);
• Compared with stochastic approximation (sampling):
– Faster, but maybe not exact.
Contrastive Divergence for RBMs
• Contrastive divergence (CD) is proposed for training PoE first, also being
a quicker way to learn RBMs;
– Contrastive divergence as the new objective;
– Taking gradients and ignoring a term which is usually very small.
• Steps:
– Start with a training vector on the visible units.
– Then alternate between updating all the hidden units in parallel and
updating all the visible units in parallel.
• Can be applied using any MCMC algorithm to simulate the model (not
limited to just Gibbs sampling);
• CD learning is biased: not work as gradient descent
• Improved: Persistent CD explores more modes in the distribution
– Rather than from data samples, begin sampling from the mode samples,
obtained from the last gradient update.
– Still suffer from divergence of likelihood due to missing the modes.
• Score matching: the score function does not depend on its normal.
factor. So, match it b.t.w. the model with the empirical density.
“Wake-Sleep” Algorithm for DBN
• Pre-trained DBN is a generative model;
• Do a stochastic bottom-up pass (wake phase)
– Get samples from factorial distribution (visible first, then generate hidden);
– Adjust the top-down weights to be good at reconstructing the feature activities
in the layer below.
• Do a few iterations of sampling in the top level RBM
– Adjust the weights in the top-level RBM.
• Do a stochastic top-down pass (sleep phase)
– Get visible and hidden samples generated by generative model using data
coming from nowhere!
– Adjust the bottom-up weights to be good at reconstructing the feature
activities in the layer above.
– Any guarantee for improvement? No!
• The “Wake-Sleep” algorithm is trying to describe the
representation economical (Shannon’s coding theory).
Greedy Layer-Wise Training
• Deep networks tend to have more local minima problems than shallow
networks during supervised training
• Train first layer using unlabeled data
– Supervised or semi-supervised: use more unlabeled data.
• Freeze the first layer parameters and train the second layer
• Repeat this for as many layers as desire
– Build more robust features
• Use the outputs of the final layer to train the last supervised layer (leave
early weights frozen)
• Fine tune the full network with a supervised approach;
• Avoid problems to train a deep net in a supervised fashion.
– Each layer gets full learning
– Help with ineffective early layer learning
– Help with deep network local minima
Why Greedy Layer-Wise Training Works?
• Take advantage of the unlabeled data;
• Regularization Hypothesis
– Pre-training is “constraining” parameters in a region
relevant to unsupervised dataset;
– Better generalization (representations that better
describe unlabeled data are more discriminative for
labeled data) ;
• Optimization Hypothesis
– Unsupervised training initializes lower level parameters
near localities of better minima than random
initialization can.
• Only need fine tuning in the supervised learning stage.
Generative Modeling
• Have training examples x ~ pdata(x )
• Want a model that draw samples: x ~ pmodel(x )
• Where pmodel ≈ pdata
• Conditional generative models
– Speech synthesis: Text ⇒ Speech
– Machine Translation: French ⇒ English
• French: Si mon tonton tond ton tonton, ton tonton sera tondu.
• English: If my uncle shaves your uncle, your uncle will be shaved
– Image ⇒ Image segmentation
• Environment simulator
– Reinforcement learning
– Planning
• Leverage unlabeled data
x ~ pdata(x )
x ~ pmodel(x )
Adversarial Nets Framework
• A game between two
players:
– 1. Discriminator D
– 2. Generator G
• D tries to discriminate
between:
– A sample from the data
distribution.
– And a sample from the
generator G.
• G tries to “trick” D by
generating samples that
are hard for D to
distinguish from data.
GANs
• A framework for estimating generative
models via an adversarial process, to train 2
models: a generative model G that captures
the data distribution, and a discriminative
model D that estimates the probability that a
sample came from the training data rather
than G.
• The training procedure for G is to maximize
the probability of D making a mistake.
• This framework corresponds to a minimax
two-player game:
– In the space of arbitrary functions G and D, a
unique solution exists, with G recovering training
data distribution and D equal to 1/2 everywhere;
– In the case where G and D are defined by
multilayer perceptrons, the entire system can be
trained with BP.
– There is no need for any Markov chains or
unrolled approximate inference networks during
either training or generation of samples.
GANs
GANs
GANs
GANs
Rightmost column shows the nearest training example of the neighboring sample, in order to
demonstrate that the model has not memorized the training set. Samples are fair random
draws, not cherry-picked. Unlike most other visualizations of deep generative models, these
images show actual samples from the model distributions, not conditional means given
samples of hidden units. Moreover, these samples are uncorrelated because the sampling
process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully
connected model) d) CIFAR-10 (convolutional discriminator and “deconvolutional” generator).
How to Train a GAN? Tips and Tricks
• 1. Normalize the inputs
• 2: A modified loss function
• 3: Use a spherical Z (not uniform, but
Gaussian distribution)
• 4: Batch Norm
• 5: Avoid Sparse Gradients:
– ReLU, MaxPool
• 6: Use Soft and Noisy Labels
• 7: DCGAN / Hybrid Models
– KL + GAN or VAE + GAN
• 8: Use stability tricks from RL
• 9: Use the ADAM Optimizer for
generator (SGD for discriminator)
• 10: Track failures early
– check norms of gradients
• 11: Dont balance loss via statistics
(unless you have a good reason to)
• 12: If you have labels, use them
– Auxillary GANs
• 13: Add noise to inputs, decay over
time
• 14: [not sure] Train discriminator
more (sometimes) especially have
noise
• 15: [not sure] Batch Discrimination
• 16: Discrete variables in C-GANs
• 17: Dropouts in G in both train/test
stage
Improved Techniques for Training GANs
• For semi-supervised learning in generation of images that humans find visually
realistic;
• Techniques that are heuristically motivated to encourage convergence:
– Feature matching addresses the instability of GANs by specifying a new objective for
the generator that prevents it from overtraining on the current discriminator;
– Allow the discriminator to look at multiple data examples in combination, and
perform what is called “Min-batch discrimination”: any discriminator model that
looks at multiple examples in combination, rather than in isolation, could potentially
help avoid collapse of the generator;
– Historical averaging: the historical average of the parameters can be updated in an
online fashion so this learning rule scales well to long time series;
– One sided label smoothing: reduce the vulnerability of NNs to adversarial examples;
– Virtual batch normalization: each example x is normalized based on the statistics
collected on a reference batch of examples that are chosen once and fixed at the
start of training, and on x itself (only in the generator network, cause too expensive
computationally).
Thanks!

Deep learning for image video processing

  • 1. Deep Learning for Image/Video Processing Yu Huang Sunnyvale, California yu.huang07@gmail.com
  • 2. Outline • Image denoising • Denoiser prior • Image deconvolution • Image/depth superesolution • Image restoration • DehazeNet • Artifact reduction • Image enhancement • Edge aware filters • Joint image processing • DeepContour • DeepEdge • Holistically-nested edge detection • Boundary detection • Inpainting • Colorization • Appendix: deep learning
  • 3. Image Denoising by Conv. Nets • Image denoising is a learning problem to training Conv. Net; – Parameter estimation to minimize the reconstruction error. • Online learning (rather than batch learning): stochastic gradient – Gradient update from 6x6 patches sampled from 6 different training images • Run like greedy layer-wise training for each layer.
  • 4. Image Denoising by MLP • Denoising as learning: map noisy patches to noise-free ones; – Patch size 17x17; • Training with different noise types and levels: – Sigma=25; noise as Gaussian, stripe, salt-and-pepper, coding artifact; • Feed-forward NN: MLP; – input layer 289-d, four hidden layers (2047-d), output layer 289-d. – input layer 169-d, four hidden layers (511-d), output layer 169-d. • 40 million training images from LabelMe and Berkeley segmentation! • 1000 testing images: Mcgill, Pascal VOC 2007; • GPU: slower than BM3D, much faster than KSVD. • Deep learning can help: unsupervised learning from unlabelled data.
  • 5. Image Restoration by CNN • Collect a dataset of clean/corrupted image pairs which are then used to train a specialized form of convolutional neural network. • Given a noisy image x, predict a clean image y close to the clean image y* – the input kernels p1 = 16, the output kernel pL = 8. – 2 hidden layers (i.e. L = 3), each with 512 units, the middle layer kernel p2 = 1. – W1 512 kernels of size 16x16x3, W2 512 kernels of size 1x1x512, and W3 size 8x8x512. • This learns how to map corrupted image patches to clean ones, implicitly capturing the characteristic appearance of noise in natural images. – Train the weights Wl and biases bl by minimizing the mean squared error – Minimize with SGD • Regarded as: first patchifying the input, applying a fully-connected neural network to each patch, and averaging the resulting output patches.
  • 6. Image Restoration by CNN • Comparison.
  • 7. Image Deconvolution with Deep CNN – Establish the connection between traditional optimization-based schemes and a CNN architecture; – A separable structure is used as a reliable support for robust deconvolution against artifacts; – The deconvolution task can be approximated by a convolutional network by nature, based on the kernel separability theorem; – Kernel separability is achieved via SVD; • An inverse kernel with length 100 is enough for plausible deconv. results; – Image deconvolution convolutional neural network (DCNN); • Two hidden layers: h1 is 38 large-scale 1-d kernels of size 121×1, and h2 is 381x121 convolution kernels to each in h1, output is 1×1×38 kernel; • Random-weight initialization or from the separable kernel inversion; – Concatenation of deconvolution CNN module with denoising CNN; • called “Outlier-rejection Deconvolution CNN (ODCNN)”; – 2 million sharp patches together with their blurred versions in training.
  • 9. Learning Deep CNN Denoiser Prior for Image Restoration • With the aid of variable splitting techniques, denoiser prior can be plugged in as a modular part of model-based optimization methods to solve other inverse problems (e.g., deblurring). • Such an integration induces considerable advantage when the denoiser is obtained via discriminative learning. • Train a set of fast and effective CNN denoisers and integrate them into model-based optimization method to solve other inverse problems. • Use Dilated Filter to enlarge Receptive Field. • Use Batch Normaliz. and Residual Learning to accelerate training. • Use training samples with small size to Help avoid boundary Artifacts. • Learning specific denoiser model with small interval noise levels.
  • 10. Learning Deep CNN Denoiser Prior for Image Restoration • It consists of 7 layers with 3 blocks, i.e., “Dilated Convolution + ReLU” block in the 1st layer, 5 “Dilated Convolution + Batch Normalization + ReLU” blocks in the middle layers, and “Dilated Convolution” block in the last layer. • The dilation factors of (3×3) dilated convolutions from 1st layer to the last layer are set to 1, 2, 3, 4, 3, 2 and 1, respectively. The architecture of the CNN denoiser network
  • 11. Pixel Recurrent Neural Networks • A deep neural network that sequentially predicts the pixels in an image along the two spatial dimensions. • It models the discrete probability of the raw pixel values and encodes the complete set of dependencies in the image. • Fast 2-d recurrent layers and an effective use of residual connections in deep recurrent networks.
  • 12. Pixel Recurrent Neural Networks input-to-state and state-to-state mappings  Row LSTM is a unidirectional layer that processes the image row by row from top to bottom computing features for a whole row at once; the computation is performed with a one-dimensional convolution.  Diagonal BiLSTM is designed to both parallelize the computation and to capture the entire available context for any image size. Each of the two directions of the layer scans the image in a diagonal fashion starting from a corner at the top and reaching the opposite corner at the bottom. Each step in the computation computes at once the LSTM state along a diagonal in the image.  PixelCNN uses multiple convolutional layers that preserve the spatial resolution; pooling layers are not used.  Multi-Scale PixelRNN is composed of an unconditional PixelRNN and one or more conditional PixelRNNs.
  • 13. DehazeNet by CNN for Dehaze DehazeNet conceptually consists of four sequential operations (feature extraction, multi-scale mapping, local extremum and non-linear regression), which is constructed by 3 convolution layers, a max-pooling, a Maxout unit and a BReLU activation function.
  • 14. Removing rain from single images via a deep detail network • Removing rain streaks from individual images based on deep CNN. • ResNet simplifies the learning process by changing the mapping form, so directly reduce the mapping range from input to output, which makes the learning process easier. • A priori image domain knowledge by focusing on high frequency detail during training, which removes BG interference and focuses the model on the structure of rain in images. • It not only has benefits for high-level vision tasks but also can be used to solve low level imaging problems.
  • 15. Removing rain from single images via a deep detail network
  • 16. Removing rain from single images via a deep detail network The five network architectures for the rain removal problem: Direct network, Neg- mapping, ResNet, ResNet+Neg-mapping and the deep detail network (from left to right). Note: SSIM of (b)–(g) are 0.774, 0.490, 0.926, 0.936, 0.938 and 0.940, respectively.
  • 17. Image Restoration Using Convolutional Auto- encoders with Symmetric Skip Connections • Image restoration, including image denoising, super resolution, inpainting, and so on; • A deep fully convolutional auto-encoder network for image restoration, which is a encoding-decoding framework with symmetric convolutional-deconvolutional layers. • The convol. layers capture abstraction of images while eliminating corruptions. • Deconvol. layers have the ability to upsample feature maps, recover image details. • Symmetrically link convolutional and deconvolutional layers with skip-layer connections, with which the training converges faster and attains better results. • These skip connections allow the signal to be back-propagated to bottom layers directly, and thus tackles the problem of gradient vanishing, making training deep networks easier and achieving restoration performance gains consequently. • They pass image details from convolutional layers to deconvolutional layers, which is beneficial in recovering the clean image. • Using the same framework, to train models on tasks of image denoising, super resolution removing JPEG compression artifacts, non-blind image deblurring and image inpainting.
  • 18. Image Restoration Using Convolutional Auto- encoders with Symmetric Skip Connections The network contains layers of symmetric convolution (encoder) and deconvolution (decoder). Skip shortcuts are connected every a few (for instance, two) layers from convolutional feature maps to their mirrored deconvolutional feature maps. The response from a convolutional layer is directly propagated to the corresponding mirrored deconvolutional layer, both forwardly and backwardly.
  • 19. FormResNet: Formatted Residual Learning for Image Restoration • A deep CNN to tackle the image restoration problem by learning the structured residual. • Image restoration by learning structured details and recovering latent clean image, from the shared info. btw corrupted and latent images. • A residual formatting layer to format residual to structured info., which allows to converge faster and boosts the performance. • A cross-level loss net to ensure both pixel-level accuracy and semantic-level visual quality.
  • 20. FormResNet: Formatted Residual Learning for Image Restoration (a) FormResNet: orange block represents the formatting layer; (b) cross-level loss net: incorporate pixel-wise L2 norm, gradient consistency, and semantic high-level features, to better describe the similarity between network inference and ground truth label; (c) RecursiveFormResNet: takes convol. layers as the formatting layer in (a). It can be performed in a recursive fashion. ⊕ denotes pixel-wise subtraction/summation.
  • 21. Deep Convolution Networks for Compression Artifacts Reduction • A compact and efficient network for seamless attenuation of different compression artifacts. • Accelerate the model by layer decomposition and joint use of large- stride convolutional and deconvolutional layers. • A more general CNN framework that has a close relationship with the conventional Multi-Layer Perceptron (MLP). • A deeper model can be effectively trained with features learned in a shallow network. • Transfer learning in low-level vision problems.
  • 22. Deep Convolution Networks for Compression Artifacts Reduction There are two main modifications based on the AR-CNN. First, the layer decomposition splits the original “feature enhancement” layer into a “shrinking” layer and an “enhancement” layer. Then the large-stride convolutional and deconvolutional layers significantly decrease the spatial size of the feature maps of the middle layers. The overall shape of the framework is like an hourglass, which is thick at the ends and thin in the middle.
  • 23. Automatic Photo Adjustment Using Deep Learning • Explore the use of deep learning in the context of photo editing; • Introduce an image descriptor (pixel, context and global) that accounts for the local semantics of an image. Middle (from top to bottom): input image, semantic label map and the ground truth for the Local Xpro effect; Left and right: color mapping scatter plots for four semantic regions.
  • 24. Automatic Photo Adjustment Using Deep Learning The architecture of the DNN Multi-scale spatial pooling schema Pipeline for constructing the semantic label map
  • 25. Automatic Photo Adjustment Using Deep Learning Three Stylistic Local Effects: 1. Local Xpro, 2. Foreground Pop-Out, 3. Watercolor.
  • 26. Deep Bilateral Learning for Real-Time Image Enhancement • Inspired by bilateral grid processing and local affine color transforms. • Using pairs of input/output images, train a CNN to predict the coefficients of a locally-affine model in bilateral space. • Learn to make content-dependent decisions to approximate the desired image transformation. • The NN consumes a LR version of the input image, produces a set of affine transformations in bilateral space, upsamples those transformations in an edge-preserving fashion using a new slicing node, and then applies those upsampled transformations to the full-resolution image.
  • 27. Deep Bilateral Learning for Real-Time Image Enhancement Perform as much computation as possible at a low resolution, while still capturing high-frequency effects at full image resolution. It consists of two distinct streams operating at different resolutions. The LR stream processes a downsampled version of the input I through several conv. layers so as to estimate a bilateral grid of affine coefficients. This LR stream is further split in two paths to learn both local and global features, which are fused before making the final prediction. The global and local paths share a common set of low-level features. In turn, the HR stream performs a minimal yet critical amount of work: it learns a grayscale guidance map used by our new slicing node to upsample the grid of affine coefficients back to full-resolution. These per-pixel local affine transformations are then applied to the full-resolution input, which yields the final output.
  • 28. Deep Edge aware filters • To learn a big important family of edge-aware operators from data. • Based on a deep CNN with a gradient domain training procedure, to approximate various filters without knowing the original models. • Enable fast approximation for complex edge-aware filters and achieves up to 200x acceleration. • Using spatially varying filter or filter combination. FW(I) - a unified feed-forward process, I - input image, F - network architecture, W - network parameters. edge-aware filtering operators - L(I)
  • 29. Deep Edge aware filters A unified learning pipeline for various edge-aware filtering techniques.
  • 30. Deep Edge aware filters
  • 31. Deep Depth Super-Resolution : Learning Depth Super-Resolution using Deep CNN • Learn the mapping from a low resolution depth image to a high resolution one in an end-to-end style. • To better regularize the learned depth map, exploit the depth field statistics and the local correlation btw depth image and color image. • These priors are integrated in an energy minimization formulation, where the deep NN learns the unary term, the depth field statistics works as global model constraint and the color depth correlation is utilized to enforce the local structure in depth images. P extracts the gradients along X and Y directions. The color modulated smoothness term The total variation Energy minimization
  • 32. Deep Depth Super-Resolution : Learning Depth Super-Resolution using Deep CNN CNN gradually learns the high frequency components in depth images
  • 33. Depth Map Super-Resolution by Deep Multi-Scale Guidance • Depth map super resolution in which a HR depth map is inferred from a LR depth map and an additional HR intensity image of the same scene. • Multi-Scale Guided convolutional network (MSG-Net) for depth map SR. • MSG-Net complements LR depth features with HR intensity features using a multi-scale fusion strategy. • Such a multi-scale guidance allows the network to better adapt for upsampling of both fine- and large-scale structures. • Specifically, the rich hierarchical HR intensity features at different levels progressively resolve ambiguity in depth map upsampling. • A high-frequency domain training method to not only reduce training time but also facilitate the fusion of depth and intensity features.
  • 34. Depth Map Super-Resolution by Deep Multi-Scale Guidance
  • 35. Depth Map Super-Resolution by Deep Multi-Scale Guidance
  • 36. Depth Map Super-Resolution by Deep Multi-Scale Guidance The network architecture of MS-Net for single-image super resolution.
  • 37. Depth Map Super-Resolution by Deep Multi-Scale Guidance (a) Color image. (b) Ground truth. (c) LR by 8. (d) SRCNN (e) MSG-Net
  • 38. Accelerating the Super-Resolution Convolutional Neural Network • A compact hourglass-shape CNN structure for faster and better Super- Resolution Convolutional Neural Network (SRCNN). • Introduce a deconvolution layer, then the mapping is learned directly from the original LR image (without interpolation) to the HR one. • Reformulate the mapping layer by shrinking the input feature dimension before mapping and expanding back afterwards. • Adopt smaller filter sizes but more mapping layers.
  • 39. Accelerating the Super-Resolution Convolutional Neural Network The FSRCNN consists of convolution layers and a deconvolution layer. The convolution layers can be shared for different upscaling factors. A specific deconvolution layer is trained for different upscaling factors.
  • 40. Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution • Laplacian Pyramid Super-Resolution Network (LapSRN) to progressively reconstruct sub-band residuals of HR images. • At each pyramid level, the model takes coarse-resolution feature maps as input, predicts the HF residuals, and uses transposed convolutions for upsampling to the finer level. • Not require bicubic interpolation as the pre-processing step. • Train the LapSRN with deep supervision using a Charbonnier loss function and achieve high-quality reconstruction. • The network generates multi-scale predictions in one feed- forward pass through the progressive reconstruction, thereby facilitates resource-aware applications.
  • 41. Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution
  • 42. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network • SRGAN, a generative adversarial network (GAN) for image superresolution (SR). • Capable of inferring photo-realistic natural images for 4 upscaling factors. • A perceptual loss function which consists of an adversarial loss and a content loss. – The adversarial loss pushes the solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. – A content loss motivated by perceptual similarity instead of similarity in pixel space. • The deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks.
  • 43. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
  • 44. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
  • 45. Deep Joint Image Filtering • Learning-based to construct a joint filter based on CNN. • In contrast to considering only the guidance image, it selectively transfers salient structures that are consistent in both guidance and target images. – The sub-networks CNNT and CNNG aim to extract informative feature responses from the target and guidance images, respectively. – These responses are concatenated together as input for network CNNF. – Finally, model CNNF reconstructs the desired output by selectively transferring main structures while suppressing inconsistent structures.
  • 46. Deep Joint Image Filtering The model consists of three major components Given M training image samples minimizing the summed squared loss
  • 47. Deep Joint Image Filtering Joint depth upsampling (8×) using different network architectures f1-f2-... where fi is the filter size of the i-th layer. (a) GT depth map (inset: Guidance). (b) Bicubic upsampling. (c)-(e) using CNNF. (f) using CNNT + CNNG + CNNF.
  • 48. • Integration from multiple scales and semantic levels via multi-streams of interlinked, layered, non-linear “deep” processing; – Deep belief net with a variant of the mean-and-covariance RBM; • Unsupervised feature learning; – Supervised boundary prediction by feed forward NN. Deep Neural Prediction Network for Visual Boundary
  • 49. Deep Neural Prediction Network for Visual Boundary
  • 50. DeepContour: A Deep Convolutional Feature Learned by Positive-sharing Loss for Contour Detection CNN structure: explicitly visualizing the dimensions of each network layers. • Contour detection accuracy can be improved by instead making the use of the deep features learned from CNNs. • Customize the training strategy by partitioning contour (positive) data into subclasses and fitting each subclass by different model parameters. • A new loss function, named positive-sharing loss, in which each subclass shares the loss for the whole positive class to learn the parameters • It introduces an extra regularizer to emphasizes the losses for the positive and negative classes, which facilitates to explore more discriminative features.
  • 51. DeepEdge: A Multi-Scale Bifurcated Deep Network for Top-Down Contour Detection • Run the Canny edge detector to get candidate contour points. • Around each candidate point, extract patches at four different scales and simultaneously run them through the five convolutional layers of the KNet. • Connect these convolutional layers to two separately-trained network branches. • The first branch is trained for classification, the second is trained as a regressor. • Outputs from these two sub-networks are averaged to produce the final score.
  • 52. Holistically-Nested Edge Detection • An edge detection algorithm that addresses two important issues: (1) holistic image training and prediction; and (2) multi-scale and multi-level feature learning. • Holistically-nested edge detection (HED), performs image-to-image prediction by means of a deep learning model that leverages fully CNNs and deeply-supervised nets. • HED automatically learns rich hierarchical representations to resolve the challenging ambiguity in edge/boundary detection. (a) multi-stream architecture; (b) skip-layer net architecture; (c) single model on multi-scale inputs; (d) separate training of networks; (e) holistically-nested architectures.
  • 53. Holistically-Nested Edge Detection The receptive field and stride size in VGGNet used in HED. Deep supervision with side output layers to produce multi-scale dense predictions. Left: the side outputs become progressively coarser and more “global”, while critical object boundaries are preserved. Right: the predictions tends to lack any discernible order (e.g. in layers 1 and 2), and many boundaries are lost in later stages.
  • 54. BOUNDARY DETECTION USING DEEP LEARNING A image is processed at 3 different scales in order to obtain multi-scale information. The 3 scales are fused and sent as input to the NCuts, that delivers eigenvectors and the resulting ‘Spectral Boundaries’. The latter are fused with the original boundary map, non-maximum suppressed, and optionally thresholded.
  • 55. BOUNDARY DETECTION USING DEEP LEARNING Network architecture for multi-resolution HED training: 3 differently scaled versions of the input image are provided as inputs to 3 FCNN networks that share weights - their multi-resolution outputs are fused in a late fusion stage, extending DSN to multi-resolution training.
  • 56. • A deep learning algorithm for contour detection with a fully convolutional encoder-decoder network; • Different from previous low-level edge detection, focuses on detecting higher-level object contours. • Trained e2e with refined ground truth from inaccurate polygon annotations, yielding much higher precision in object contour detection; • Learned model generalizes well to unseen object classes from the same super-categories on MS COCO and can match state-of-the-art edge detection on BSDS500 with fine-tuning. • By combining with the multiscale combinatorial grouping algorithm, generate high-quality segmented object proposals, which significantly advance the state-of-the-art with a relatively small amount of candidates. Object Contour Detection with a Full Conv. Encoder-Decoder Network
  • 57. Object Contour Detection with a Full Conv. Encoder-Decoder Network Architecture of the fully convolutional encoder-decoder network
  • 58. Context Encoders: Feature Learning by Inpainting • Unsupervised feature learning driven by context-based pixel prediction. • By analogy with auto-encoders, Context Encoders – a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. • Context encoders need to both understand the content of the entire image, as well as produce a plausible hypothesis for the missing part(s). • When training context encoders, both a standard pixel-wise reconstruction loss, as well as a reconstruction plus an adversarial loss. • A context encoder learns a representation that captures not just appearance but also the semantics of visual structures.
  • 59. Context Encoders: Feature Learning by Inpainting (a) Context encoder trained with joint reconstruction and adversarial loss for semantic inpainting.
  • 60. Context Encoders: Feature Learning by Inpainting (b) Context encoder trained with reconstruction loss for feature learning by filling in arbitrary region dropouts in the input.
  • 61. High-Resolution Image Inpainting using Multi-Scale Neural Patch Synthesis • A multi-scale neural patch synthesis approach based on joint optimization of image content and texture constraints, which not only preserves contextual structures but also produces high-frequency details by matching and adapting patches with the most similar mid-layer feature correlations of a deep classification network. Solve for an unknown image x using two loss functions, the holistic content loss (Ec) and the local texture loss (Et).
  • 62. High-Resolution Image Inpainting using Multi-Scale Neural Patch Synthesis The network architecture for structured content prediction. Unlike the L2 loss architecture, replace all ReLU/ReLU leaky layers with the ELU layer and adopted fully-connected layers instead of channel-wise fully-connected layers. The ELU unit makes the regression network training more stable than the ReLU leaky layers as it can handle large negative responses during the training process.
  • 63. Semantic Image Inpainting with Deep Generative Models • It generates the missing content by conditioning on the available data. • Given a trained generative model, search for the closest encoding of the corrupted image in the latent image manifold using context and prior losses. • This encoding is then passed through the generative model to infer the missing content. • Inference is possible irrespective of how the missing content is structured, while the SoA learning based method requires specific info. about the holes in the training phase. • It successfully predicts info. in large missing regions and achieves pixel-level photorealism.
  • 64. Semantic Image Inpainting with Deep Generative Models Framework for inpainting. (a) Given a GAN model trained on real images, iteratively update z to find the closest mapping on the latent image manifold, based on the designed loss functions. (b) Manifold traversing when iteratively updating z using BP. z (0) is random initialed; z (k) denotes the result in k-th iteration; and zˆ the final solution.
  • 65. Semantic Image Inpainting with Deep Generative Models CE: Contextual Encoder, GAN: Generative Adversarial Network
  • 66. Globally and Locally Consistent Image Completion • With a FCN, complete images of arbitrary resolutions by filling- in missing regions of any shape. • To train this image completion network to be consistent, use global and local context discriminators that are trained to distinguish real images from completed ones. • The global discriminator looks at the entire image to assess if it is coherent as a whole, while the local discriminator looks only at a small area centered at the completed region to ensure the local consistency of the generated patches. • The network is trained to fool both context discriminator networks, which requires it to generate images that are indistinguishable from real ones with regard to overall consistency as well as in details.
  • 67. Globally and Locally Consistent Image Completion Overview of learning image completion. It consists of a completion network and two auxiliary context discriminator networks that are used only for training the completion network. The global discriminator network takes the entire image as input, while the local discriminator network takes only a small region around the completed area as input. Both discriminator networks are trained to determine if an image is real or completed by the completion network, while the completion network is trained to fool both discriminator networks.
  • 68. Globally and Locally Consistent Image Completion Architecture of the image completion network. Architectures of the discriminators used in the model.
  • 69. Globally and Locally Consistent Image Completion
  • 70. Generative Face Completion • Face completion using a deep generative model, a more challenging problem; • To generate semantically new pixels for the missing key components (e.g., eyes and mouths) that contain large appearance variations. • Directly generates contents for missing regions based on a neural network. • The model is trained with a combination of a reconstruction loss, two adversarial losses and a semantic parsing loss, which ensures pixel faithfulness and local-global contents consistency.
  • 71. Generative Face Completion Network architecture. It consists of one generator, two discriminators and a parsing network. The generator takes the masked image as input and outputs the generated image. Two discriminators are learned to distinguish the synthesize contents in the mask and whole generated image as real and fake. The parsing network, which is a pretrained model and remains fixed, is to further ensure the new generated contents more photo- realistic and encourage consistency between new and old pixels. Note that only the generator is needed during the testing.
  • 72. Convolutional Neural Pyramid for Image Processing • A principled convolutional neural pyramid (CNP) framework for general low-level vision and image processing tasks. • The pyramid structure can greatly enlarge the field while not sacrificing computation efficiency. • Adaptive network depth and progressive upsampling for quasi-real-time testing on VGA-size input. • A broad set of applications, i.e. depth/RGB image restoration, completion, noise/artifact removal, edge refinement, image filtering, enhancement and colorization.
  • 73. Convolutional Neural Pyramid for Image Processing Illustration of convol. neural pyramid. (a) shows the convol. pyramid structure. (b) and (c) are the feature extraction and mapping components respectively. Conv(x, y) denotes the convolution operation, where x is the kernel size and y is the number of output.
  • 74. Convolutional Neural Pyramid for Image Processing
  • 75. Learning Recursive Filters for Low-Level Vision via a Hybrid Neural Network • Low-level vision problems (e.g., edge-preserving filtering and denoising) as recursive image filtering via a hybrid neural network. • The network contains several spatially variant RNNs as equivalents of a group of distinct recursive filters for each pixel, and a deep CNN that learns the weights of RNNs. • The deep CNN can learn regulations of recurrent propagation for various tasks and effectively guides recurrent propagation over an entire image. • The model does not need a large number of convolutional channels nor big kernels to learn features for low-level vision filters. • It is significantly smaller and faster in comparison with a deep CNN based image filter.
  • 76. Learning Recursive Filters for Low-Level Vision via a Hybrid Neural Network An illustrative example of the Hybrid NN for edge-preserving image smoothing with a single RNN.
  • 77. Learning Recursive Filters for Low-Level Vision via a Hybrid Neural Network The hybrid network that contains a group of RNNs to filter/restore an image and a deep CNN to learn to propagate the RNNs. The process of filtering/restoration is carried out via RNNs with two inputs and one output result. Both parts are trained jointly in an end-to-end fashion.
  • 78. Colorful Image Colorization • A fully automatic approach that produces vibrant and realistic colorizations. • A classification task and use class-rebalancing at training time to increase the diversity of colors in the result. • Colorization can be a powerful pretext task for self-supervised feature learning, acting as a cross-channel encoder. The network architecture. Each conv layer refers to a block of 2 or 3 repeated conv and ReLU layers, followed by a BatchNorm layer. The net has no pool layers. All changes in resolution are achieved through spatial downsampling or upsampling btw conv blocks.
  • 79. Colorful Image Colorization Classification loss with rebalancing produces more accurate and vibrant results than a regression loss or a classification loss without rebalancing.
  • 81. Deep Learning • Representation learning attempts to automatically learn good features or representations; • Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (intermediate and high level features); • Become effective via unsupervised pre-training + supervised fine tuning; – Deep networks trained with back propagation (without unsupervised pre- training) perform worse than shallow networks. • Deal with the curse of dimensionality (smoothing & sparsity) and over- fitting (unsupervised, regularizer); • Semi-supervised: structure of manifold assumption; – labeled data is scarce and unlabeled data is abundant.
  • 82. Why Deep Learning? • Supervised training of deep models (e.g. many-layered Nets) is too hard (optimization problem); • Learn prior from unlabeled data; • Shallow models are not for learning high-level abstractions; • Ensembles or forests do not learn features first; • Graphical models could be deep net, but mostly not. • Unsupervised learning could be “local-learning”; • Resemble boosting with each layer being like a weak learner • Learning is weak in directed graphical models with many hidden variables; • Sparsity and regularizer. • Traditional unsupervised learning methods aren’t easy to learn multiple levels of representation. • Layer-wised unsupervised learning is the solution. • Multi-task learning (transfer learning and self taught learning); • Other issues: scalability & parallelism with the burden from big data.
  • 83. Multi Layer Neural Network • A neural network = running several logistic regressions at the same time; – Neuron=logistic regression or… • Calculate error derivatives (gradients) to refine: back propagate the error derivative through model (the chain rule) – Online learning: stochastic/incremental gradient descent; – Batch learning: conjugate gradient descent.
  • 84. Convolutional Neural Networks (CNN) • CNN is a special kind of multi-layer NNs applied to 2-d arrays (usually images), based on spatially localized neural input; – local receptive fields(shifted window), shared weights (weight averaging) across the hidden units, and often, spatial or temporal sub-sampling; – Related to generative MRF/discriminative CRF: • CNN=Field of Experts MRF=ML inference in CRF; – Generate ‘patterns of patterns’ for pattern recognition. • Each layer combines (merge, smooth) patches from previous layers – Pooling /Sampling (e.g., max or average) filter: compress and smooth the data. – Convolution filters: (translation invariance) unsupervised; – Local contrast normalization: increase sparsity, improve optimization/invariance. C layers convolutions, S layers pool/sample
  • 85. Convolutional Neural Networks (CNN) • Convolutional Networks are trainable multistage architectures composed of multiple stages; • Input and output of each stage are sets of arrays called feature maps; • At output, each feature map represents a particular feature extracted at all locations on input; • Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer; • A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module; – A fully connected layer: softmax transfer function for posterior distribution. • Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature map; • Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function; – In rectified function, gi is a trainable gain parameter, might be followed a contrast normalization N; • Feature pooling: treats each feature map separately -> a reduced-resolution output feature map; • Supervised training is performed using a form of SGD to minimize the prediction error; – Gradients are computed with the back-propagation method. • Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine- tuning. * is discrete convolution operator
  • 86.
  • 87. LeNet (LeNet-5) • A layered model composed of convolution and subsampling operations followed by a holistic representation and ultimately a classifier for handwritten digits; • Local receptive fields (5x5) with local connections; • Output via a RBF function, one for each class, with 84 inputs each; • Learning by Graph Transformer Networks (GTN);
  • 88. AlexNet • A layered model composed of convol., subsample., followed by a holistic representation and all-in-all a landmark classifier; • Consists of 5 convolutional layers, some of which followed by max-pooling layers, 3 fully-connected layers with a final 1000- way softmax; • Fully-connected “FULL” layers: linear classifiers/matrix multiplications; • ReLU are rectified-linear nonlinearities on layer output, can be trained several times faster; • Local normalization scheme aids generalization; • Overlapping pooling slightly less prone to overfitting; • Data augmentation: artificially enlarge the dataset using label- preserving transformations; • Dropout: setting to zero output of each hidden neuron with prob. 0.5; • Trained by SGD with batch # 128, momentum 0.9, weight decay 0.0005.
  • 89. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264-4096–4096–1000. AlexNet
  • 90. MattNet • Matthew Zeiler from the startup company “Clarifai”, winner of ImageNet Classification in 2013; • Preprocessing: subtracting a per-pixel mean; • Data augmentation: downsampled to 256 pixels and a random 224 pixel crop is taken out of the image and randomly flipped horizontally to provide more views of each example; • SGD with min-batch # 128, learning rate annealing, momentum 0.9 and dropout to prevent overfitting; • 65M parameters trained for 12 days on a single Nvidia GPU; • Visualization by layered DeconvNets: project the feature activations back to the input pixel space; – Reveal input stimuli exciting individual feature maps at any layer; – Observe evolution of features during training; – Sensitivity analysis of the classifier output by occluding portions to reveal which parts of scenes are important; • DeconvNet attached to each of ConvNet layer, unpooling uses locations of maxima to preserve structure; • Multiple such models were averaged together to further boost performance; • Supervised pre-training with AlexNet, then modify it to get better performance (error rate 14.8%).
  • 91. Architecture of an eight layer ConvNet model. Input: 224 by 224 crop of an image (with 3 color planes). # 1-5 layers Convolution: 96 filters, 7x7, stride of 2 in both x and y. Feature maps: (i) via a rectified linear function, (ii) 3x3 max pooled (stride 2), (iii) contrast normalized 55x55 feature maps. # 6-7 layers: fully connected, input in vector form (6x6x256 = 9216 dimensions). The final layer: a C-way softmax function, C - number of classes. MattNet
  • 92. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct approximate version of convnet features from the layer beneath. Bottom: Unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet. MattNet
  • 93. Deep Belief Networks • A hybrid model: can be trained as generative or discriminative model; • Deep architecture: multiple layers (learn features layer by layer); • Multi layer learning is difficult in sigmoid belief networks. • Top two layers are undirected connections, Restricted Boltzmann Machine (RBM); • Lower layers get top down directed connections from layers above; • Unsupervised or self-taught pre-learning provides a good initialization; • Greedy layer-wise unsupervised training for RBM; • Supervised fine-tuning • Generative: wake-sleep algorithm (Up-down); • Discriminative: back propagation (bottom-up); Belief net is directed acyclic graph composed of stochastic variables.
  • 94. Deep Boltzmann Machine • Boltzmann machine is a stochastic recurrent model, and RBM is its special case (one hidden layer); • Learning internal representations that become increasingly complex; • High-level representations built from a large supply of unlabeled inputs; • Pre-training: learning a stack of modified RBMs, which are composed to create a deep Boltzmann machine (undirected graph); • Generative fine-tuning: different from DBN • Positive and negative phase • Discriminative fine-tuning: the same to DBN • Back propagation.
  • 95. Stacked Denoising Auto-Encoder • Denoising Auto-Encoder: Multilayer NNs with target output=input; • Auto-encoder learns the salient variation like a nonlinear PCA; • Stack many (may be sparse) auto-encoders in succession and train them using greedy layer-wise unsupervised learning • Drop the decode layer each time • Performs better than stacking RBMs; • Supervised training on the last layer using final features; • (option) Supervised training on the entire network to fine- tune all weights of the neural net; • Empirically not quite as accurate as DBNs.
  • 96. Stochastic Gradient Descent (SGD) • The general class of estimators that arise as minimizers of sums are called M-estimators; • Where are stationary points of the likelihood function (or zeroes of its derivative, the score function)? • Online gradient descent samples a subset of summand functions at every step; • The true gradient of is approximated by a gradient at a single example; • Shuffling of training set at each pass. • There is a compromise between two forms, often called "mini-batches", where the true gradient is approximated by a sum over a small number of training examples. • STD converges almost surely to a global minimum when the objective function is convex or pseudo-convex, and otherwise converges almost surely to a local minimum.
  • 97. Back Propagation • Minimize a per-sample loss E(f(x0,w), y0), e.g. the negative log-likelihood of the correct label, E(f(x0,w), y0) = −log f(x0,w)y0, by propagating error gradients backward through the layers.
  • 98. Loss function • Euclidean loss is used for regressing to real-valued labels in [-inf, inf]; • Sigmoid cross-entropy loss is used for predicting K independent probability values in [0,1]; • Softmax (normalized exponential) loss is used for predicting a single class out of K mutually exclusive classes; – A generalization of the logistic function that "squashes" a K-dimensional vector of arbitrary real values z to a K-dimensional vector of real values σ(z) in the range (0, 1). – The predicted probability for the j-th class given a sample vector x is σ(z)j = exp(zj) / Σk exp(zk). • Sigmoidal or softmax normalization is a way of reducing the influence of extreme values or outliers in the data without removing them from the dataset.
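A small NumPy sketch of the three losses listed above, assuming plain 1-D score/target arrays; the max-shift in the softmax and the clipping in the sigmoid loss are standard numerical-stability details rather than something stated on the slide.

```python
import numpy as np

def softmax(z):
    """sigma(z)_j = exp(z_j) / sum_k exp(z_k), shifted by max(z) for stability."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def softmax_loss(z, label):
    """Negative log-probability of the true class (multinomial / softmax loss)."""
    return -np.log(softmax(z)[label])

def sigmoid_cross_entropy(z, targets):
    """K independent probabilities in [0,1]: sum of per-unit binary cross-entropies."""
    p = 1.0 / (1.0 + np.exp(-z))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.sum(targets * np.log(p) + (1 - targets) * np.log(1 - p))

def euclidean_loss(pred, target):
    """Squared error for real-valued regression targets."""
    return 0.5 * np.sum((pred - target) ** 2)
```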
  • 99. Variable Learning Rate • Too large a learning rate – causes oscillation in the search for the minimum • Too small a learning rate – too slow convergence to the minimum • Adaptive learning rate – At the beginning, the learning rate can be large when the current point is far from the optimum; – Gradually, the learning rate decays as time goes by. • Should not be too large or too small: – annealing rate α(t) = α(0)/(1 + t/T) – α(t) will eventually go to zero, but at the beginning it is almost a constant.
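A one-function sketch of the annealing schedule α(t) = α(0)/(1 + t/T) quoted above; the default values of α(0) and T are arbitrary.

```python
def annealed_lr(t, lr0=0.1, T=1000.0):
    """alpha(t) = alpha(0) / (1 + t/T): nearly constant for t << T, decays ~1/t later."""
    return lr0 / (1.0 + t / T)

# example: almost flat early on, then decaying
print([round(annealed_lr(t), 4) for t in (0, 100, 1000, 10000)])
```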
  • 102. Data Augmentation for Overfitting • The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations; • Perturb an image I by transformations that leave the underlying class unchanged (e.g. cropping and flipping) in order to generate additional examples of the class; • Two distinct forms of data augmentation: – image translations and horizontal reflections – altering the intensities of the RGB channels
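A minimal sketch of the crop-and-flip augmentation described above, assuming an H×W×3 NumPy array (e.g. a 256×256 image cropped to 224×224); the function name and defaults are illustrative.

```python
import numpy as np

def augment(img, crop=224, rng=np.random.default_rng()):
    """Label-preserving augmentation: random crop plus random horizontal flip.
    `img` is assumed to be an HxWx3 array with H, W >= crop."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]        # horizontal reflection
    return patch
```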
  • 103. Weight Decay for Overfitting • Weight decay or L2 regularization adds a penalty term to the error function, called the regularization term: the negative log prior in the Bayesian justification, – Weight decay works as rescaling weights in the learning rule, but bias learning stays the same; – Prefers to learn small weights; large weights are allowed only if they improve the original cost function; – A way of compromising between finding small weights and minimizing the original cost function; • In a linear model, weight decay is equivalent to ridge (Tikhonov) regression; • L1 regularization: weights that are not really useful shrink by a constant amount toward zero; – Acts like a form of feature selection; – Makes the input filters cleaner and easier to interpret; • L2 regularization penalizes large values strongly, while L1 regularization penalizes all weights at a constant rate and drives many of them exactly to zero; • Markov Chain Monte Carlo (MCMC): simulate a Markov chain whose equilibrium distribution is the posterior distribution of the weights & hyper-parameters; • Hybrid Monte Carlo: gradient and sampling.
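A sketch of how L2 weight decay enters a plain SGD update, rescaling the weights while leaving the bias update untouched; the decay coefficient and function name are illustrative.

```python
def sgd_step_with_weight_decay(w, b, grad_w, grad_b, lr=0.01, wd=1e-4):
    """L2 weight decay: equivalent to shrinking w by (1 - lr*wd) before the
    usual gradient step; the bias update is unchanged."""
    w = (1.0 - lr * wd) * w - lr * grad_w
    b = b - lr * grad_b
    return w, b
```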
  • 104. Early Stopping for Overfitting • Steps in early stopping: – Divide the available data into training and validation sets. – Use a large number of hidden units. – Use very small random initial values. – Use a slow learning rate. – Compute the validation error rate periodically during training. – Stop training when the validation error rate "starts to go up". • Early stopping has several advantages: – It is fast. – It can be applied successfully to networks in which the number of weights far exceeds the sample size. – It requires only one major decision by the user: what proportion of validation cases to use. • Practical issues in early stopping: – How many cases do you assign to the training and validation sets? – Do you split the data into training and validation sets randomly or by some systematic algorithm? – How do you tell when the validation error rate "starts to go up"?
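A sketch of the early-stopping loop outlined above, with hypothetical `train_one_epoch`, `validation_error` and `model.copy()` placeholders standing in for a real training framework; the patience criterion is one common way to decide that the validation error "starts to go up".

```python
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=200, patience=10):
    """Stop once the validation error has not improved for `patience` epochs;
    `train_one_epoch` and `validation_error` are placeholder callables."""
    best_err, best_state, wait = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_state, wait = err, model.copy(), 0   # assumes model.copy() exists
        else:
            wait += 1
            if wait >= patience:          # validation error "starts to go up"
                break
    return best_state, best_err
```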
  • 105. Dropout and Maxout for Overfitting • Dropout: set the output of each hidden neuron to zero with probability 0.5. – Motivation: Combining many different models that share parameters succeeds in reducing test errors by approximately averaging together the predictions, which resembles bagging. – The units which are “dropped out” in this way do not contribute to the forward pass and do not participate in back propagation. – So every time an input is presented, the NN samples a different architecture, but all these architectures share weights. – This technique reduces complex co-adaptations of units, since a neuron cannot rely on the presence of particular other units. – It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other units. – Without dropout, the network exhibits substantial overfitting. – Dropout roughly doubles the number of iterations required to converge. • Maxout takes the maximum across multiple feature maps;
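A sketch of (inverted) dropout in the forward pass, keeping the slide's drop probability of 0.5 as the default; rescaling the surviving units at training time is an implementation choice that leaves the test-time pass unchanged.

```python
import numpy as np

def dropout_forward(h, p_drop=0.5, train=True, rng=np.random.default_rng()):
    """Inverted dropout: zero each unit with probability p_drop during training
    and rescale the survivors, so the test-time forward pass needs no change."""
    if not train or p_drop == 0.0:
        return h
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask
```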
  • 106. MCMC Sampling for Optimization • Markov Chain: a stochastic process in which future states are independent of past states given the present state. – A Markov chain will typically converge to a stable distribution. • Markov Chain Monte Carlo: sampling using ‘local’ information – Devise a Markov chain whose stationary distribution is the target. • An ergodic MC must be aperiodic, irreducible, and positive recurrent. – Monte Carlo integration to get quantities of interest. • Metropolis-Hastings method: sampling from a target distribution – Create a Markov chain whose transition matrix does not depend on the normalization term. – Make sure the chain has a stationary distribution and that it equals the target distribution (acceptance ratio). – After a sufficient number of iterations, the chain will converge to the stationary distribution. • Gibbs sampling is a special case of M-H sampling. – The Hammersley-Clifford Theorem: get the joint distribution from the complete conditional distributions. • Hybrid Monte Carlo: a gradient sub-step within each Markov chain step.
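A minimal random-walk Metropolis-Hastings sampler, illustrating that only an unnormalized (log-)density is needed in the acceptance ratio; the standard-normal toy target and step size are illustrative.

```python
import numpy as np

def metropolis_hastings(log_p, x0=0.0, steps=10000, step_size=1.0, seed=0):
    """Random-walk M-H: propose x' ~ N(x, step_size^2) and accept with
    probability min(1, p(x')/p(x)); the normalization constant cancels."""
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(steps):
        x_new = x + step_size * rng.normal()
        if np.log(rng.random()) < log_p(x_new) - log_p(x):
            x = x_new
        samples.append(x)
    return np.array(samples)

# toy target: unnormalized standard normal
samples = metropolis_hastings(lambda x: -0.5 * x * x)
print(samples.mean(), samples.std())   # roughly 0 and 1
```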
  • 107. Mean Field for Optimization • Variational approximation modifies the optimization problem to make it tractable, at the price of an approximate solution; • Mean Field replaces M with a (simple) subset M(F), on which A*(μ) has a closed form (Note: F is a disconnected subgraph); – The density becomes a factorized product distribution in this sub-family. – Objective: K-L divergence. • Mean field is a structured variational approximation approach: – Coordinate ascent (deterministic); • Compared with stochastic approximation (sampling): – Faster, but maybe not exact.
  • 108. Contrastive Divergence for RBMs • Contrastive divergence (CD) was first proposed for training PoE (products of experts), and is also a quicker way to learn RBMs; – Contrastive divergence as the new objective; – Take gradients and ignore a term which is usually very small. • Steps: – Start with a training vector on the visible units. – Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. • Can be applied using any MCMC algorithm to simulate the model (not limited to just Gibbs sampling); • CD learning is biased: it does not exactly follow the gradient of the likelihood • Improved: Persistent CD explores more modes of the distribution – Rather than restarting from data samples, begin sampling from the negative (model) samples obtained at the last gradient update. – Still suffers from divergence of the likelihood due to missing modes. • Score matching: the score function does not depend on the normalization factor, so match it between the model and the empirical density.
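A sketch of one CD-1 update for a binary RBM, following the steps above (positive phase from the data, a single Gibbs step for the negative phase); array shapes, the learning rate and the use of probabilities for the reconstruction are implementation choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b_v, b_h, v0, lr=0.01, rng=np.random.default_rng()):
    """One CD-1 step for a binary RBM; v0 is a batch of visible vectors (N x n_v),
    W is n_v x n_h, b_v and b_h are the visible/hidden biases."""
    # positive phase: hidden probabilities and samples given the data
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # one Gibbs step: reconstruct the visibles, then hidden probabilities again
    pv1 = sigmoid(h0 @ W.T + b_v)
    ph1 = sigmoid(pv1 @ W + b_h)
    # approximate gradient: <v h>_data - <v h>_reconstruction
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b_v += lr * (v0 - pv1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)
    return W, b_v, b_h
```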
  • 109. “Wake-Sleep” Algorithm for DBN • A pre-trained DBN is a generative model; • Do a stochastic bottom-up pass (wake phase) – Get samples from the factorial distribution (visible first, then generate hidden); – Adjust the top-down weights to be good at reconstructing the feature activities in the layer below. • Do a few iterations of sampling in the top-level RBM – Adjust the weights of the top-level RBM. • Do a stochastic top-down pass (sleep phase) – Get visible and hidden samples generated by the generative model (“data coming from nowhere”); – Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above. – Any guarantee of improvement? No! • The “Wake-Sleep” algorithm tries to make the representation economical, in the sense of Shannon’s coding theory.
  • 110. Greedy Layer-Wise Training • Deep networks tend to have more local-minima problems than shallow networks during supervised training • Train the first layer using unlabeled data – Supervised or semi-supervised: use more unlabeled data. • Freeze the first-layer parameters and train the second layer • Repeat this for as many layers as desired – Build more robust features • Use the outputs of the final layer to train the last supervised layer (leave early weights frozen) • Fine-tune the full network with a supervised approach; • Avoids the problems of training a deep net in a purely supervised fashion. – Each layer gets full learning – Helps with ineffective early-layer learning – Helps with deep-network local minima
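A sketch of the greedy layer-wise procedure above for a stack of auto-encoders; `make_autoencoder` and its `fit`/`encode` methods are hypothetical placeholders for whatever unsupervised module (RBM, denoising auto-encoder, etc.) is used per layer.

```python
def greedy_layerwise_pretrain(layer_sizes, X, make_autoencoder, epochs=10):
    """Train one (possibly sparse) auto-encoder per layer on the previous layer's
    codes, freezing earlier layers; `make_autoencoder(n_in, n_hid)` is a
    placeholder factory returning an object with fit() and encode()."""
    encoders, codes = [], X
    for n_in, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
        ae = make_autoencoder(n_in, n_hid)
        ae.fit(codes, epochs=epochs)      # unsupervised: target output = input
        codes = ae.encode(codes)          # drop the decoder, keep the encoder
        encoders.append(ae)
    return encoders, codes                # `codes` feed the final supervised layer
```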
  • 111. Why Does Greedy Layer-Wise Training Work? • Takes advantage of the unlabeled data; • Regularization Hypothesis – Pre-training “constrains” the parameters to a region relevant to the unsupervised dataset; – Better generalization (representations that better describe unlabeled data are more discriminative for labeled data); • Optimization Hypothesis – Unsupervised training initializes the lower-level parameters near localities of better minima than random initialization can. • Only fine-tuning is needed in the supervised learning stage.
  • 112. Generative Modeling • Have training examples x ~ pdata(x) • Want a model that draws samples: x ~ pmodel(x) • Where pmodel ≈ pdata • Conditional generative models – Speech synthesis: Text ⇒ Speech – Machine Translation: French ⇒ English • French: Si mon tonton tond ton tonton, ton tonton sera tondu. • English: If my uncle shaves your uncle, your uncle will be shaved. – Image ⇒ Image segmentation • Environment simulator – Reinforcement learning – Planning • Leverage unlabeled data
  • 113. Adversarial Nets Framework • A game between two players: – 1. Discriminator D – 2. Generator G • D tries to discriminate between: – A sample from the data distribution. – And a sample from the generator G. • G tries to “trick” D by generating samples that are hard for D to distinguish from data.
  • 114. GANs • A framework for estimating generative models via an adversarial process, which trains 2 models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. • The training procedure for G is to maximize the probability of D making a mistake. • This framework corresponds to a minimax two-player game (the value function is written out below): – In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere; – In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with back propagation. – There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples.
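For reference, the value function of this minimax game from the original GAN formulation:

```latex
\min_G \max_D V(D,G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```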
  • 115.–117. GANs (figure-only slides)
  • 118. GANs Rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and “deconvolutional” generator).
  • 119. How to Train a GAN? Tips and Tricks • 1: Normalize the inputs • 2: A modified loss function • 3: Use a spherical Z (Gaussian rather than uniform distribution) • 4: Batch Norm • 5: Avoid sparse gradients: – ReLU, MaxPool • 6: Use soft and noisy labels • 7: DCGAN / Hybrid Models – KL + GAN or VAE + GAN • 8: Use stability tricks from RL • 9: Use the ADAM optimizer for the generator (SGD for the discriminator) • 10: Track failures early – check norms of gradients • 11: Don’t balance the losses via statistics (unless you have a good reason to) • 12: If you have labels, use them – Auxiliary GANs • 13: Add noise to inputs, decayed over time • 14: [not sure] Train the discriminator more (sometimes), especially when there is noise • 15: [not sure] Batch discrimination • 16: Discrete variables in C-GANs • 17: Dropout in G in both train and test stages
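A small sketch of tip 6 (soft and noisy labels) for the discriminator targets; the smoothing ranges and flip probability are illustrative values, not prescribed by the list above.

```python
import numpy as np

def soft_noisy_labels(batch_size, real=True, flip_p=0.05,
                      rng=np.random.default_rng()):
    """Tip 6: replace hard 1/0 discriminator targets with values near 1 or 0,
    and occasionally flip a label to keep the discriminator from saturating."""
    if real:
        labels = rng.uniform(0.8, 1.0, size=batch_size)   # soft "real"
    else:
        labels = rng.uniform(0.0, 0.2, size=batch_size)   # soft "fake"
    flip = rng.random(batch_size) < flip_p                # occasional label noise
    labels[flip] = 1.0 - labels[flip]
    return labels
```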
  • 120. Improved Techniques for Training GANs • For semi-supervised learning and for generating images that humans find visually realistic; • Techniques that are heuristically motivated to encourage convergence: – Feature matching addresses the instability of GANs by specifying a new objective for the generator that prevents it from over-training on the current discriminator (see the sketch below); – Allow the discriminator to look at multiple data examples in combination and perform what is called “minibatch discrimination”: any discriminator model that looks at multiple examples in combination, rather than in isolation, could potentially help avoid collapse of the generator; – Historical averaging: the historical average of the parameters can be updated in an online fashion, so this learning rule scales well to long time series; – One-sided label smoothing: reduces the vulnerability of NNs to adversarial examples; – Virtual batch normalization: each example x is normalized based on the statistics collected from a reference batch of examples that are chosen once and fixed at the start of training, and on x itself (only in the generator network, because it is computationally expensive).
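A sketch of the feature-matching idea: make the generator match the mean of an intermediate discriminator activation on real versus generated batches; `features` is a hypothetical callable standing in for that intermediate layer.

```python
import numpy as np

def feature_matching_loss(features, real_batch, fake_batch):
    """Feature matching: || E_x f(x) - E_z f(G(z)) ||^2 on an intermediate
    discriminator activation f; `features` is a placeholder callable that
    maps a batch of inputs to a batch of feature vectors."""
    f_real = np.mean(features(real_batch), axis=0)
    f_fake = np.mean(features(fake_batch), axis=0)
    return np.sum((f_real - f_fake) ** 2)
```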