SlideShare a Scribd company logo
1 of 121
Download to read offline
Deep Learning for Image/Video
Processing
Yu Huang
Sunnyvale, California
yu.huang07@gmail.com
Outline
• Image denoising
• Denoiser prior
• Image deconvolution
• Image/depth superesolution
• Image restoration
• DehazeNet
• Artifact reduction
• Image enhancement
• Edge aware filters
• Joint image processing
• DeepContour
• DeepEdge
• Holistically-nested edge
detection
• Boundary detection
• Inpainting
• Colorization
• Appendix: deep learning
Image Denoising by Conv. Nets
• Image denoising is a learning problem to training Conv. Net;
– Parameter estimation to minimize the reconstruction error.
• Online learning (rather than batch learning): stochastic gradient
– Gradient update from 6x6 patches sampled from 6 different training images
• Run like greedy layer-wise training for each layer.
Image Denoising by MLP
• Denoising as learning: map noisy patches to noise-free ones;
– Patch size 17x17;
• Training with different noise types and levels:
– Sigma=25; noise as Gaussian, stripe, salt-and-pepper, coding artifact;
• Feed-forward NN: MLP;
– input layer 289-d, four hidden layers (2047-d), output layer 289-d.
– input layer 169-d, four hidden layers (511-d), output layer 169-d.
• 40 million training images from LabelMe and Berkeley segmentation!
• 1000 testing images: Mcgill, Pascal VOC 2007;
• GPU: slower than BM3D, much faster than KSVD.
• Deep learning can help: unsupervised learning from unlabelled data.
Image Restoration by CNN
• Collect a dataset of clean/corrupted image pairs which are then used to train a
specialized form of convolutional neural network.
• Given a noisy image x, predict a clean image y close to the clean image y*
– the input kernels p1 = 16, the output kernel pL = 8.
– 2 hidden layers (i.e. L = 3), each with 512 units, the middle layer kernel p2 = 1.
– W1 512 kernels of size 16x16x3, W2 512 kernels of size 1x1x512, and W3 size 8x8x512.
• This learns how to map corrupted image patches to clean ones, implicitly
capturing the characteristic appearance of noise in natural images.
– Train the weights Wl and biases bl by minimizing the mean squared error
– Minimize with SGD
• Regarded as: first patchifying the input, applying a fully-connected neural network to each
patch, and averaging the resulting output patches.
Image Restoration by CNN
• Comparison.
Image Deconvolution with Deep CNN
– Establish the connection between traditional optimization-based schemes
and a CNN architecture;
– A separable structure is used as a reliable support for robust deconvolution
against artifacts;
– The deconvolution task can be approximated by a convolutional network
by nature, based on the kernel separability theorem;
– Kernel separability is achieved via SVD;
• An inverse kernel with length 100 is enough for plausible deconv. results;
– Image deconvolution convolutional neural network (DCNN);
• Two hidden layers: h1 is 38 large-scale 1-d kernels of size 121×1, and h2 is
381x121 convolution kernels to each in h1, output is 1×1×38 kernel;
• Random-weight initialization or from the separable kernel inversion;
– Concatenation of deconvolution CNN module with denoising CNN;
• called “Outlier-rejection Deconvolution CNN (ODCNN)”;
– 2 million sharp patches together with their blurred versions in training.
Image Deconvolution with Deep CNN
Learning Deep CNN Denoiser Prior for
Image Restoration
• With the aid of variable splitting techniques, denoiser prior can be plugged
in as a modular part of model-based optimization methods to solve other
inverse problems (e.g., deblurring).
• Such an integration induces considerable advantage when the denoiser is
obtained via discriminative learning.
• Train a set of fast and effective CNN denoisers and integrate them into
model-based optimization method to solve other inverse problems.
• Use Dilated Filter to enlarge Receptive Field.
• Use Batch Normaliz. and Residual Learning to accelerate training.
• Use training samples with small size to Help avoid boundary Artifacts.
• Learning specific denoiser model with small interval noise levels.
Learning Deep CNN Denoiser Prior for
Image Restoration
• It consists of 7 layers with 3 blocks, i.e., “Dilated Convolution + ReLU” block
in the 1st layer, 5 “Dilated Convolution + Batch Normalization + ReLU” blocks
in the middle layers, and “Dilated Convolution” block in the last layer.
• The dilation factors of (3×3) dilated convolutions from 1st layer to the last
layer are set to 1, 2, 3, 4, 3, 2 and 1, respectively.
The architecture of the CNN denoiser network
Pixel Recurrent Neural Networks
• A deep neural network that sequentially predicts the pixels in an image
along the two spatial dimensions.
• It models the discrete probability of the raw pixel values and encodes
the complete set of dependencies in the image.
• Fast 2-d recurrent layers and an effective use of residual connections in
deep recurrent networks.
Pixel Recurrent Neural Networks
input-to-state and state-to-state mappings
 Row LSTM is a unidirectional layer that processes the
image row by row from top to bottom computing
features for a whole row at once; the computation is
performed with a one-dimensional convolution.
 Diagonal BiLSTM is designed to both parallelize the
computation and to capture the entire available context
for any image size. Each of the two directions of the
layer scans the image in a diagonal fashion starting from
a corner at the top and reaching the opposite corner at
the bottom. Each step in the computation computes at
once the LSTM state along a diagonal in the image.
 PixelCNN uses multiple convolutional layers that
preserve the spatial resolution; pooling layers are not
used.
 Multi-Scale PixelRNN is composed of an unconditional
PixelRNN and one or more conditional PixelRNNs.
DehazeNet by CNN for Dehaze
DehazeNet conceptually consists of four sequential operations (feature
extraction, multi-scale mapping, local extremum and non-linear regression),
which is constructed by 3 convolution layers, a max-pooling, a Maxout unit
and a BReLU activation function.
Removing rain from single images via
a deep detail network
• Removing rain streaks from individual images based on
deep CNN.
• ResNet simplifies the learning process by changing the
mapping form, so directly reduce the mapping range from
input to output, which makes the learning process easier.
• A priori image domain knowledge by focusing on high
frequency detail during training, which removes BG
interference and focuses the model on the structure of rain
in images.
• It not only has benefits for high-level vision tasks but also
can be used to solve low level imaging problems.
Removing rain from single images via
a deep detail network
Removing rain from single images via
a deep detail network
The five network architectures for the rain
removal problem: Direct network, Neg-
mapping, ResNet, ResNet+Neg-mapping and
the deep detail network (from left to right).
Note: SSIM of (b)–(g) are 0.774,
0.490, 0.926, 0.936, 0.938 and
0.940, respectively.
Image Restoration Using Convolutional Auto-
encoders with Symmetric Skip Connections
• Image restoration, including image denoising, super resolution, inpainting, and so on;
• A deep fully convolutional auto-encoder network for image restoration, which is a
encoding-decoding framework with symmetric convolutional-deconvolutional layers.
• The convol. layers capture abstraction of images while eliminating corruptions.
• Deconvol. layers have the ability to upsample feature maps, recover image details.
• Symmetrically link convolutional and deconvolutional layers with skip-layer
connections, with which the training converges faster and attains better results.
• These skip connections allow the signal to be back-propagated to bottom layers
directly, and thus tackles the problem of gradient vanishing, making training deep
networks easier and achieving restoration performance gains consequently.
• They pass image details from convolutional layers to deconvolutional layers, which is
beneficial in recovering the clean image.
• Using the same framework, to train models on tasks of image denoising, super
resolution removing JPEG compression artifacts, non-blind image deblurring and
image inpainting.
Image Restoration Using Convolutional Auto-
encoders with Symmetric Skip Connections
The network contains layers of symmetric convolution (encoder) and deconvolution
(decoder). Skip shortcuts are connected every a few (for instance, two) layers from
convolutional feature maps to their mirrored deconvolutional feature maps. The
response from a convolutional layer is directly propagated to the corresponding
mirrored deconvolutional layer, both forwardly and backwardly.
FormResNet: Formatted Residual
Learning for Image Restoration
• A deep CNN to tackle the image restoration
problem by learning the structured residual.
• Image restoration by learning structured details
and recovering latent clean image, from the
shared info. btw corrupted and latent images.
• A residual formatting layer to format residual to
structured info., which allows to converge faster
and boosts the performance.
• A cross-level loss net to ensure both pixel-level
accuracy and semantic-level visual quality.
FormResNet: Formatted Residual
Learning for Image Restoration
(a) FormResNet: orange block represents the formatting layer; (b) cross-level loss net: incorporate pixel-wise
L2 norm, gradient consistency, and semantic high-level features, to better describe the similarity between
network inference and ground truth label; (c) RecursiveFormResNet: takes convol. layers as the formatting
layer in (a). It can be performed in a recursive fashion. ⊕ denotes pixel-wise subtraction/summation.
Deep Convolution Networks for
Compression Artifacts Reduction
• A compact and efficient network for seamless attenuation of
different compression artifacts.
• Accelerate the model by layer decomposition and joint use of large-
stride convolutional and deconvolutional layers.
• A more general CNN framework that has a close relationship with
the conventional Multi-Layer Perceptron (MLP).
• A deeper model can be effectively trained with features learned in a
shallow network.
• Transfer learning in low-level vision problems.
Deep Convolution Networks for
Compression Artifacts Reduction
There are two main modifications based on the AR-CNN. First, the layer
decomposition splits the original “feature enhancement” layer into a “shrinking”
layer and an “enhancement” layer. Then the large-stride convolutional and
deconvolutional layers significantly decrease the spatial size of the feature maps
of the middle layers. The overall shape of the framework is like an hourglass,
which is thick at the ends and thin in the middle.
Automatic Photo Adjustment Using Deep Learning
• Explore the use of deep learning in the context of photo editing;
• Introduce an image descriptor (pixel, context and global) that accounts for the local semantics
of an image.
Middle (from top to bottom):
input image, semantic label
map and the ground truth for
the Local Xpro effect;
Left and right: color mapping
scatter plots for four semantic
regions.
Automatic Photo Adjustment Using Deep Learning
The architecture of the DNN
Multi-scale spatial pooling schema
Pipeline for constructing the semantic label map
Automatic Photo Adjustment Using Deep Learning
Three Stylistic Local Effects:
1. Local Xpro,
2. Foreground Pop-Out,
3. Watercolor.
Deep Bilateral Learning for Real-Time
Image Enhancement
• Inspired by bilateral grid processing and local affine color transforms.
• Using pairs of input/output images, train a CNN to predict the coefficients
of a locally-affine model in bilateral space.
• Learn to make content-dependent decisions to approximate the desired
image transformation.
• The NN consumes a LR version of the input image, produces a set of affine
transformations in bilateral space, upsamples those transformations in an
edge-preserving fashion using a new slicing node, and then applies those
upsampled transformations to the full-resolution image.
Deep Bilateral Learning for Real-Time
Image Enhancement
Perform as much computation as possible at a low resolution, while still capturing high-frequency effects at full image resolution.
It consists of two distinct streams operating at different resolutions. The LR stream processes a downsampled version of the
input I through several conv. layers so as to estimate a bilateral grid of affine coefficients. This LR stream is further split in two
paths to learn both local and global features, which are fused before making the final prediction. The global and local paths share
a common set of low-level features. In turn, the HR stream performs a minimal yet critical amount of work: it learns a grayscale
guidance map used by our new slicing node to upsample the grid of affine coefficients back to full-resolution. These per-pixel
local affine transformations are then applied to the full-resolution input, which yields the final output.
Deep Edge aware filters
• To learn a big important family of edge-aware operators from
data.
• Based on a deep CNN with a gradient domain training
procedure, to approximate various filters without knowing the
original models.
• Enable fast approximation for complex edge-aware filters and
achieves up to 200x acceleration.
• Using spatially varying filter or filter combination.
FW(I) - a unified feed-forward process, I - input image,
F - network architecture, W - network parameters.
edge-aware filtering operators - L(I)
Deep Edge aware filters
A unified learning pipeline for various edge-aware filtering techniques.
Deep Edge aware filters
Deep Depth Super-Resolution : Learning
Depth Super-Resolution using Deep CNN
• Learn the mapping from a low resolution depth image to a high
resolution one in an end-to-end style.
• To better regularize the learned depth map, exploit the depth field
statistics and the local correlation btw depth image and color image.
• These priors are integrated in an energy minimization formulation,
where the deep NN learns the unary term, the depth field statistics
works as global model constraint and the color depth correlation is
utilized to enforce the local structure in depth images.
P extracts the gradients along X and Y directions.
The color modulated smoothness term
The total variation
Energy minimization
Deep Depth Super-Resolution : Learning
Depth Super-Resolution using Deep CNN
CNN gradually learns the high frequency components in depth images
Depth Map Super-Resolution
by Deep Multi-Scale Guidance
• Depth map super resolution in which a HR depth map is inferred from a
LR depth map and an additional HR intensity image of the same scene.
• Multi-Scale Guided convolutional network (MSG-Net) for depth map SR.
• MSG-Net complements LR depth features with HR intensity features
using a multi-scale fusion strategy.
• Such a multi-scale guidance allows the network to better adapt for
upsampling of both fine- and large-scale structures.
• Specifically, the rich hierarchical HR intensity features at different levels
progressively resolve ambiguity in depth map upsampling.
• A high-frequency domain training method to not only reduce training
time but also facilitate the fusion of depth and intensity features.
Depth Map Super-Resolution
by Deep Multi-Scale Guidance
Depth Map Super-Resolution
by Deep Multi-Scale Guidance
Depth Map Super-Resolution
by Deep Multi-Scale Guidance
The network architecture of MS-Net for single-image super resolution.
Depth Map Super-Resolution
by Deep Multi-Scale Guidance
(a) Color image. (b) Ground truth. (c) LR by 8. (d) SRCNN (e) MSG-Net
Accelerating the Super-Resolution
Convolutional Neural Network
• A compact hourglass-shape CNN structure for faster and better Super-
Resolution Convolutional Neural Network (SRCNN).
• Introduce a deconvolution layer, then the mapping is learned directly from
the original LR image (without interpolation) to the HR one.
• Reformulate the mapping layer by shrinking the input feature dimension
before mapping and expanding back afterwards.
• Adopt smaller filter sizes but more mapping layers.
Accelerating the Super-Resolution
Convolutional Neural Network
The FSRCNN consists of convolution layers and a deconvolution layer.
The convolution layers can be shared for different upscaling factors. A
specific deconvolution layer is trained for different upscaling factors.
Deep Laplacian Pyramid Networks for Fast
and Accurate Super-Resolution
• Laplacian Pyramid Super-Resolution Network (LapSRN) to
progressively reconstruct sub-band residuals of HR images.
• At each pyramid level, the model takes coarse-resolution
feature maps as input, predicts the HF residuals, and uses
transposed convolutions for upsampling to the finer level.
• Not require bicubic interpolation as the pre-processing step.
• Train the LapSRN with deep supervision using a Charbonnier
loss function and achieve high-quality reconstruction.
• The network generates multi-scale predictions in one feed-
forward pass through the progressive reconstruction,
thereby facilitates resource-aware applications.
Deep Laplacian Pyramid Networks for Fast
and Accurate Super-Resolution
Photo-Realistic Single Image Super-Resolution Using a
Generative Adversarial Network
• SRGAN, a generative adversarial network (GAN) for image superresolution (SR).
• Capable of inferring photo-realistic natural images for 4 upscaling factors.
• A perceptual loss function which consists of an adversarial loss and a content loss.
– The adversarial loss pushes the solution to the natural image manifold using a
discriminator network that is trained to differentiate between the super-resolved images
and original photo-realistic images.
– A content loss motivated by perceptual similarity instead of similarity in pixel space.
• The deep residual network is able to recover photo-realistic textures from heavily
downsampled images on public benchmarks.
Photo-Realistic Single Image Super-Resolution Using a
Generative Adversarial Network
Photo-Realistic Single Image Super-Resolution Using a
Generative Adversarial Network
Deep Joint Image Filtering
• Learning-based to construct a joint filter based on CNN.
• In contrast to considering only the guidance image, it selectively
transfers salient structures that are consistent in both guidance
and target images.
– The sub-networks CNNT and CNNG aim to extract informative feature
responses from the target and guidance images, respectively.
– These responses are concatenated together as input for network CNNF.
– Finally, model CNNF reconstructs the desired output by selectively
transferring main structures while suppressing inconsistent structures.
Deep Joint Image Filtering
The model consists of three major components
Given M training image samples
minimizing the summed squared loss
Deep Joint Image Filtering
Joint depth upsampling (8×) using different network architectures f1-f2-...
where fi is the filter size of the i-th layer. (a) GT depth map (inset: Guidance).
(b) Bicubic upsampling. (c)-(e) using CNNF. (f) using CNNT + CNNG + CNNF.
• Integration from multiple scales and semantic levels via multi-streams of
interlinked, layered, non-linear “deep” processing;
– Deep belief net with a variant of the mean-and-covariance RBM;
• Unsupervised feature learning;
– Supervised boundary prediction by feed forward NN.
Deep Neural Prediction Network for Visual Boundary
Deep Neural Prediction Network for Visual Boundary
DeepContour: A Deep Convolutional Feature Learned
by Positive-sharing Loss for Contour Detection
CNN structure: explicitly visualizing the dimensions of each network layers.
• Contour detection accuracy can be improved by instead making the use of the
deep features learned from CNNs.
• Customize the training strategy by partitioning contour (positive) data into
subclasses and fitting each subclass by different model parameters.
• A new loss function, named positive-sharing loss, in which each subclass shares
the loss for the whole positive class to learn the parameters
• It introduces an extra regularizer to emphasizes the losses for the positive and
negative classes, which facilitates to explore more discriminative features.
DeepEdge: A Multi-Scale Bifurcated Deep Network for
Top-Down Contour Detection
• Run the Canny edge detector to get candidate contour points.
• Around each candidate point, extract patches at four different scales and
simultaneously run them through the five convolutional layers of the KNet.
• Connect these convolutional layers to two separately-trained network branches.
• The first branch is trained for classification, the second is trained as a regressor.
• Outputs from these two sub-networks are averaged to produce the final score.
Holistically-Nested Edge Detection
• An edge detection algorithm that addresses two important issues:
(1) holistic image training and prediction; and (2) multi-scale and
multi-level feature learning.
• Holistically-nested edge detection (HED), performs image-to-image
prediction by means of a deep learning model that leverages fully
CNNs and deeply-supervised nets.
• HED automatically learns rich hierarchical representations to
resolve the challenging ambiguity in edge/boundary detection.
(a) multi-stream architecture; (b) skip-layer net architecture; (c) single model on multi-scale inputs; (d)
separate training of networks; (e) holistically-nested architectures.
Holistically-Nested Edge Detection
The receptive field and stride size in
VGGNet used in HED.
Deep supervision with side output layers to produce
multi-scale dense predictions. Left: the side outputs
become progressively coarser and more “global”, while
critical object boundaries are preserved. Right: the
predictions tends to lack any discernible order (e.g. in
layers 1 and 2), and many boundaries are lost in later
stages.
BOUNDARY DETECTION USING DEEP LEARNING
A image is processed at 3 different scales in order to obtain multi-scale information. The 3 scales are
fused and sent as input to the NCuts, that delivers eigenvectors and the resulting ‘Spectral
Boundaries’. The latter are fused with the original boundary map, non-maximum suppressed, and
optionally thresholded.
BOUNDARY DETECTION USING DEEP LEARNING
Network architecture for multi-resolution HED training: 3 differently scaled versions of the input image are
provided as inputs to 3 FCNN networks that share weights - their multi-resolution outputs are fused in a late
fusion stage, extending DSN to multi-resolution training.
• A deep learning algorithm for contour detection with a fully
convolutional encoder-decoder network;
• Different from previous low-level edge detection, focuses on detecting
higher-level object contours.
• Trained e2e with refined ground truth from inaccurate polygon
annotations, yielding much higher precision in object contour
detection;
• Learned model generalizes well to unseen object classes from the
same super-categories on MS COCO and can match state-of-the-art
edge detection on BSDS500 with fine-tuning.
• By combining with the multiscale combinatorial grouping algorithm,
generate high-quality segmented object proposals, which significantly
advance the state-of-the-art with a relatively small amount of
candidates.
Object Contour Detection with a Full Conv.
Encoder-Decoder Network
Object Contour Detection with a Full Conv.
Encoder-Decoder Network
Architecture of the fully convolutional encoder-decoder network
Context Encoders: Feature Learning by Inpainting
• Unsupervised feature learning driven by context-based pixel prediction.
• By analogy with auto-encoders, Context Encoders – a convolutional neural
network trained to generate the contents of an arbitrary image region
conditioned on its surroundings.
• Context encoders need to both understand the content of the entire
image, as well as produce a plausible hypothesis for the missing part(s).
• When training context encoders, both a standard pixel-wise
reconstruction loss, as well as a reconstruction plus an adversarial loss.
• A context encoder learns a representation that captures not just
appearance but also the semantics of visual structures.
Context Encoders: Feature Learning by Inpainting
(a) Context encoder trained with joint reconstruction and adversarial loss for semantic inpainting.
Context Encoders: Feature Learning by Inpainting
(b) Context encoder trained with reconstruction loss for feature
learning by filling in arbitrary region dropouts in the input.
High-Resolution Image Inpainting using
Multi-Scale Neural Patch Synthesis
• A multi-scale neural patch synthesis approach based on joint optimization
of image content and texture constraints, which not only preserves
contextual structures but also produces high-frequency details by
matching and adapting patches with the most similar mid-layer feature
correlations of a deep classification network.
Solve for an unknown image x using two loss functions, the holistic
content loss (Ec) and the local texture loss (Et).
High-Resolution Image Inpainting using
Multi-Scale Neural Patch Synthesis
The network architecture for structured content prediction. Unlike the L2 loss
architecture, replace all ReLU/ReLU leaky layers with the ELU layer and adopted
fully-connected layers instead of channel-wise fully-connected layers. The ELU unit
makes the regression network training more stable than the ReLU leaky layers as it
can handle large negative responses during the training process.
Semantic Image Inpainting with Deep
Generative Models
• It generates the missing content by conditioning on the
available data.
• Given a trained generative model, search for the closest
encoding of the corrupted image in the latent image
manifold using context and prior losses.
• This encoding is then passed through the generative model
to infer the missing content.
• Inference is possible irrespective of how the missing
content is structured, while the SoA learning based method
requires specific info. about the holes in the training phase.
• It successfully predicts info. in large missing regions and
achieves pixel-level photorealism.
Semantic Image Inpainting with Deep
Generative Models
Framework for inpainting. (a) Given a GAN model trained on real images, iteratively
update z to find the closest mapping on the latent image manifold, based on the
designed loss functions. (b) Manifold traversing when iteratively updating z using BP. z
(0) is random initialed; z (k) denotes the result in k-th iteration; and zˆ the final solution.
Semantic Image Inpainting with Deep
Generative Models
CE: Contextual Encoder, GAN: Generative Adversarial Network
Globally and Locally Consistent Image
Completion
• With a FCN, complete images of arbitrary resolutions by filling-
in missing regions of any shape.
• To train this image completion network to be consistent, use
global and local context discriminators that are trained to
distinguish real images from completed ones.
• The global discriminator looks at the entire image to assess if it
is coherent as a whole, while the local discriminator looks only
at a small area centered at the completed region to ensure the
local consistency of the generated patches.
• The network is trained to fool both context discriminator
networks, which requires it to generate images that are
indistinguishable from real ones with regard to overall
consistency as well as in details.
Globally and Locally Consistent Image
Completion
Overview of learning image completion. It consists of a completion network and two auxiliary
context discriminator networks that are used only for training the completion network. The global
discriminator network takes the entire image as input, while the local discriminator network takes
only a small region around the completed area as input. Both discriminator networks are trained to
determine if an image is real or completed by the completion network, while the completion
network is trained to fool both discriminator networks.
Globally and Locally Consistent Image
Completion
Architecture of the image
completion network.
Architectures of the discriminators used in the model.
Globally and Locally Consistent Image
Completion
Generative Face Completion
• Face completion using a deep generative model, a more challenging problem;
• To generate semantically new pixels for the missing key components (e.g., eyes and
mouths) that contain large appearance variations.
• Directly generates contents for missing regions based on a neural network.
• The model is trained with a combination of a reconstruction loss, two adversarial
losses and a semantic parsing loss, which ensures pixel faithfulness and local-global
contents consistency.
Generative Face Completion
Network architecture. It consists of one generator, two discriminators and a parsing
network. The generator takes the masked image as input and outputs the generated
image. Two discriminators are learned to distinguish the synthesize contents in the mask
and whole generated image as real and fake. The parsing network, which is a pretrained
model and remains fixed, is to further ensure the new generated contents more photo-
realistic and encourage consistency between new and old pixels. Note that only the
generator is needed during the testing.
Convolutional Neural Pyramid for
Image Processing
• A principled convolutional neural pyramid (CNP)
framework for general low-level vision and image
processing tasks.
• The pyramid structure can greatly enlarge the field
while not sacrificing computation efficiency.
• Adaptive network depth and progressive upsampling
for quasi-real-time testing on VGA-size input.
• A broad set of applications, i.e. depth/RGB image
restoration, completion, noise/artifact removal, edge
refinement, image filtering, enhancement and
colorization.
Convolutional Neural Pyramid for
Image Processing
Illustration of convol. neural pyramid. (a) shows the convol. pyramid structure. (b) and (c)
are the feature extraction and mapping components respectively. Conv(x, y) denotes the
convolution operation, where x is the kernel size and y is the number of output.
Convolutional Neural Pyramid for
Image Processing
Learning Recursive Filters for Low-Level
Vision via a Hybrid Neural Network
• Low-level vision problems (e.g., edge-preserving filtering and
denoising) as recursive image filtering via a hybrid neural network.
• The network contains several spatially variant RNNs as equivalents
of a group of distinct recursive filters for each pixel, and a deep CNN
that learns the weights of RNNs.
• The deep CNN can learn regulations of recurrent propagation for
various tasks and effectively guides recurrent propagation over an
entire image.
• The model does not need a large number of convolutional channels
nor big kernels to learn features for low-level vision filters.
• It is significantly smaller and faster in comparison with a deep CNN
based image filter.
Learning Recursive Filters for Low-Level
Vision via a Hybrid Neural Network
An illustrative example of the Hybrid NN for edge-preserving image smoothing with a single RNN.
Learning Recursive Filters for Low-Level
Vision via a Hybrid Neural Network
The hybrid network that contains a group of RNNs to filter/restore an image and a
deep CNN to learn to propagate the RNNs. The process of filtering/restoration is
carried out via RNNs with two inputs and one output result. Both parts are trained
jointly in an end-to-end fashion.
Colorful Image Colorization
• A fully automatic approach that produces vibrant and realistic colorizations.
• A classification task and use class-rebalancing at training time to increase
the diversity of colors in the result.
• Colorization can be a powerful pretext task for self-supervised feature
learning, acting as a cross-channel encoder.
The network architecture. Each conv layer refers to a block of 2 or 3 repeated conv and
ReLU layers, followed by a BatchNorm layer. The net has no pool layers. All changes in
resolution are achieved through spatial downsampling or upsampling btw conv blocks.
Colorful Image Colorization
Classification loss with rebalancing produces more accurate and vibrant results
than a regression loss or a classification loss without rebalancing.
Appendix
Deep Learning
Deep Learning
• Representation learning attempts to automatically learn good features or
representations;
• Deep learning algorithms attempt to learn multiple levels of representation
of increasing complexity/abstraction (intermediate and high level features);
• Become effective via unsupervised pre-training + supervised fine tuning;
– Deep networks trained with back propagation (without unsupervised pre-
training) perform worse than shallow networks.
• Deal with the curse of dimensionality (smoothing & sparsity) and over-
fitting (unsupervised, regularizer);
• Semi-supervised: structure of manifold assumption;
– labeled data is scarce and unlabeled data is abundant.
Why Deep Learning?
• Supervised training of deep models (e.g. many-layered Nets) is too hard
(optimization problem);
• Learn prior from unlabeled data;
• Shallow models are not for learning high-level abstractions;
• Ensembles or forests do not learn features first;
• Graphical models could be deep net, but mostly not.
• Unsupervised learning could be “local-learning”;
• Resemble boosting with each layer being like a weak learner
• Learning is weak in directed graphical models with many hidden variables;
• Sparsity and regularizer.
• Traditional unsupervised learning methods aren’t easy to learn multiple
levels of representation.
• Layer-wised unsupervised learning is the solution.
• Multi-task learning (transfer learning and self taught learning);
• Other issues: scalability & parallelism with the burden from big data.
Multi Layer Neural Network
• A neural network = running several logistic regressions at the same time;
– Neuron=logistic regression or…
• Calculate error derivatives (gradients) to refine: back propagate the error
derivative through model (the chain rule)
– Online learning: stochastic/incremental gradient descent;
– Batch learning: conjugate gradient descent.
Convolutional Neural Networks (CNN)
• CNN is a special kind of multi-layer NNs applied to 2-d arrays (usually
images), based on spatially localized neural input;
– local receptive fields(shifted window), shared weights (weight averaging) across
the hidden units, and often, spatial or temporal sub-sampling;
– Related to generative MRF/discriminative CRF:
• CNN=Field of Experts MRF=ML inference in CRF;
– Generate ‘patterns of patterns’ for pattern recognition.
• Each layer combines (merge, smooth) patches from previous layers
– Pooling /Sampling (e.g., max or average) filter: compress and smooth the data.
– Convolution filters: (translation invariance) unsupervised;
– Local contrast normalization: increase sparsity, improve optimization/invariance.
C layers
convolutions,
S layers pool/sample
Convolutional Neural Networks (CNN)
• Convolutional Networks are trainable multistage architectures composed of multiple stages;
• Input and output of each stage are sets of arrays called feature maps;
• At output, each feature map represents a particular feature extracted at all locations on input;
• Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer;
• A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module;
– A fully connected layer: softmax transfer function for posterior distribution.
• Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature
map;
• Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function;
– In rectified function, gi is a trainable gain parameter, might be followed a contrast normalization N;
• Feature pooling: treats each feature map separately -> a reduced-resolution output feature
map;
• Supervised training is performed using a form of SGD to minimize the prediction error;
– Gradients are computed with the back-propagation method.
• Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine-
tuning.
* is discrete convolution operator
LeNet (LeNet-5)
• A layered model composed of convolution and subsampling operations
followed by a holistic representation and ultimately a classifier for
handwritten digits;
• Local receptive fields (5x5) with local connections;
• Output via a RBF function, one for each class, with 84 inputs each;
• Learning by Graph Transformer Networks (GTN);
AlexNet
• A layered model composed of convol., subsample., followed
by a holistic representation and all-in-all a landmark classifier;
• Consists of 5 convolutional layers, some of which followed by
max-pooling layers, 3 fully-connected layers with a final 1000-
way softmax;
• Fully-connected “FULL” layers: linear classifiers/matrix
multiplications;
• ReLU are rectified-linear nonlinearities on layer output, can be
trained several times faster;
• Local normalization scheme aids generalization;
• Overlapping pooling slightly less prone to overfitting;
• Data augmentation: artificially enlarge the dataset using label-
preserving transformations;
• Dropout: setting to zero output of each hidden neuron with
prob. 0.5;
• Trained by SGD with batch # 128, momentum 0.9, weight
decay 0.0005.
The network’s input is 150,528-dimensional, and the number of neurons in the network’s
remaining layers is given by 253,440–186,624–64,896–64,896–43,264-4096–4096–1000.
AlexNet
MattNet
• Matthew Zeiler from the startup company “Clarifai”, winner of ImageNet Classification in 2013;
• Preprocessing: subtracting a per-pixel mean;
• Data augmentation: downsampled to 256 pixels and a random 224 pixel crop is taken out of
the image and randomly flipped horizontally to provide more views of each example;
• SGD with min-batch # 128, learning rate annealing, momentum 0.9 and dropout to prevent
overfitting;
• 65M parameters trained for 12 days on a single Nvidia GPU;
• Visualization by layered DeconvNets: project the feature activations back to the input pixel
space;
– Reveal input stimuli exciting individual feature maps at any layer;
– Observe evolution of features during training;
– Sensitivity analysis of the classifier output by occluding portions to reveal which parts of scenes are
important;
• DeconvNet attached to each of ConvNet layer, unpooling uses locations of maxima to preserve
structure;
• Multiple such models were averaged together to further boost performance;
• Supervised pre-training with AlexNet, then modify it to get better performance (error rate
14.8%).
Architecture of an eight layer ConvNet model. Input: 224 by 224 crop of an image (with 3
color planes). # 1-5 layers Convolution: 96 filters, 7x7, stride of 2 in both x and y. Feature
maps: (i) via a rectified linear function, (ii) 3x3 max pooled (stride 2), (iii) contrast normalized
55x55 feature maps. # 6-7 layers: fully connected, input in vector form (6x6x256 = 9216
dimensions). The final layer: a C-way softmax function, C - number of classes.
MattNet
Top: A deconvnet layer (left) attached to
a convnet layer (right). The deconvnet
will reconstruct approximate version of
convnet features from the layer
beneath.
Bottom: Unpooling operation in the
deconvnet, using switches which record
the location of the local max in each
pooling region (colored zones) during
pooling in the convnet.
MattNet
Deep Belief Networks
• A hybrid model: can be trained as generative or
discriminative model;
• Deep architecture: multiple layers (learn features
layer by layer);
• Multi layer learning is difficult in sigmoid belief
networks.
• Top two layers are undirected connections,
Restricted Boltzmann Machine (RBM);
• Lower layers get top down directed
connections from layers above;
• Unsupervised or self-taught pre-learning provides
a good initialization;
• Greedy layer-wise unsupervised training for
RBM;
• Supervised fine-tuning
• Generative: wake-sleep algorithm (Up-down);
• Discriminative: back propagation (bottom-up);
Belief net is directed acyclic graph
composed of stochastic variables.
Deep Boltzmann Machine
• Boltzmann machine is a stochastic recurrent model, and RBM is its special case (one
hidden layer);
• Learning internal representations that become increasingly complex;
• High-level representations built from a large supply of unlabeled inputs;
• Pre-training: learning a stack of modified RBMs, which are composed to create a deep
Boltzmann machine (undirected graph);
• Generative fine-tuning: different from DBN
• Positive and negative phase
• Discriminative fine-tuning: the same to DBN
• Back propagation.
Stacked Denoising Auto-Encoder
• Denoising Auto-Encoder: Multilayer NNs with target output=input;
• Auto-encoder learns the salient variation like a nonlinear PCA;
• Stack many (may be sparse) auto-encoders in succession and train them using greedy
layer-wise unsupervised learning
• Drop the decode layer each time
• Performs better than stacking RBMs;
• Supervised training on the last layer using final features;
• (option) Supervised training on the entire network to fine- tune all weights of the
neural net;
• Empirically not quite as accurate as DBNs.
Stochastic Gradient Descent (SGD)
• The general class of estimators that arise as minimizers of sums are
called M-estimators;
• Where are stationary points of the likelihood function (or zeroes of its
derivative, the score function)?
• Online gradient descent samples a subset of summand functions at every
step;
• The true gradient of is approximated by a gradient at a single example;
• Shuffling of training set at each pass.
• There is a compromise between two forms, often called "mini-batches",
where the true gradient is approximated by a sum over a small number of
training examples.
• STD converges almost surely to a global minimum when the objective
function is convex or pseudo-convex, and otherwise converges almost
surely to a local minimum.
Back Propagation
E (f(x0,w),y0) = -log (f(x0,w)- y0).
Loss function
• Euclidean loss is used for regressing to real-valued lables [-inf,inf];
• Sigmoid cross-entropy loss is used for predicting K independent probability
values in [0,1];
• Softmax (normalized exponential) loss is predicting a single class of K mutually
exclusive classes;
– Generalization of the logistic function that "squashes" a K-dimensional vector
of arbitrary real values z to a K-dimensional vector of real values σ(z) in the
range (0, 1).
– The predicted probability for the j'th class given a sample vector x is
• Sigmoidal or Softmax normalization is a way of reducing the influence of
extreme values or outliers in the data without removing them from the
dataset.
Variable Learning Rate
• Too large learning rate
– cause oscillation in searching for the minimal point
• Too slow learning rate
– too slow convergence to the minimal point
• Adaptive learning rate
– At the beginning, the learning rate can be large when the current point is
far from the optimal point;
– Gradually, the learning rate will decay as time goes by.
• Should not be too large or too small:
– annealing rate 𝛼(𝑡)=𝛼(0)/(1+𝑡/𝑇)
– 𝛼(𝑡) will eventually go to zero, but at the beginning it is almost a
constant.
Variable Momentum
AdaGrad/AdaDelta
Data Augmentation for Overfitting
• The easiest and most common method to reduce
overfitting on image data is to artificially enlarge the
dataset using label-preserving transformations;
• Perturbing an image I by transformations that leave
the underlying class unchanged (e.g. cropping and
flipping) in order to generate additional examples of
the class;
• Two distinct forms of data augmentation:
– image translation
– horizontal reflections
– changing RGB intensities
Weight Decay for Overfitting
• Weight decay or L2 regularization adds a penalty term to the error function, a term
called the regularization term: the negative log prior in Bayesian justification,
– Weight decay works as rescaling weights in the learning rule, but bias learning still the
same;
– Prefer to learn small weights, and large weights allowed if improving the original cost
function;
– A way of compromising btw finding small weights and minimizing the original cost
function;
• In a linear model, weight decay is equivalent to ridge (Tikhonov) regression;
• L1 regularization: the weights not really useful shrink by a constant amount
toward zero;
– Act like a form of feature selection;
– Make the input filters cleaner and easier to interpret;
• L2 regularization penalizes large values strongly while L1 regularization ;
• Markov Chain Monte Carlo (MCMC): simulating a Markov chain whose equilibrium
distr. is the posterior distribution for weights & hyper-parameters;
• Hybrid Monte Carlo: gradient and sampling.
Early Stopping for Overfitting
• Steps in early stopping:
– Divide the available data into training and validation sets.
– Use a large number of hidden units.
– Use very small random initial values.
– Use a slow learning rate.
– Compute the validation error rate periodically during training.
– Stop training when the validation error rate "starts to go up".
• Early stopping has several advantages:
– It is fast.
– It can be applied successfully to networks in which the number of weights far exceeds
the sample size.
– It requires only one major decision by the user: what proportion of validation cases to
use.
• Practical issues in early stopping:
– How many cases do you assign to the training and validation sets?
– Do you split the data into training and validation sets randomly or by some systematic
algorithm?
– How do you tell when the validation error rate "starts to go up"?
Dropout and Maxout for Overfitting
• Dropout: set the output of each hidden neuron to zero w.p. 0.5.
– Motivation: Combining many different models that share parameters
succeeds in reducing test errors by approximately averaging together the
predictions, which resembles the bagging.
– The units which are “dropped out” in this way do not contribute to the
forward pass and do not participate in back propagation.
– So every time an input is presented, the NN samples a different architecture,
but all these architectures share weights.
– This technique reduces complex co-adaptations of units, since a neuron
cannot rely on the presence of particular other units.
– It is, therefore, forced to learn more robust features that are useful in
conjunction with many different random subsets of the other units.
– Without dropout, the network exhibits substantial overfitting.
– Dropout roughly doubles the number of iterations required to converge.
• Maxout takes the maximum across multiple feature maps;
MCMC Sampling for Optimization
• Markov Chain: a stochastic process in which future states are independent of
past states but the present state.
– Markov chain will typically converge to a stable distribution.
• Monte Carlo Markov Chain: sampling using ‘local’ information
– Devise a Markov chain whose stationary distribution is the target.
• Ergodic MC must be aperiodic, irreducible, and positive recurrent.
– Monte Carlo Integration to get quantities of interest.
• Metropolis-Hastings method: sampling from a target distribution
– Create a Markov chain whose transition matrix does not depend on the normalization term.
– Make sure the chain has a stationary distr. and it is equal to the target distr. (accept ratio).
– After sufficient number of iterations, the chain will converge the stationary distribution.
• Gibbs sampling is a special case of M-H Sampling.
– The Hammersley-Clifford Theorem: get the joint distr. from the complete conditional distr.
• Hybrid Monte Carlo: gradient sub step for each Markov chain.
Mean Field for Optimization
• Variational approximation modifies the optimization problem to
be tractable, at the price of approximate solution;
• Mean Field replaces M with a (simple) subset M(F), on which A*
(μ) is a closed form (Note: F is disconnected graph);
– Density becomes factorized product distribution in this sub-family.
– Objective: K-L divergence.
• Mean field is a structured variation approximation approach:
– Coordinate ascent (deterministic);
• Compared with stochastic approximation (sampling):
– Faster, but maybe not exact.
Contrastive Divergence for RBMs
• Contrastive divergence (CD) is proposed for training PoE first, also being
a quicker way to learn RBMs;
– Contrastive divergence as the new objective;
– Taking gradients and ignoring a term which is usually very small.
• Steps:
– Start with a training vector on the visible units.
– Then alternate between updating all the hidden units in parallel and
updating all the visible units in parallel.
• Can be applied using any MCMC algorithm to simulate the model (not
limited to just Gibbs sampling);
• CD learning is biased: not work as gradient descent
• Improved: Persistent CD explores more modes in the distribution
– Rather than from data samples, begin sampling from the mode samples,
obtained from the last gradient update.
– Still suffer from divergence of likelihood due to missing the modes.
• Score matching: the score function does not depend on its normal.
factor. So, match it b.t.w. the model with the empirical density.
“Wake-Sleep” Algorithm for DBN
• Pre-trained DBN is a generative model;
• Do a stochastic bottom-up pass (wake phase)
– Get samples from factorial distribution (visible first, then generate hidden);
– Adjust the top-down weights to be good at reconstructing the feature activities
in the layer below.
• Do a few iterations of sampling in the top level RBM
– Adjust the weights in the top-level RBM.
• Do a stochastic top-down pass (sleep phase)
– Get visible and hidden samples generated by generative model using data
coming from nowhere!
– Adjust the bottom-up weights to be good at reconstructing the feature
activities in the layer above.
– Any guarantee for improvement? No!
• The “Wake-Sleep” algorithm is trying to describe the
representation economical (Shannon’s coding theory).
Greedy Layer-Wise Training
• Deep networks tend to have more local minima problems than shallow
networks during supervised training
• Train first layer using unlabeled data
– Supervised or semi-supervised: use more unlabeled data.
• Freeze the first layer parameters and train the second layer
• Repeat this for as many layers as desire
– Build more robust features
• Use the outputs of the final layer to train the last supervised layer (leave
early weights frozen)
• Fine tune the full network with a supervised approach;
• Avoid problems to train a deep net in a supervised fashion.
– Each layer gets full learning
– Help with ineffective early layer learning
– Help with deep network local minima
Why Greedy Layer-Wise Training Works?
• Take advantage of the unlabeled data;
• Regularization Hypothesis
– Pre-training is “constraining” parameters in a region
relevant to unsupervised dataset;
– Better generalization (representations that better
describe unlabeled data are more discriminative for
labeled data) ;
• Optimization Hypothesis
– Unsupervised training initializes lower level parameters
near localities of better minima than random
initialization can.
• Only need fine tuning in the supervised learning stage.
Generative Modeling
• Have training examples x ~ pdata(x )
• Want a model that draw samples: x ~ pmodel(x )
• Where pmodel ≈ pdata
• Conditional generative models
– Speech synthesis: Text ⇒ Speech
– Machine Translation: French ⇒ English
• French: Si mon tonton tond ton tonton, ton tonton sera tondu.
• English: If my uncle shaves your uncle, your uncle will be shaved
– Image ⇒ Image segmentation
• Environment simulator
– Reinforcement learning
– Planning
• Leverage unlabeled data
x ~ pdata(x )
x ~ pmodel(x )
Adversarial Nets Framework
• A game between two
players:
– 1. Discriminator D
– 2. Generator G
• D tries to discriminate
between:
– A sample from the data
distribution.
– And a sample from the
generator G.
• G tries to “trick” D by
generating samples that
are hard for D to
distinguish from data.
GANs
• A framework for estimating generative
models via an adversarial process, to train 2
models: a generative model G that captures
the data distribution, and a discriminative
model D that estimates the probability that a
sample came from the training data rather
than G.
• The training procedure for G is to maximize
the probability of D making a mistake.
• This framework corresponds to a minimax
two-player game:
– In the space of arbitrary functions G and D, a
unique solution exists, with G recovering training
data distribution and D equal to 1/2 everywhere;
– In the case where G and D are defined by
multilayer perceptrons, the entire system can be
trained with BP.
– There is no need for any Markov chains or
unrolled approximate inference networks during
either training or generation of samples.
GANs
GANs
GANs
GANs
Rightmost column shows the nearest training example of the neighboring sample, in order to
demonstrate that the model has not memorized the training set. Samples are fair random
draws, not cherry-picked. Unlike most other visualizations of deep generative models, these
images show actual samples from the model distributions, not conditional means given
samples of hidden units. Moreover, these samples are uncorrelated because the sampling
process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully
connected model) d) CIFAR-10 (convolutional discriminator and “deconvolutional” generator).
How to Train a GAN? Tips and Tricks
• 1. Normalize the inputs
• 2: A modified loss function
• 3: Use a spherical Z (not uniform, but
Gaussian distribution)
• 4: Batch Norm
• 5: Avoid Sparse Gradients:
– ReLU, MaxPool
• 6: Use Soft and Noisy Labels
• 7: DCGAN / Hybrid Models
– KL + GAN or VAE + GAN
• 8: Use stability tricks from RL
• 9: Use the ADAM Optimizer for
generator (SGD for discriminator)
• 10: Track failures early
– check norms of gradients
• 11: Dont balance loss via statistics
(unless you have a good reason to)
• 12: If you have labels, use them
– Auxillary GANs
• 13: Add noise to inputs, decay over
time
• 14: [not sure] Train discriminator
more (sometimes) especially have
noise
• 15: [not sure] Batch Discrimination
• 16: Discrete variables in C-GANs
• 17: Dropouts in G in both train/test
stage
Improved Techniques for Training GANs
• For semi-supervised learning in generation of images that humans find visually
realistic;
• Techniques that are heuristically motivated to encourage convergence:
– Feature matching addresses the instability of GANs by specifying a new objective for
the generator that prevents it from overtraining on the current discriminator;
– Allow the discriminator to look at multiple data examples in combination, and
perform what is called “Min-batch discrimination”: any discriminator model that
looks at multiple examples in combination, rather than in isolation, could potentially
help avoid collapse of the generator;
– Historical averaging: the historical average of the parameters can be updated in an
online fashion so this learning rule scales well to long time series;
– One sided label smoothing: reduce the vulnerability of NNs to adversarial examples;
– Virtual batch normalization: each example x is normalized based on the statistics
collected on a reference batch of examples that are chosen once and fixed at the
start of training, and on x itself (only in the generator network, cause too expensive
computationally).
Thanks!

Deep learning for image video processing

  • 1. Deep Learning for Image/Video Processing Yu Huang Sunnyvale, California yu.huang07@gmail.com
  • 2. Outline • Image denoising • Denoiser prior • Image deconvolution • Image/depth superesolution • Image restoration • DehazeNet • Artifact reduction • Image enhancement • Edge aware filters • Joint image processing • DeepContour • DeepEdge • Holistically-nested edge detection • Boundary detection • Inpainting • Colorization • Appendix: deep learning
  • 3. Image Denoising by Conv. Nets • Image denoising is a learning problem to training Conv. Net; – Parameter estimation to minimize the reconstruction error. • Online learning (rather than batch learning): stochastic gradient – Gradient update from 6x6 patches sampled from 6 different training images • Run like greedy layer-wise training for each layer.
  • 4. Image Denoising by MLP • Denoising as learning: map noisy patches to noise-free ones; – Patch size 17x17; • Training with different noise types and levels: – Sigma=25; noise as Gaussian, stripe, salt-and-pepper, coding artifact; • Feed-forward NN: MLP; – input layer 289-d, four hidden layers (2047-d), output layer 289-d. – input layer 169-d, four hidden layers (511-d), output layer 169-d. • 40 million training images from LabelMe and Berkeley segmentation! • 1000 testing images: Mcgill, Pascal VOC 2007; • GPU: slower than BM3D, much faster than KSVD. • Deep learning can help: unsupervised learning from unlabelled data.
  • 5. Image Restoration by CNN • Collect a dataset of clean/corrupted image pairs which are then used to train a specialized form of convolutional neural network. • Given a noisy image x, predict a clean image y close to the clean image y* – the input kernels p1 = 16, the output kernel pL = 8. – 2 hidden layers (i.e. L = 3), each with 512 units, the middle layer kernel p2 = 1. – W1 512 kernels of size 16x16x3, W2 512 kernels of size 1x1x512, and W3 size 8x8x512. • This learns how to map corrupted image patches to clean ones, implicitly capturing the characteristic appearance of noise in natural images. – Train the weights Wl and biases bl by minimizing the mean squared error – Minimize with SGD • Regarded as: first patchifying the input, applying a fully-connected neural network to each patch, and averaging the resulting output patches.
  • 6. Image Restoration by CNN • Comparison.
  • 7. Image Deconvolution with Deep CNN – Establish the connection between traditional optimization-based schemes and a CNN architecture; – A separable structure is used as a reliable support for robust deconvolution against artifacts; – The deconvolution task can be approximated by a convolutional network by nature, based on the kernel separability theorem; – Kernel separability is achieved via SVD; • An inverse kernel with length 100 is enough for plausible deconv. results; – Image deconvolution convolutional neural network (DCNN); • Two hidden layers: h1 is 38 large-scale 1-d kernels of size 121×1, and h2 is 381x121 convolution kernels to each in h1, output is 1×1×38 kernel; • Random-weight initialization or from the separable kernel inversion; – Concatenation of deconvolution CNN module with denoising CNN; • called “Outlier-rejection Deconvolution CNN (ODCNN)”; – 2 million sharp patches together with their blurred versions in training.
  • 9. Learning Deep CNN Denoiser Prior for Image Restoration • With the aid of variable splitting techniques, denoiser prior can be plugged in as a modular part of model-based optimization methods to solve other inverse problems (e.g., deblurring). • Such an integration induces considerable advantage when the denoiser is obtained via discriminative learning. • Train a set of fast and effective CNN denoisers and integrate them into model-based optimization method to solve other inverse problems. • Use Dilated Filter to enlarge Receptive Field. • Use Batch Normaliz. and Residual Learning to accelerate training. • Use training samples with small size to Help avoid boundary Artifacts. • Learning specific denoiser model with small interval noise levels.
  • 10. Learning Deep CNN Denoiser Prior for Image Restoration • It consists of 7 layers with 3 blocks, i.e., “Dilated Convolution + ReLU” block in the 1st layer, 5 “Dilated Convolution + Batch Normalization + ReLU” blocks in the middle layers, and “Dilated Convolution” block in the last layer. • The dilation factors of (3×3) dilated convolutions from 1st layer to the last layer are set to 1, 2, 3, 4, 3, 2 and 1, respectively. The architecture of the CNN denoiser network
  • 11. Pixel Recurrent Neural Networks • A deep neural network that sequentially predicts the pixels in an image along the two spatial dimensions. • It models the discrete probability of the raw pixel values and encodes the complete set of dependencies in the image. • Fast 2-d recurrent layers and an effective use of residual connections in deep recurrent networks.
  • 12. Pixel Recurrent Neural Networks input-to-state and state-to-state mappings  Row LSTM is a unidirectional layer that processes the image row by row from top to bottom computing features for a whole row at once; the computation is performed with a one-dimensional convolution.  Diagonal BiLSTM is designed to both parallelize the computation and to capture the entire available context for any image size. Each of the two directions of the layer scans the image in a diagonal fashion starting from a corner at the top and reaching the opposite corner at the bottom. Each step in the computation computes at once the LSTM state along a diagonal in the image.  PixelCNN uses multiple convolutional layers that preserve the spatial resolution; pooling layers are not used.  Multi-Scale PixelRNN is composed of an unconditional PixelRNN and one or more conditional PixelRNNs.
  • 13. DehazeNet by CNN for Dehaze DehazeNet conceptually consists of four sequential operations (feature extraction, multi-scale mapping, local extremum and non-linear regression), which is constructed by 3 convolution layers, a max-pooling, a Maxout unit and a BReLU activation function.
  • 14. Removing rain from single images via a deep detail network • Removing rain streaks from individual images based on deep CNN. • ResNet simplifies the learning process by changing the mapping form, so directly reduce the mapping range from input to output, which makes the learning process easier. • A priori image domain knowledge by focusing on high frequency detail during training, which removes BG interference and focuses the model on the structure of rain in images. • It not only has benefits for high-level vision tasks but also can be used to solve low level imaging problems.
  • 15. Removing rain from single images via a deep detail network
  • 16. Removing rain from single images via a deep detail network The five network architectures for the rain removal problem: Direct network, Neg- mapping, ResNet, ResNet+Neg-mapping and the deep detail network (from left to right). Note: SSIM of (b)–(g) are 0.774, 0.490, 0.926, 0.936, 0.938 and 0.940, respectively.
  • 17. Image Restoration Using Convolutional Auto- encoders with Symmetric Skip Connections • Image restoration, including image denoising, super resolution, inpainting, and so on; • A deep fully convolutional auto-encoder network for image restoration, which is a encoding-decoding framework with symmetric convolutional-deconvolutional layers. • The convol. layers capture abstraction of images while eliminating corruptions. • Deconvol. layers have the ability to upsample feature maps, recover image details. • Symmetrically link convolutional and deconvolutional layers with skip-layer connections, with which the training converges faster and attains better results. • These skip connections allow the signal to be back-propagated to bottom layers directly, and thus tackles the problem of gradient vanishing, making training deep networks easier and achieving restoration performance gains consequently. • They pass image details from convolutional layers to deconvolutional layers, which is beneficial in recovering the clean image. • Using the same framework, to train models on tasks of image denoising, super resolution removing JPEG compression artifacts, non-blind image deblurring and image inpainting.
  • 18. Image Restoration Using Convolutional Auto- encoders with Symmetric Skip Connections The network contains layers of symmetric convolution (encoder) and deconvolution (decoder). Skip shortcuts are connected every a few (for instance, two) layers from convolutional feature maps to their mirrored deconvolutional feature maps. The response from a convolutional layer is directly propagated to the corresponding mirrored deconvolutional layer, both forwardly and backwardly.
  • 19. FormResNet: Formatted Residual Learning for Image Restoration • A deep CNN to tackle the image restoration problem by learning the structured residual. • Image restoration by learning structured details and recovering latent clean image, from the shared info. btw corrupted and latent images. • A residual formatting layer to format residual to structured info., which allows to converge faster and boosts the performance. • A cross-level loss net to ensure both pixel-level accuracy and semantic-level visual quality.
  • 20. FormResNet: Formatted Residual Learning for Image Restoration (a) FormResNet: orange block represents the formatting layer; (b) cross-level loss net: incorporate pixel-wise L2 norm, gradient consistency, and semantic high-level features, to better describe the similarity between network inference and ground truth label; (c) RecursiveFormResNet: takes convol. layers as the formatting layer in (a). It can be performed in a recursive fashion. ⊕ denotes pixel-wise subtraction/summation.
  • 21. Deep Convolution Networks for Compression Artifacts Reduction • A compact and efficient network for seamless attenuation of different compression artifacts. • Accelerate the model by layer decomposition and joint use of large- stride convolutional and deconvolutional layers. • A more general CNN framework that has a close relationship with the conventional Multi-Layer Perceptron (MLP). • A deeper model can be effectively trained with features learned in a shallow network. • Transfer learning in low-level vision problems.
  • 22. Deep Convolution Networks for Compression Artifacts Reduction There are two main modifications based on the AR-CNN. First, the layer decomposition splits the original “feature enhancement” layer into a “shrinking” layer and an “enhancement” layer. Then the large-stride convolutional and deconvolutional layers significantly decrease the spatial size of the feature maps of the middle layers. The overall shape of the framework is like an hourglass, which is thick at the ends and thin in the middle.
  • 23. Automatic Photo Adjustment Using Deep Learning • Explore the use of deep learning in the context of photo editing; • Introduce an image descriptor (pixel, context and global) that accounts for the local semantics of an image. Middle (from top to bottom): input image, semantic label map and the ground truth for the Local Xpro effect; Left and right: color mapping scatter plots for four semantic regions.
  • 24. Automatic Photo Adjustment Using Deep Learning The architecture of the DNN Multi-scale spatial pooling schema Pipeline for constructing the semantic label map
  • 25. Automatic Photo Adjustment Using Deep Learning Three Stylistic Local Effects: 1. Local Xpro, 2. Foreground Pop-Out, 3. Watercolor.
  • 26. Deep Bilateral Learning for Real-Time Image Enhancement • Inspired by bilateral grid processing and local affine color transforms. • Using pairs of input/output images, train a CNN to predict the coefficients of a locally-affine model in bilateral space. • Learn to make content-dependent decisions to approximate the desired image transformation. • The NN consumes a LR version of the input image, produces a set of affine transformations in bilateral space, upsamples those transformations in an edge-preserving fashion using a new slicing node, and then applies those upsampled transformations to the full-resolution image.
  • 27. Deep Bilateral Learning for Real-Time Image Enhancement Perform as much computation as possible at a low resolution, while still capturing high-frequency effects at full image resolution. It consists of two distinct streams operating at different resolutions. The LR stream processes a downsampled version of the input I through several conv. layers so as to estimate a bilateral grid of affine coefficients. This LR stream is further split in two paths to learn both local and global features, which are fused before making the final prediction. The global and local paths share a common set of low-level features. In turn, the HR stream performs a minimal yet critical amount of work: it learns a grayscale guidance map used by our new slicing node to upsample the grid of affine coefficients back to full-resolution. These per-pixel local affine transformations are then applied to the full-resolution input, which yields the final output.
  • 28. Deep Edge aware filters • To learn a big important family of edge-aware operators from data. • Based on a deep CNN with a gradient domain training procedure, to approximate various filters without knowing the original models. • Enable fast approximation for complex edge-aware filters and achieves up to 200x acceleration. • Using spatially varying filter or filter combination. FW(I) - a unified feed-forward process, I - input image, F - network architecture, W - network parameters. edge-aware filtering operators - L(I)
  • 29. Deep Edge aware filters A unified learning pipeline for various edge-aware filtering techniques.
  • 30. Deep Edge aware filters
  • 31. Deep Depth Super-Resolution : Learning Depth Super-Resolution using Deep CNN • Learn the mapping from a low resolution depth image to a high resolution one in an end-to-end style. • To better regularize the learned depth map, exploit the depth field statistics and the local correlation btw depth image and color image. • These priors are integrated in an energy minimization formulation, where the deep NN learns the unary term, the depth field statistics works as global model constraint and the color depth correlation is utilized to enforce the local structure in depth images. P extracts the gradients along X and Y directions. The color modulated smoothness term The total variation Energy minimization
  • 32. Deep Depth Super-Resolution : Learning Depth Super-Resolution using Deep CNN CNN gradually learns the high frequency components in depth images
  • 33. Depth Map Super-Resolution by Deep Multi-Scale Guidance • Depth map super resolution in which a HR depth map is inferred from a LR depth map and an additional HR intensity image of the same scene. • Multi-Scale Guided convolutional network (MSG-Net) for depth map SR. • MSG-Net complements LR depth features with HR intensity features using a multi-scale fusion strategy. • Such a multi-scale guidance allows the network to better adapt for upsampling of both fine- and large-scale structures. • Specifically, the rich hierarchical HR intensity features at different levels progressively resolve ambiguity in depth map upsampling. • A high-frequency domain training method to not only reduce training time but also facilitate the fusion of depth and intensity features.
  • 34. Depth Map Super-Resolution by Deep Multi-Scale Guidance
  • 35. Depth Map Super-Resolution by Deep Multi-Scale Guidance
  • 36. Depth Map Super-Resolution by Deep Multi-Scale Guidance The network architecture of MS-Net for single-image super resolution.
  • 37. Depth Map Super-Resolution by Deep Multi-Scale Guidance (a) Color image. (b) Ground truth. (c) LR by 8. (d) SRCNN (e) MSG-Net
  • 38. Accelerating the Super-Resolution Convolutional Neural Network • A compact hourglass-shape CNN structure for faster and better Super- Resolution Convolutional Neural Network (SRCNN). • Introduce a deconvolution layer, then the mapping is learned directly from the original LR image (without interpolation) to the HR one. • Reformulate the mapping layer by shrinking the input feature dimension before mapping and expanding back afterwards. • Adopt smaller filter sizes but more mapping layers.
  • 39. Accelerating the Super-Resolution Convolutional Neural Network The FSRCNN consists of convolution layers and a deconvolution layer. The convolution layers can be shared for different upscaling factors. A specific deconvolution layer is trained for different upscaling factors.
  • 40. Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution • Laplacian Pyramid Super-Resolution Network (LapSRN) to progressively reconstruct sub-band residuals of HR images. • At each pyramid level, the model takes coarse-resolution feature maps as input, predicts the HF residuals, and uses transposed convolutions for upsampling to the finer level. • Not require bicubic interpolation as the pre-processing step. • Train the LapSRN with deep supervision using a Charbonnier loss function and achieve high-quality reconstruction. • The network generates multi-scale predictions in one feed- forward pass through the progressive reconstruction, thereby facilitates resource-aware applications.
  • 41. Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution
  • 42. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network • SRGAN, a generative adversarial network (GAN) for image superresolution (SR). • Capable of inferring photo-realistic natural images for 4 upscaling factors. • A perceptual loss function which consists of an adversarial loss and a content loss. – The adversarial loss pushes the solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. – A content loss motivated by perceptual similarity instead of similarity in pixel space. • The deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks.
  • 43. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
  • 44. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
  • 45. Deep Joint Image Filtering • Learning-based to construct a joint filter based on CNN. • In contrast to considering only the guidance image, it selectively transfers salient structures that are consistent in both guidance and target images. – The sub-networks CNNT and CNNG aim to extract informative feature responses from the target and guidance images, respectively. – These responses are concatenated together as input for network CNNF. – Finally, model CNNF reconstructs the desired output by selectively transferring main structures while suppressing inconsistent structures.
  • 46. Deep Joint Image Filtering The model consists of three major components Given M training image samples minimizing the summed squared loss
  • 47. Deep Joint Image Filtering Joint depth upsampling (8×) using different network architectures f1-f2-... where fi is the filter size of the i-th layer. (a) GT depth map (inset: Guidance). (b) Bicubic upsampling. (c)-(e) using CNNF. (f) using CNNT + CNNG + CNNF.
  • 48. • Integration from multiple scales and semantic levels via multi-streams of interlinked, layered, non-linear “deep” processing; – Deep belief net with a variant of the mean-and-covariance RBM; • Unsupervised feature learning; – Supervised boundary prediction by feed forward NN. Deep Neural Prediction Network for Visual Boundary
  • 49. Deep Neural Prediction Network for Visual Boundary
  • 50. DeepContour: A Deep Convolutional Feature Learned by Positive-sharing Loss for Contour Detection CNN structure: explicitly visualizing the dimensions of each network layers. • Contour detection accuracy can be improved by instead making the use of the deep features learned from CNNs. • Customize the training strategy by partitioning contour (positive) data into subclasses and fitting each subclass by different model parameters. • A new loss function, named positive-sharing loss, in which each subclass shares the loss for the whole positive class to learn the parameters • It introduces an extra regularizer to emphasizes the losses for the positive and negative classes, which facilitates to explore more discriminative features.
  • 51. DeepEdge: A Multi-Scale Bifurcated Deep Network for Top-Down Contour Detection • Run the Canny edge detector to get candidate contour points. • Around each candidate point, extract patches at four different scales and simultaneously run them through the five convolutional layers of the KNet. • Connect these convolutional layers to two separately-trained network branches. • The first branch is trained for classification, the second is trained as a regressor. • Outputs from these two sub-networks are averaged to produce the final score.
  • 52. Holistically-Nested Edge Detection • An edge detection algorithm that addresses two important issues: (1) holistic image training and prediction; and (2) multi-scale and multi-level feature learning. • Holistically-nested edge detection (HED), performs image-to-image prediction by means of a deep learning model that leverages fully CNNs and deeply-supervised nets. • HED automatically learns rich hierarchical representations to resolve the challenging ambiguity in edge/boundary detection. (a) multi-stream architecture; (b) skip-layer net architecture; (c) single model on multi-scale inputs; (d) separate training of networks; (e) holistically-nested architectures.
  • 53. Holistically-Nested Edge Detection The receptive field and stride size in VGGNet used in HED. Deep supervision with side output layers to produce multi-scale dense predictions. Left: the side outputs become progressively coarser and more “global”, while critical object boundaries are preserved. Right: the predictions tends to lack any discernible order (e.g. in layers 1 and 2), and many boundaries are lost in later stages.
  • 54. BOUNDARY DETECTION USING DEEP LEARNING A image is processed at 3 different scales in order to obtain multi-scale information. The 3 scales are fused and sent as input to the NCuts, that delivers eigenvectors and the resulting ‘Spectral Boundaries’. The latter are fused with the original boundary map, non-maximum suppressed, and optionally thresholded.
  • 55. BOUNDARY DETECTION USING DEEP LEARNING Network architecture for multi-resolution HED training: 3 differently scaled versions of the input image are provided as inputs to 3 FCNN networks that share weights - their multi-resolution outputs are fused in a late fusion stage, extending DSN to multi-resolution training.
  • 56. • A deep learning algorithm for contour detection with a fully convolutional encoder-decoder network; • Different from previous low-level edge detection, focuses on detecting higher-level object contours. • Trained e2e with refined ground truth from inaccurate polygon annotations, yielding much higher precision in object contour detection; • Learned model generalizes well to unseen object classes from the same super-categories on MS COCO and can match state-of-the-art edge detection on BSDS500 with fine-tuning. • By combining with the multiscale combinatorial grouping algorithm, generate high-quality segmented object proposals, which significantly advance the state-of-the-art with a relatively small amount of candidates. Object Contour Detection with a Full Conv. Encoder-Decoder Network
  • 57. Object Contour Detection with a Full Conv. Encoder-Decoder Network Architecture of the fully convolutional encoder-decoder network
  • 58. Context Encoders: Feature Learning by Inpainting • Unsupervised feature learning driven by context-based pixel prediction. • By analogy with auto-encoders, Context Encoders – a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. • Context encoders need to both understand the content of the entire image, as well as produce a plausible hypothesis for the missing part(s). • When training context encoders, both a standard pixel-wise reconstruction loss, as well as a reconstruction plus an adversarial loss. • A context encoder learns a representation that captures not just appearance but also the semantics of visual structures.
  • 59. Context Encoders: Feature Learning by Inpainting (a) Context encoder trained with joint reconstruction and adversarial loss for semantic inpainting.
  • 60. Context Encoders: Feature Learning by Inpainting (b) Context encoder trained with reconstruction loss for feature learning by filling in arbitrary region dropouts in the input.
  • 61. High-Resolution Image Inpainting using Multi-Scale Neural Patch Synthesis • A multi-scale neural patch synthesis approach based on joint optimization of image content and texture constraints, which not only preserves contextual structures but also produces high-frequency details by matching and adapting patches with the most similar mid-layer feature correlations of a deep classification network. Solve for an unknown image x using two loss functions, the holistic content loss (Ec) and the local texture loss (Et).
  • 62. High-Resolution Image Inpainting using Multi-Scale Neural Patch Synthesis The network architecture for structured content prediction. Unlike the L2 loss architecture, replace all ReLU/ReLU leaky layers with the ELU layer and adopted fully-connected layers instead of channel-wise fully-connected layers. The ELU unit makes the regression network training more stable than the ReLU leaky layers as it can handle large negative responses during the training process.
  • 63. Semantic Image Inpainting with Deep Generative Models • It generates the missing content by conditioning on the available data. • Given a trained generative model, search for the closest encoding of the corrupted image in the latent image manifold using context and prior losses. • This encoding is then passed through the generative model to infer the missing content. • Inference is possible irrespective of how the missing content is structured, while the SoA learning based method requires specific info. about the holes in the training phase. • It successfully predicts info. in large missing regions and achieves pixel-level photorealism.
  • 64. Semantic Image Inpainting with Deep Generative Models Framework for inpainting. (a) Given a GAN model trained on real images, iteratively update z to find the closest mapping on the latent image manifold, based on the designed loss functions. (b) Manifold traversing when iteratively updating z using BP. z (0) is random initialed; z (k) denotes the result in k-th iteration; and zˆ the final solution.
  • 65. Semantic Image Inpainting with Deep Generative Models CE: Contextual Encoder, GAN: Generative Adversarial Network
  • 66. Globally and Locally Consistent Image Completion • With a FCN, complete images of arbitrary resolutions by filling- in missing regions of any shape. • To train this image completion network to be consistent, use global and local context discriminators that are trained to distinguish real images from completed ones. • The global discriminator looks at the entire image to assess if it is coherent as a whole, while the local discriminator looks only at a small area centered at the completed region to ensure the local consistency of the generated patches. • The network is trained to fool both context discriminator networks, which requires it to generate images that are indistinguishable from real ones with regard to overall consistency as well as in details.
  • 67. Globally and Locally Consistent Image Completion Overview of learning image completion. It consists of a completion network and two auxiliary context discriminator networks that are used only for training the completion network. The global discriminator network takes the entire image as input, while the local discriminator network takes only a small region around the completed area as input. Both discriminator networks are trained to determine if an image is real or completed by the completion network, while the completion network is trained to fool both discriminator networks.
  • 68. Globally and Locally Consistent Image Completion Architecture of the image completion network. Architectures of the discriminators used in the model.
  • 69. Globally and Locally Consistent Image Completion
  • 70. Generative Face Completion • Face completion using a deep generative model, a more challenging problem; • To generate semantically new pixels for the missing key components (e.g., eyes and mouths) that contain large appearance variations. • Directly generates contents for missing regions based on a neural network. • The model is trained with a combination of a reconstruction loss, two adversarial losses and a semantic parsing loss, which ensures pixel faithfulness and local-global contents consistency.
  • 71. Generative Face Completion Network architecture. It consists of one generator, two discriminators and a parsing network. The generator takes the masked image as input and outputs the generated image. Two discriminators are learned to distinguish the synthesize contents in the mask and whole generated image as real and fake. The parsing network, which is a pretrained model and remains fixed, is to further ensure the new generated contents more photo- realistic and encourage consistency between new and old pixels. Note that only the generator is needed during the testing.
  • 72. Convolutional Neural Pyramid for Image Processing • A principled convolutional neural pyramid (CNP) framework for general low-level vision and image processing tasks. • The pyramid structure can greatly enlarge the field while not sacrificing computation efficiency. • Adaptive network depth and progressive upsampling for quasi-real-time testing on VGA-size input. • A broad set of applications, i.e. depth/RGB image restoration, completion, noise/artifact removal, edge refinement, image filtering, enhancement and colorization.
  • 73. Convolutional Neural Pyramid for Image Processing Illustration of convol. neural pyramid. (a) shows the convol. pyramid structure. (b) and (c) are the feature extraction and mapping components respectively. Conv(x, y) denotes the convolution operation, where x is the kernel size and y is the number of output.
  • 74. Convolutional Neural Pyramid for Image Processing
  • 75. Learning Recursive Filters for Low-Level Vision via a Hybrid Neural Network • Low-level vision problems (e.g., edge-preserving filtering and denoising) as recursive image filtering via a hybrid neural network. • The network contains several spatially variant RNNs as equivalents of a group of distinct recursive filters for each pixel, and a deep CNN that learns the weights of RNNs. • The deep CNN can learn regulations of recurrent propagation for various tasks and effectively guides recurrent propagation over an entire image. • The model does not need a large number of convolutional channels nor big kernels to learn features for low-level vision filters. • It is significantly smaller and faster in comparison with a deep CNN based image filter.
  • 76. Learning Recursive Filters for Low-Level Vision via a Hybrid Neural Network An illustrative example of the Hybrid NN for edge-preserving image smoothing with a single RNN.
  • 77. Learning Recursive Filters for Low-Level Vision via a Hybrid Neural Network The hybrid network that contains a group of RNNs to filter/restore an image and a deep CNN to learn to propagate the RNNs. The process of filtering/restoration is carried out via RNNs with two inputs and one output result. Both parts are trained jointly in an end-to-end fashion.
  • 78. Colorful Image Colorization • A fully automatic approach that produces vibrant and realistic colorizations. • A classification task and use class-rebalancing at training time to increase the diversity of colors in the result. • Colorization can be a powerful pretext task for self-supervised feature learning, acting as a cross-channel encoder. The network architecture. Each conv layer refers to a block of 2 or 3 repeated conv and ReLU layers, followed by a BatchNorm layer. The net has no pool layers. All changes in resolution are achieved through spatial downsampling or upsampling btw conv blocks.
  • 79. Colorful Image Colorization Classification loss with rebalancing produces more accurate and vibrant results than a regression loss or a classification loss without rebalancing.
  • 81. Deep Learning • Representation learning attempts to automatically learn good features or representations; • Deep learning algorithms attempt to learn multiple levels of representation of increasing complexity/abstraction (intermediate and high level features); • Become effective via unsupervised pre-training + supervised fine tuning; – Deep networks trained with back propagation (without unsupervised pre- training) perform worse than shallow networks. • Deal with the curse of dimensionality (smoothing & sparsity) and over- fitting (unsupervised, regularizer); • Semi-supervised: structure of manifold assumption; – labeled data is scarce and unlabeled data is abundant.
  • 82. Why Deep Learning? • Supervised training of deep models (e.g. many-layered Nets) is too hard (optimization problem); • Learn prior from unlabeled data; • Shallow models are not for learning high-level abstractions; • Ensembles or forests do not learn features first; • Graphical models could be deep net, but mostly not. • Unsupervised learning could be “local-learning”; • Resemble boosting with each layer being like a weak learner • Learning is weak in directed graphical models with many hidden variables; • Sparsity and regularizer. • Traditional unsupervised learning methods aren’t easy to learn multiple levels of representation. • Layer-wised unsupervised learning is the solution. • Multi-task learning (transfer learning and self taught learning); • Other issues: scalability & parallelism with the burden from big data.
  • 83. Multi Layer Neural Network • A neural network = running several logistic regressions at the same time; – Neuron=logistic regression or… • Calculate error derivatives (gradients) to refine: back propagate the error derivative through model (the chain rule) – Online learning: stochastic/incremental gradient descent; – Batch learning: conjugate gradient descent.
  • 84. Convolutional Neural Networks (CNN) • CNN is a special kind of multi-layer NNs applied to 2-d arrays (usually images), based on spatially localized neural input; – local receptive fields(shifted window), shared weights (weight averaging) across the hidden units, and often, spatial or temporal sub-sampling; – Related to generative MRF/discriminative CRF: • CNN=Field of Experts MRF=ML inference in CRF; – Generate ‘patterns of patterns’ for pattern recognition. • Each layer combines (merge, smooth) patches from previous layers – Pooling /Sampling (e.g., max or average) filter: compress and smooth the data. – Convolution filters: (translation invariance) unsupervised; – Local contrast normalization: increase sparsity, improve optimization/invariance. C layers convolutions, S layers pool/sample
  • 85. Convolutional Neural Networks (CNN) • Convolutional Networks are trainable multistage architectures composed of multiple stages; • Input and output of each stage are sets of arrays called feature maps; • At output, each feature map represents a particular feature extracted at all locations on input; • Each stage is composed of: a filter bank layer, a non-linearity layer, and a feature pooling layer; • A ConvNet is composed of 1, 2 or 3 such 3-layer stages, followed by a classification module; – A fully connected layer: softmax transfer function for posterior distribution. • Filter: A trainable filter (kernel) in filter bank connects input feature map to output feature map; • Nonlinearity: a pointwise sigmoid tanh() or a rectified sigmoid abs(gi•tanh()) function; – In rectified function, gi is a trainable gain parameter, might be followed a contrast normalization N; • Feature pooling: treats each feature map separately -> a reduced-resolution output feature map; • Supervised training is performed using a form of SGD to minimize the prediction error; – Gradients are computed with the back-propagation method. • Unsupervised pre-training: predictive sparse decomposition (PSD), then supervised fine- tuning. * is discrete convolution operator
  • 86.
  • 87. LeNet (LeNet-5) • A layered model composed of convolution and subsampling operations followed by a holistic representation and ultimately a classifier for handwritten digits; • Local receptive fields (5x5) with local connections; • Output via a RBF function, one for each class, with 84 inputs each; • Learning by Graph Transformer Networks (GTN);
  • 88. AlexNet • A layered model composed of convol., subsample., followed by a holistic representation and all-in-all a landmark classifier; • Consists of 5 convolutional layers, some of which followed by max-pooling layers, 3 fully-connected layers with a final 1000- way softmax; • Fully-connected “FULL” layers: linear classifiers/matrix multiplications; • ReLU are rectified-linear nonlinearities on layer output, can be trained several times faster; • Local normalization scheme aids generalization; • Overlapping pooling slightly less prone to overfitting; • Data augmentation: artificially enlarge the dataset using label- preserving transformations; • Dropout: setting to zero output of each hidden neuron with prob. 0.5; • Trained by SGD with batch # 128, momentum 0.9, weight decay 0.0005.
  • 89. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264-4096–4096–1000. AlexNet
  • 90. MattNet • Matthew Zeiler from the startup company “Clarifai”, winner of ImageNet Classification in 2013; • Preprocessing: subtracting a per-pixel mean; • Data augmentation: downsampled to 256 pixels and a random 224 pixel crop is taken out of the image and randomly flipped horizontally to provide more views of each example; • SGD with min-batch # 128, learning rate annealing, momentum 0.9 and dropout to prevent overfitting; • 65M parameters trained for 12 days on a single Nvidia GPU; • Visualization by layered DeconvNets: project the feature activations back to the input pixel space; – Reveal input stimuli exciting individual feature maps at any layer; – Observe evolution of features during training; – Sensitivity analysis of the classifier output by occluding portions to reveal which parts of scenes are important; • DeconvNet attached to each of ConvNet layer, unpooling uses locations of maxima to preserve structure; • Multiple such models were averaged together to further boost performance; • Supervised pre-training with AlexNet, then modify it to get better performance (error rate 14.8%).
  • 91. Architecture of an eight layer ConvNet model. Input: 224 by 224 crop of an image (with 3 color planes). # 1-5 layers Convolution: 96 filters, 7x7, stride of 2 in both x and y. Feature maps: (i) via a rectified linear function, (ii) 3x3 max pooled (stride 2), (iii) contrast normalized 55x55 feature maps. # 6-7 layers: fully connected, input in vector form (6x6x256 = 9216 dimensions). The final layer: a C-way softmax function, C - number of classes. MattNet
  • 92. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct approximate version of convnet features from the layer beneath. Bottom: Unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet. MattNet
  • 93. Deep Belief Networks • A hybrid model: can be trained as generative or discriminative model; • Deep architecture: multiple layers (learn features layer by layer); • Multi layer learning is difficult in sigmoid belief networks. • Top two layers are undirected connections, Restricted Boltzmann Machine (RBM); • Lower layers get top down directed connections from layers above; • Unsupervised or self-taught pre-learning provides a good initialization; • Greedy layer-wise unsupervised training for RBM; • Supervised fine-tuning • Generative: wake-sleep algorithm (Up-down); • Discriminative: back propagation (bottom-up); Belief net is directed acyclic graph composed of stochastic variables.
  • 94. Deep Boltzmann Machine • Boltzmann machine is a stochastic recurrent model, and RBM is its special case (one hidden layer); • Learning internal representations that become increasingly complex; • High-level representations built from a large supply of unlabeled inputs; • Pre-training: learning a stack of modified RBMs, which are composed to create a deep Boltzmann machine (undirected graph); • Generative fine-tuning: different from DBN • Positive and negative phase • Discriminative fine-tuning: the same to DBN • Back propagation.
  • 95. Stacked Denoising Auto-Encoder • Denoising Auto-Encoder: Multilayer NNs with target output=input; • Auto-encoder learns the salient variation like a nonlinear PCA; • Stack many (may be sparse) auto-encoders in succession and train them using greedy layer-wise unsupervised learning • Drop the decode layer each time • Performs better than stacking RBMs; • Supervised training on the last layer using final features; • (option) Supervised training on the entire network to fine- tune all weights of the neural net; • Empirically not quite as accurate as DBNs.
  • 96. Stochastic Gradient Descent (SGD) • The general class of estimators that arise as minimizers of sums are called M-estimators; • Where are stationary points of the likelihood function (or zeroes of its derivative, the score function)? • Online gradient descent samples a subset of summand functions at every step; • The true gradient of is approximated by a gradient at a single example; • Shuffling of training set at each pass. • There is a compromise between two forms, often called "mini-batches", where the true gradient is approximated by a sum over a small number of training examples. • STD converges almost surely to a global minimum when the objective function is convex or pseudo-convex, and otherwise converges almost surely to a local minimum.
  • 97. Back Propagation • Minimize a per-sample loss E(f(x0,w), y0), e.g. the negative log-likelihood of the correct label, E(f(x0,w), y0) = −log f(x0,w)y0, by propagating error gradients backward through the layers.
  • 98. Loss function • Euclidean loss is used for regressing to real-valued labels in [-inf, inf]; • Sigmoid cross-entropy loss is used for predicting K independent probability values in [0,1]; • Softmax (normalized exponential) loss is used for predicting a single class out of K mutually exclusive classes; – A generalization of the logistic function that "squashes" a K-dimensional vector of arbitrary real values z to a K-dimensional vector of real values σ(z) in the range (0, 1). – The predicted probability for the j-th class given a sample vector x is σ(z)j = exp(zj) / Σk exp(zk). • Sigmoidal or softmax normalization is a way of reducing the influence of extreme values or outliers in the data without removing them from the dataset.
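A small NumPy sketch of the three losses listed above, assuming plain 1-D score/target arrays; the max-shift in the softmax and the clipping in the sigmoid loss are standard numerical-stability details rather than something stated on the slide.

```python
import numpy as np

def softmax(z):
    """sigma(z)_j = exp(z_j) / sum_k exp(z_k), shifted by max(z) for stability."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

def softmax_loss(z, label):
    """Negative log-probability of the true class (multinomial / softmax loss)."""
    return -np.log(softmax(z)[label])

def sigmoid_cross_entropy(z, targets):
    """K independent probabilities in [0,1]: sum of per-unit binary cross-entropies."""
    p = 1.0 / (1.0 + np.exp(-z))
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.sum(targets * np.log(p) + (1 - targets) * np.log(1 - p))

def euclidean_loss(pred, target):
    """Squared error for real-valued regression targets."""
    return 0.5 * np.sum((pred - target) ** 2)
```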
  • 99. Variable Learning Rate • Too large a learning rate – causes oscillation in the search for the minimum • Too small a learning rate – too slow convergence to the minimum • Adaptive learning rate – At the beginning, the learning rate can be large when the current point is far from the optimum; – Gradually, the learning rate decays as time goes by. • Should not be too large or too small: – annealing rate α(t) = α(0)/(1 + t/T) – α(t) will eventually go to zero, but at the beginning it is almost a constant.
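A one-function sketch of the annealing schedule α(t) = α(0)/(1 + t/T) quoted above; the default values of α(0) and T are arbitrary.

```python
def annealed_lr(t, lr0=0.1, T=1000.0):
    """alpha(t) = alpha(0) / (1 + t/T): nearly constant for t << T, decays ~1/t later."""
    return lr0 / (1.0 + t / T)

# example: almost flat early on, then decaying
print([round(annealed_lr(t), 4) for t in (0, 100, 1000, 10000)])
```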
  • 102. Data Augmentation for Overfitting • The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations; • Perturb an image I by transformations that leave the underlying class unchanged (e.g. cropping and flipping) in order to generate additional examples of the class; • Two distinct forms of data augmentation: – image translations and horizontal reflections – altering the intensities of the RGB channels
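A minimal sketch of the crop-and-flip augmentation described above, assuming an H×W×3 NumPy array (e.g. a 256×256 image cropped to 224×224); the function name and defaults are illustrative.

```python
import numpy as np

def augment(img, crop=224, rng=np.random.default_rng()):
    """Label-preserving augmentation: random crop plus random horizontal flip.
    `img` is assumed to be an HxWx3 array with H, W >= crop."""
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]        # horizontal reflection
    return patch
```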
  • 103. Weight Decay for Overfitting • Weight decay or L2 regularization adds a penalty term to the error function, called the regularization term: the negative log prior in the Bayesian justification, – Weight decay works as rescaling weights in the learning rule, but bias learning stays the same; – Prefers to learn small weights; large weights are allowed only if they improve the original cost function; – A way of compromising between finding small weights and minimizing the original cost function; • In a linear model, weight decay is equivalent to ridge (Tikhonov) regression; • L1 regularization: weights that are not really useful shrink by a constant amount toward zero; – Acts like a form of feature selection; – Makes the input filters cleaner and easier to interpret; • L2 regularization penalizes large values strongly, while L1 regularization penalizes all weights at a constant rate and drives many of them exactly to zero; • Markov Chain Monte Carlo (MCMC): simulate a Markov chain whose equilibrium distribution is the posterior distribution of the weights & hyper-parameters; • Hybrid Monte Carlo: gradient and sampling.
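A sketch of how L2 weight decay enters a plain SGD update, rescaling the weights while leaving the bias update untouched; the decay coefficient and function name are illustrative.

```python
def sgd_step_with_weight_decay(w, b, grad_w, grad_b, lr=0.01, wd=1e-4):
    """L2 weight decay: equivalent to shrinking w by (1 - lr*wd) before the
    usual gradient step; the bias update is unchanged."""
    w = (1.0 - lr * wd) * w - lr * grad_w
    b = b - lr * grad_b
    return w, b
```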
  • 104. Early Stopping for Overfitting • Steps in early stopping: – Divide the available data into training and validation sets. – Use a large number of hidden units. – Use very small random initial values. – Use a slow learning rate. – Compute the validation error rate periodically during training. – Stop training when the validation error rate "starts to go up". • Early stopping has several advantages: – It is fast. – It can be applied successfully to networks in which the number of weights far exceeds the sample size. – It requires only one major decision by the user: what proportion of validation cases to use. • Practical issues in early stopping: – How many cases do you assign to the training and validation sets? – Do you split the data into training and validation sets randomly or by some systematic algorithm? – How do you tell when the validation error rate "starts to go up"?
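A sketch of the early-stopping loop outlined above, with hypothetical `train_one_epoch`, `validation_error` and `model.copy()` placeholders standing in for a real training framework; the patience criterion is one common way to decide that the validation error "starts to go up".

```python
def train_with_early_stopping(model, train_one_epoch, validation_error,
                              max_epochs=200, patience=10):
    """Stop once the validation error has not improved for `patience` epochs;
    `train_one_epoch` and `validation_error` are placeholder callables."""
    best_err, best_state, wait = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_state, wait = err, model.copy(), 0   # assumes model.copy() exists
        else:
            wait += 1
            if wait >= patience:          # validation error "starts to go up"
                break
    return best_state, best_err
```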
  • 105. Dropout and Maxout for Overfitting • Dropout: set the output of each hidden neuron to zero with probability 0.5. – Motivation: Combining many different models that share parameters succeeds in reducing test errors by approximately averaging together the predictions, which resembles bagging. – The units which are “dropped out” in this way do not contribute to the forward pass and do not participate in back propagation. – So every time an input is presented, the NN samples a different architecture, but all these architectures share weights. – This technique reduces complex co-adaptations of units, since a neuron cannot rely on the presence of particular other units. – It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other units. – Without dropout, the network exhibits substantial overfitting. – Dropout roughly doubles the number of iterations required to converge. • Maxout takes the maximum across multiple feature maps;
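A sketch of (inverted) dropout in the forward pass, keeping the slide's drop probability of 0.5 as the default; rescaling the surviving units at training time is an implementation choice that leaves the test-time pass unchanged.

```python
import numpy as np

def dropout_forward(h, p_drop=0.5, train=True, rng=np.random.default_rng()):
    """Inverted dropout: zero each unit with probability p_drop during training
    and rescale the survivors, so the test-time forward pass needs no change."""
    if not train or p_drop == 0.0:
        return h
    mask = (rng.random(h.shape) >= p_drop) / (1.0 - p_drop)
    return h * mask
```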
  • 106. MCMC Sampling for Optimization • Markov Chain: a stochastic process in which future states are independent of past states given the present state. – A Markov chain will typically converge to a stable distribution. • Markov Chain Monte Carlo: sampling using ‘local’ information – Devise a Markov chain whose stationary distribution is the target. • An ergodic MC must be aperiodic, irreducible, and positive recurrent. – Monte Carlo integration to get quantities of interest. • Metropolis-Hastings method: sampling from a target distribution – Create a Markov chain whose transition matrix does not depend on the normalization term. – Make sure the chain has a stationary distribution and that it equals the target distribution (acceptance ratio). – After a sufficient number of iterations, the chain will converge to the stationary distribution. • Gibbs sampling is a special case of M-H sampling. – The Hammersley-Clifford Theorem: get the joint distribution from the complete conditional distributions. • Hybrid Monte Carlo: a gradient sub-step within each Markov chain step.
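A minimal random-walk Metropolis-Hastings sampler, illustrating that only an unnormalized (log-)density is needed in the acceptance ratio; the standard-normal toy target and step size are illustrative.

```python
import numpy as np

def metropolis_hastings(log_p, x0=0.0, steps=10000, step_size=1.0, seed=0):
    """Random-walk M-H: propose x' ~ N(x, step_size^2) and accept with
    probability min(1, p(x')/p(x)); the normalization constant cancels."""
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(steps):
        x_new = x + step_size * rng.normal()
        if np.log(rng.random()) < log_p(x_new) - log_p(x):
            x = x_new
        samples.append(x)
    return np.array(samples)

# toy target: unnormalized standard normal
samples = metropolis_hastings(lambda x: -0.5 * x * x)
print(samples.mean(), samples.std())   # roughly 0 and 1
```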
  • 107. Mean Field for Optimization • Variational approximation modifies the optimization problem to make it tractable, at the price of an approximate solution; • Mean Field replaces M with a (simple) subset M(F), on which A*(μ) has a closed form (Note: F is a disconnected subgraph); – The density becomes a factorized product distribution in this sub-family. – Objective: K-L divergence. • Mean field is a structured variational approximation approach: – Coordinate ascent (deterministic); • Compared with stochastic approximation (sampling): – Faster, but maybe not exact.
  • 108. Contrastive Divergence for RBMs • Contrastive divergence (CD) was first proposed for training PoE (products of experts), and is also a quicker way to learn RBMs; – Contrastive divergence as the new objective; – Take gradients and ignore a term which is usually very small. • Steps: – Start with a training vector on the visible units. – Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. • Can be applied using any MCMC algorithm to simulate the model (not limited to just Gibbs sampling); • CD learning is biased: it does not exactly follow the gradient of the likelihood • Improved: Persistent CD explores more modes of the distribution – Rather than restarting from data samples, begin sampling from the negative (model) samples obtained at the last gradient update. – Still suffers from divergence of the likelihood due to missing modes. • Score matching: the score function does not depend on the normalization factor, so match it between the model and the empirical density.
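A sketch of one CD-1 update for a binary RBM, following the steps above (positive phase from the data, a single Gibbs step for the negative phase); array shapes, the learning rate and the use of probabilities for the reconstruction are implementation choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b_v, b_h, v0, lr=0.01, rng=np.random.default_rng()):
    """One CD-1 step for a binary RBM; v0 is a batch of visible vectors (N x n_v),
    W is n_v x n_h, b_v and b_h are the visible/hidden biases."""
    # positive phase: hidden probabilities and samples given the data
    ph0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # one Gibbs step: reconstruct the visibles, then hidden probabilities again
    pv1 = sigmoid(h0 @ W.T + b_v)
    ph1 = sigmoid(pv1 @ W + b_h)
    # approximate gradient: <v h>_data - <v h>_reconstruction
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
    b_v += lr * (v0 - pv1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)
    return W, b_v, b_h
```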
  • 109. “Wake-Sleep” Algorithm for DBN • A pre-trained DBN is a generative model; • Do a stochastic bottom-up pass (wake phase) – Get samples from the factorial distribution (visible first, then generate hidden); – Adjust the top-down weights to be good at reconstructing the feature activities in the layer below. • Do a few iterations of sampling in the top-level RBM – Adjust the weights of the top-level RBM. • Do a stochastic top-down pass (sleep phase) – Get visible and hidden samples generated by the generative model (“data coming from nowhere”); – Adjust the bottom-up weights to be good at reconstructing the feature activities in the layer above. – Any guarantee of improvement? No! • The “Wake-Sleep” algorithm tries to make the representation economical, in the sense of Shannon’s coding theory.
  • 110. Greedy Layer-Wise Training • Deep networks tend to have more local-minima problems than shallow networks during supervised training • Train the first layer using unlabeled data – Supervised or semi-supervised: use more unlabeled data. • Freeze the first-layer parameters and train the second layer • Repeat this for as many layers as desired – Build more robust features • Use the outputs of the final layer to train the last supervised layer (leave early weights frozen) • Fine-tune the full network with a supervised approach; • Avoids the problems of training a deep net in a purely supervised fashion. – Each layer gets full learning – Helps with ineffective early-layer learning – Helps with deep-network local minima
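A sketch of the greedy layer-wise procedure above for a stack of auto-encoders; `make_autoencoder` and its `fit`/`encode` methods are hypothetical placeholders for whatever unsupervised module (RBM, denoising auto-encoder, etc.) is used per layer.

```python
def greedy_layerwise_pretrain(layer_sizes, X, make_autoencoder, epochs=10):
    """Train one (possibly sparse) auto-encoder per layer on the previous layer's
    codes, freezing earlier layers; `make_autoencoder(n_in, n_hid)` is a
    placeholder factory returning an object with fit() and encode()."""
    encoders, codes = [], X
    for n_in, n_hid in zip(layer_sizes[:-1], layer_sizes[1:]):
        ae = make_autoencoder(n_in, n_hid)
        ae.fit(codes, epochs=epochs)      # unsupervised: target output = input
        codes = ae.encode(codes)          # drop the decoder, keep the encoder
        encoders.append(ae)
    return encoders, codes                # `codes` feed the final supervised layer
```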
  • 111. Why Does Greedy Layer-Wise Training Work? • Takes advantage of the unlabeled data; • Regularization Hypothesis – Pre-training “constrains” the parameters to a region relevant to the unsupervised dataset; – Better generalization (representations that better describe unlabeled data are more discriminative for labeled data); • Optimization Hypothesis – Unsupervised training initializes the lower-level parameters near localities of better minima than random initialization can. • Only fine-tuning is needed in the supervised learning stage.
  • 112. Generative Modeling • Have training examples x ~ pdata(x) • Want a model that draws samples: x ~ pmodel(x) • Where pmodel ≈ pdata • Conditional generative models – Speech synthesis: Text ⇒ Speech – Machine Translation: French ⇒ English • French: Si mon tonton tond ton tonton, ton tonton sera tondu. • English: If my uncle shaves your uncle, your uncle will be shaved. – Image ⇒ Image segmentation • Environment simulator – Reinforcement learning – Planning • Leverage unlabeled data
  • 113. Adversarial Nets Framework • A game between two players: – 1. Discriminator D – 2. Generator G • D tries to discriminate between: – A sample from the data distribution. – And a sample from the generator G. • G tries to “trick” D by generating samples that are hard for D to distinguish from data.
  • 114. GANs • A framework for estimating generative models via an adversarial process, which trains 2 models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. • The training procedure for G is to maximize the probability of D making a mistake. • This framework corresponds to a minimax two-player game (the value function is written out below): – In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere; – In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with back propagation. – There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples.
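For reference, the value function of this minimax game from the original GAN formulation:

```latex
\min_G \max_D V(D,G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```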
  • 115.–117. GANs (figure-only slides)
  • 118. GANs Rightmost column shows the nearest training example of the neighboring sample, in order to demonstrate that the model has not memorized the training set. Samples are fair random draws, not cherry-picked. Unlike most other visualizations of deep generative models, these images show actual samples from the model distributions, not conditional means given samples of hidden units. Moreover, these samples are uncorrelated because the sampling process does not depend on Markov chain mixing. a) MNIST b) TFD c) CIFAR-10 (fully connected model) d) CIFAR-10 (convolutional discriminator and “deconvolutional” generator).
  • 119. How to Train a GAN? Tips and Tricks • 1: Normalize the inputs • 2: A modified loss function • 3: Use a spherical Z (Gaussian rather than uniform distribution) • 4: Batch Norm • 5: Avoid sparse gradients: – ReLU, MaxPool • 6: Use soft and noisy labels • 7: DCGAN / Hybrid Models – KL + GAN or VAE + GAN • 8: Use stability tricks from RL • 9: Use the ADAM optimizer for the generator (SGD for the discriminator) • 10: Track failures early – check norms of gradients • 11: Don’t balance the losses via statistics (unless you have a good reason to) • 12: If you have labels, use them – Auxiliary GANs • 13: Add noise to inputs, decayed over time • 14: [not sure] Train the discriminator more (sometimes), especially when there is noise • 15: [not sure] Batch discrimination • 16: Discrete variables in C-GANs • 17: Dropout in G in both train and test stages
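A small sketch of tip 6 (soft and noisy labels) for the discriminator targets; the smoothing ranges and flip probability are illustrative values, not prescribed by the list above.

```python
import numpy as np

def soft_noisy_labels(batch_size, real=True, flip_p=0.05,
                      rng=np.random.default_rng()):
    """Tip 6: replace hard 1/0 discriminator targets with values near 1 or 0,
    and occasionally flip a label to keep the discriminator from saturating."""
    if real:
        labels = rng.uniform(0.8, 1.0, size=batch_size)   # soft "real"
    else:
        labels = rng.uniform(0.0, 0.2, size=batch_size)   # soft "fake"
    flip = rng.random(batch_size) < flip_p                # occasional label noise
    labels[flip] = 1.0 - labels[flip]
    return labels
```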
  • 120. Improved Techniques for Training GANs • For semi-supervised learning and for generating images that humans find visually realistic; • Techniques that are heuristically motivated to encourage convergence: – Feature matching addresses the instability of GANs by specifying a new objective for the generator that prevents it from over-training on the current discriminator (see the sketch below); – Allow the discriminator to look at multiple data examples in combination and perform what is called “minibatch discrimination”: any discriminator model that looks at multiple examples in combination, rather than in isolation, could potentially help avoid collapse of the generator; – Historical averaging: the historical average of the parameters can be updated in an online fashion, so this learning rule scales well to long time series; – One-sided label smoothing: reduces the vulnerability of NNs to adversarial examples; – Virtual batch normalization: each example x is normalized based on the statistics collected from a reference batch of examples that are chosen once and fixed at the start of training, and on x itself (only in the generator network, because it is computationally expensive).
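A sketch of the feature-matching idea: make the generator match the mean of an intermediate discriminator activation on real versus generated batches; `features` is a hypothetical callable standing in for that intermediate layer.

```python
import numpy as np

def feature_matching_loss(features, real_batch, fake_batch):
    """Feature matching: || E_x f(x) - E_z f(G(z)) ||^2 on an intermediate
    discriminator activation f; `features` is a placeholder callable that
    maps a batch of inputs to a batch of feature vectors."""
    f_real = np.mean(features(real_batch), axis=0)
    f_fake = np.mean(features(fake_batch), axis=0)
    return np.sum((f_real - f_fake) ** 2)
```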