2. Overview
■ Proposes the Learnable Gated Temporal Shift Module (LGTSM)
  • A learnable shift module
  • Performs temporal modeling with 2D convolutions
■ Accuracy on par with 3D models
  • With only 33% of the parameters and computation
■ Proposes TSMGAN
  • Substantially improves model performance on free-form video inpainting
Figure 1: Our model takes videos with free-form masks (first row) and fills in the missing
areas with proposed LGTSM to generate realistic completed results (second row) compared
to the original videos (third row). It could be applied to video editing tasks such as video
object removal, as shown in the first two columns. Best viewed in color and zoom-in. See
corresponding videos in the following links: object removal, free-form masks, and faces.
3. Image Inpainting
nA"2%&'(O3&P3@.2%3& Q!.R+(S))789:;T
• マスクされた領域
• インペインティングされた領域
• マスクされてない領域
• これらの差を学習する
nUB'-#)3&&-O2(QN"V->%R+(S))789:;T
• エッジを生成
• エッジを条件として
画像を復元
In a vanilla convolutional layer, each output is

O_{y,x} = ∑_{i=−k′_h}^{k′_h} ∑_{j=−k′_w}^{k′_w} W_{k′_h+i, k′_w+j} · I_{y+i, x+j}

where y, x index the output map, k_h and k_w are the kernel size (e.g. 3 × 3), k′_h = (k_h − 1)/2, k′_w = (k_w − 1)/2, W ∈ R^{k_h × k_w × C′ × C} represents the convolutional filters, and I_{y+i, x+j} ∈ R^C and O_{y,x} ∈ R^{C′} are inputs and outputs. For simplicity, the bias in convolution is ignored.

The equation shows that for all spatial locations (y, x), the same filters are applied to produce the output in vanilla convolutional layers. This makes sense for tasks such as image classification and object detection, where all pixels of the input image are valid, to extract local features in a sliding-window fashion. However, for image inpainting, the input is composed of both regions with valid pixels/features outside the holes and invalid pixels/features (in shallow layers) or synthesized pixels/features (in deep layers) in masked regions. This causes ambiguity during training and leads to visual artifacts such as color discrepancy, blurriness and obvious edge responses during testing, as reported in [23].

To address this, partial convolution is proposed [23], which adopts a masking and re-normalization step to make the convolution depend only on valid pixels.
[Figure: partial convolution updates a binary mask with fixed rules, while gated convolution learns soft gating from the features themselves.]
Figure 2 (Gated Convolution): Illustration of partial convolution (left) and gated convolution (right).
Whereas partial convolution uses a binary mask updated with rules, gated convolutions learn a soft mask automatically from data. It is formulated as:

Gating_{y,x} = ∑∑ W_g · I
Feature_{y,x} = ∑∑ W_f · I
O_{y,x} = φ(Feature_{y,x}) ⊙ σ(Gating_{y,x})

where σ is the sigmoid function (so gating values lie in [0, 1]) and φ can be any activation function.
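As a concrete illustration, here is a minimal PyTorch sketch of a gated convolution layer in this spirit; the layer names and the LeakyReLU choice for φ are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        # One convolution produces features, a parallel one produces the gate.
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, dilation)
        self.gating = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, dilation)
        self.activation = nn.LeakyReLU(0.2)

    def forward(self, x):
        # O = phi(Feature) * sigmoid(Gating): the gate is a soft, learned mask
        # in [0, 1], replacing the rule-based binary mask of partial convolution.
        return self.activation(self.feature(x)) * torch.sigmoid(self.gating(x))
```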
Figure 2 (EdgeConnect): Summary of the proposed method. The incomplete grayscale image, edge map, and mask are the inputs of G1, which predicts the full edge map. The predicted edge map and the incomplete color image are then passed to G2 to perform the inpainting task.
4. Video Inpainting
■ CombCN [Wang+, AAAI 2019]
  • Uses a 3D CNN to generate temporally consistent video
  • Refines the result with a 2D CNN
■ Proposed method
  • Handles temporal information in an integrated way using 2D convolutions
  • Proposes LGTSM
Figure 2 (CombCN): Network architecture of our 3D completion network (3DCN) and 3D-2D combined completion network (CombCN).
The 3DCN works in low resolution, producing an inpainted video as output. Its individual frames are further convolved and
added into the first and last layer of the same size in CombCN. The input video for 3DCN and the input frame for CombCN,
shown as gray blocks, are in 4-channel format, containing RGB and the mask indicating the holes to be filled.
[The network] produces a complete video V_out as output. The incomplete video is represented as an F × H × W volume, where F, H, W are the number of frames, height and width of V_in. M and V_out are of the same size as V_in. In the training stage, V_in is obtained by randomly generating holes in a complete video V_c, i.e. the target ground truth of the inpainting task.

To produce a spatial-temporally coherent inpainted video, our network consists of two sub-networks. One is a 3D completion network (3DCN), which utilizes a 3D CNN to infer the temporal structure globally from a down-sampled version (F × H/r × W/r) of V_in and M. In this paper, we use the superscript d to distinguish the downsampled videos from their original versions. So 3DCN takes V^d_in and M^d as input and produces V^d_out as output. The other sub-network is a 3D-2D combined completion network (CombCN). It applies a 2D CNN to perform completion frame by frame. The inputs of this network include one incomplete but high-resolution frame of V_in with its mask frame from M, and a low-resolution but complete frame I^{d,k}_out of V^d_out (k = 1, 2, ..., F). In this paper, F, H, W, r are set to 32, 128, 128 and 2 respectively.
Taking the incomplete video and its mask as input, 3DCN first exploits 4 strided convolutional layers to encode it into a latent space, capturing its temporal-spatial structure. Then 3 dilated convolutional layers with rates 2, 4, 8 follow, to capture spatial-temporal information with a larger receptive field. At last, video inpainting is achieved by 3 convolutional and 2 fractionally-strided convolutional layers in alternating order, yielding the result with the missing part filled. Rather than using max-pooling and upsampling layers to compute the feature maps, we employ 3 × 3 convolution kernels with stride 2, which ensures that every pixel contributes. Meanwhile, considering that non-successive frames may have loose relations with the current one, and to avoid information loss across frames, we limit stride and dilation to take effect only within a frame rather than across frames. As a result, the feature map of each layer has a constant frame number F. Besides, all the convolutional layers are followed by batch normalization (BN) and a ReLU non-linearity except the last one. Padding is used to make the input and output have exactly the same size. Skip-connections as in U-Net (Ronneberger, Fischer, and Brox 2015) are also employed.
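The "stride and dilation only within a frame" constraint maps directly onto 3D convolutions whose temporal stride and dilation are fixed to 1, so F is preserved at every layer. A minimal PyTorch sketch; the channel widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Down-sampling layer: halves H and W but keeps the frame count F,
# because the temporal stride is fixed to 1.
down = nn.Conv3d(64, 128, kernel_size=3, stride=(1, 2, 2), padding=1)

# Dilated layer: enlarges the spatial receptive field only; temporal
# dilation stays 1, so distant frames are not mixed.
dilated = nn.Conv3d(128, 128, kernel_size=3, dilation=(1, 2, 2), padding=(1, 2, 2))

x = torch.rand(1, 64, 32, 128, 128)   # (N, C, F, H, W)
print(down(x).shape)                   # torch.Size([1, 128, 32, 64, 64])
```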
We use kernel size 5 × 5 for the first convolutional layer, kernel size 4 × 4 with stride 2 for the down-sampling layers, kernel size 3 × 3 with dilation 2, 4, 8, 16 for the dilated layers, and kernel size 3 × 3 for the other convolutional layers in the generator. For the non-linearity layers we use LeakyReLU. We use nearest up-sampling followed by convolutions for the up-sampling layers. GTSM/LGTSM is added to all convolutional layers except the first one, since its input is the original RGB frames.
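For reference, the quoted configuration can be summarized as a hypothetical layer table; the number of down-/up-sampling layers and all channel counts are assumptions beyond what the text states:

```python
# Illustrative reading of the generator configuration quoted above.
generator_config = [
    dict(type="conv",    kernel=5, stride=1, lgtsm=False),  # first layer: raw RGB input
    dict(type="down",    kernel=4, stride=2, lgtsm=True),   # down-sampling (count assumed)
    dict(type="down",    kernel=4, stride=2, lgtsm=True),
    dict(type="dilated", kernel=3, dilation=2,  lgtsm=True),
    dict(type="dilated", kernel=3, dilation=4,  lgtsm=True),
    dict(type="dilated", kernel=3, dilation=8,  lgtsm=True),
    dict(type="dilated", kernel=3, dilation=16, lgtsm=True),
    dict(type="up",      kernel=3, upsample="nearest", lgtsm=True),  # nearest up-sample + conv
    dict(type="up",      kernel=3, upsample="nearest", lgtsm=True),
]
# Non-linearity: LeakyReLU throughout.
```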
1.2 Training Procedure
Please see the Setups (training and testing) section in the main paper. At the first stage of training, we found that applying only the l1 loss, perceptual loss and style loss enhances training speed and stability. After the first stage converges, we add the TSMGAN to fine-tune. We set λ_l1 = 1, λ_perc = 1, λ_style = 10 and λ_G = 1. We use the Adam optimizer [3] with learning rate 0.0001.
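A compact sketch of this two-stage schedule, assuming PyTorch; the tiny convolutions stand in for the real generator, discriminator, and loss terms, and the perceptual/style losses are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

gen = nn.Conv2d(4, 3, 3, padding=1)    # placeholder generator (RGB + mask in, RGB out)
disc = nn.Conv2d(3, 1, 3, padding=1)   # placeholder TSMGAN discriminator
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-4)   # lr = 0.0001, as in the paper
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)

x = torch.rand(8, 4, 64, 64)           # masked frames + mask channel
gt = torch.rand(8, 3, 64, 64)

# Stage 1: reconstruction losses only (plus perceptual and 10x style in the paper).
out = gen(x)
loss = F.l1_loss(out, gt)
opt_g.zero_grad(); loss.backward(); opt_g.step()

# Stage 2 (after stage 1 converges): alternate D and G updates with TSMGAN.
out = gen(x).detach()
loss_d = (F.relu(1 + disc(gt)) + F.relu(1 - disc(out))).mean()   # Eq. (7)
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

out = gen(x)
loss_g = F.l1_loss(out, gt) + disc(out).mean()                    # l1 + Eq. (8), λ_G = 1
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```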
5. LGTSM
■ With a naive shift, masked (invalid) regions may be shifted into neighboring frames
■ Learnable shifting kernels
[LGTSM Figure 2: (a) input features (T × C × H,W); (b) naive TSM shifts whole channel slices forward/backward in time; (c) the same shifts expressed as temporal convolutions with fixed kernels — backward shift = (0, 0, 1), forward shift = (1, 0, 0), with zero padding and truncation at the boundaries; (d) learnable shifting kernels.]
Figure 3: Module design. We integrate (a) Residual TSM [14] and (b) gated convolution [28] into (c) the Gated Temporal Shift Module (GTSM), and design learnable temporal shifting kernels (Fig. 2c) to make (d) the proposed Learnable Gated Temporal Shift Module (LGTSM).
Output_{t,x,y} = σ(Gating_{t,x,y}) ⊗ φ(Features_{t,x,y})    (3)

where σ is the sigmoid function that transforms the gating into values between 0 (invalid) and 1 (valid), and φ is the activation function for the convolution. Note that the TSM module
could be easily modified to online settings (without peeking future frames) for real-time
applications.
Note that the temporal shifting operation in TSM is equivalent to applying fixed forward/backward shifting kernels along the temporal dimension, as illustrated in Fig. 2(c).
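A minimal sketch of both shifting operations, assuming PyTorch: naive TSM is a fixed channel shift (equivalent to the temporal kernels (0, 0, 1) and (1, 0, 0)), while the learnable variant replaces them with a per-channel depthwise 1D convolution over time. This is our reading of Fig. 2; the fold ratio and all names are illustrative.

```python
import torch
import torch.nn as nn

def naive_temporal_shift(x, fold_div=8):
    """x: (N, T, C, H, W). Shift 1/fold_div of channels backward in time and
    another 1/fold_div forward, with zero padding at the sequence ends."""
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # pull features from the future
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # pull features from the past
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # remaining channels unshifted
    return out

class LearnableTemporalShift(nn.Module):
    """Replace the fixed shifting kernels with one learnable temporal kernel
    per channel (a depthwise 1D convolution over T)."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.shift = nn.Conv1d(channels, channels, kernel_size,
                               padding=kernel_size // 2, groups=channels, bias=False)

    def forward(self, x):                        # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        y = x.permute(0, 3, 4, 2, 1).reshape(n * h * w, c, t)
        y = self.shift(y)                        # learned mixing along time
        return y.reshape(n, h, w, c, t).permute(0, 4, 3, 1, 2)

# In LGTSM, the feature branch convolves the shifted features while the gating
# branch convolves the unshifted input; the two are fused as in Eq. (3).
```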
6. TSMGAN
■ TSMGAN
■ Learns to classify each spatial-temporal feature point as real or fake
■ High-level features
■ Improves the model efficiently by reducing the loss on high-level features
■ Overall model loss
TSMGAN loss. All the aforementioned loss functions are for images only and do not take temporal consistency into consideration. Therefore, we develop TSMGAN to learn temporal consistency. We set up a generative adversarial network (GAN) with the Gated Temporal Shift Module integrated in both the generator and discriminator, as stated in 3.1.

The TSMGAN discriminator is composed of six 2D convolutional layers with TSM. Also, we apply the recently proposed spectral normalization [16] to both the generator and discriminator as in [17] to enhance training stability. The TSMGAN loss L_D for the discriminator to tell whether the input video z is real or fake, and L_G for the generator to fool the discriminator, are defined as:

L_D = E_{x∼P_data(x)}[ReLU(1 + D(x))] + E_{z∼P_z(z)}[ReLU(1 − D(G(z)))]    (7)

L_G = E_{z∼P_z(z)}[D(G(z))]    (8)

As in [28], the kernel size is 5 × 5 with stride 2 × 2, and the shifting operation is applied to all convolutional layers of the TSMGAN discriminator, so the receptive field of each output feature point includes the whole video. It can be viewed as several GANs on different feature points, and a local-global discriminator structure [29] is thus not needed.

Besides, the TSMGAN learns to classify real or fake for each spatial-temporal feature point from the last convolutional layer of the discriminator, which mostly consists of high-level features. Since the l1 loss already focuses on low-level features, using TSMGAN improves the model in an efficient way.

Overall loss. The overall loss function to train the model is defined as:

L_total = λ_l1 L_l1 + λ_perc L_perc + λ_style L_style + λ_G L_G    (9)

where λ_l1, λ_perc, λ_style and λ_G are the weights for the l1 loss, perceptual loss, style loss and TSMGAN loss, respectively.
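Transcribed into code, Eqs. (7)–(9) look as follows (PyTorch assumed; d_real/d_fake denote discriminator outputs on real and generated videos, and the λ defaults are the values reported in the training procedure):

```python
import torch.nn.functional as F

def d_loss(d_real, d_fake):
    # Eq. (7): hinge loss under the sign convention above, pushing D(x) below -1
    # on real videos and D(G(z)) above 1 on generated ones.
    return (F.relu(1 + d_real) + F.relu(1 - d_fake)).mean()

def g_loss(d_fake):
    # Eq. (8): the generator lowers D(G(z)) so its output looks real.
    return d_fake.mean()

def total_loss(l1, perc, style, g_adv, w_l1=1.0, w_perc=1.0, w_style=10.0, w_g=1.0):
    # Eq. (9) with the weights from Section 1.2.
    return w_l1 * l1 + w_perc * perc + w_style * style + w_g * g_adv
```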
[Figure: (a) the Video Inpainting Generator passes the input video and masks through LGTSM convolutions, dilated LGTSM layers, and gated 2D convolutions to produce the output video, which is compared against the ground truth by the losses; (b) the TSM Patch Discriminator classifies real or fake for each temporal-channel feature point, yielding the adversarial loss.]
Figure 1: Overall model architecture.
7. Experimental Setup
■ Train with L1 loss, perceptual loss, and style loss
■ After convergence, add TSMGAN and fine-tune
■ Learning rate
  • 0.0001
■ Optimizer
  • Adam
■ Loss weights
  • λ_l1 = 1, λ_perc = 1, λ_style = 10, λ_G = 1
9. Experimental Results
■ Results averaged over mask-to-frame ratios
  • From 0–10% up to 60–70%
  • 7 bins
■ Comparable to the 3D model in LPIPS and FID
| Dataset | Metric | Mask | TCCDS | EC | CombCN | 3DGated | LGTSM |
|---|---|---|---|---|---|---|---|
| FaceForensics | MSE↓ | Curve | 0.0031* | 0.0022 | 0.0012 | 0.0008 | 0.0012 |
| | | Object | 0.0096* | 0.0074 | 0.0047 | 0.0048 | 0.0053 |
| | | BBox | 0.0055 | 0.0019 | 0.0016 | 0.0018 | 0.0020 |
| | LPIPS↓ | Curve | 0.0566* | 0.0562 | 0.0483 | 0.0276 | 0.0352 |
| | | Object | 0.1240* | 0.0761 | 0.1353 | 0.0743 | 0.0770 |
| | | BBox | 0.1260 | 0.0335 | 0.0708 | 0.0395 | 0.0432 |
| | FID↓ | Curve | 1.281* | 0.848 | 0.704 | 0.472 | 0.601 |
| | | Object | 1.107* | 0.946 | 0.913 | 0.766 | 0.782 |
| | | BBox | 1.013 | 0.663 | 0.742 | 0.663 | 0.681 |
| FVI | MSE↓ | Curve | 0.0220* | 0.0048 | 0.0021 | 0.0025 | 0.0028 |
| | | Object | 0.0110* | 0.0075 | 0.0048 | 0.0056 | 0.0065 |
| | LPIPS↓ | Curve | 0.2838* | 0.1204 | 0.0795 | 0.0522 | 0.0569 |
| | | Object | 0.2595* | 0.1398 | 0.2054 | 0.1078 | 0.1086 |
| | FID↓ | Curve | 2.1052* | 1.0334 | 0.7669 | 0.6096 | 0.6436 |
| | | Object | 1.2979* | 1.0754 | 1.0908 | 0.9050 | 0.9323 |
| Parameters↓ (Gen./Disc.) | | | – | 20M/6M | 16M/– | 36M/18M | 12M/6M |
| Inference FPS↑ | | | 0.05# | 55 | 120 | 23 | 80 |

Table 1: Quantitative comparison with baseline models on the FaceForensics and FVI datasets based on [2]. The results on the FVI dataset are averaged over seven mask-to-frame ratios; detailed results for each ratio can be found in the supplementary materials. *TCCDS failed on some cases and the results are averaged over successful ones. #Runs on CPU; others run on GPU.
10. Experimental Results
■ Restores frames as cleanly as 3DGated
Figure 4: Visual comparison with the baselines on the FVI testing set with object-like masks.
Best viewed in color and zoom-in. See video.
11. Experimental Results
■ Replacing 3D conv with TSM
  • Performance drops only slightly
  • Greatly reduces the parameter count
■ Adding learnable shifting
  • Further improves accuracy
  • Results nearly identical to the 3D model
| 3D conv. | TSM | Learnable shifting | Gated conv. | GAN | LPIPS↓ | FID↓ | Param.↓ |
|---|---|---|---|---|---|---|---|
| X | | | X | X | 0.1209 | 1.034 | 36M+18M |
| | X | | | X | 0.2048 | 1.303 | 12M*+6M |
| | X | | X | | 0.1660 | 1.198 | 12M |
| | X | | X | X | 0.1256 | 1.091 | 12M+6M |
| | X | X | X | X | 0.1213 | 1.039 | 12M+6M |

Table 2: Ablation study on the FVI dataset with object-like masks. Number of parameters …