!"#$%#&'"()#*"+(,"-./$#'(
0123*(4/+5'"(3/$(6"".(72+"/(
8%.#2%*2%9
!"#$%"&'()*"&'+(,*- !.($%.+(/."&#!%&'($--+(0%&123&(41.+(
567)89:;
橋口凌大 (Tamaki Lab, Nagoya Institute of Technology)
English paper reading, 2021/11
Overview
■ Proposes the Learnable Gated Temporal Shift Module (LGTSM)
• A learnable shift module
• Temporal modeling with 2D convolutions
■ Accuracy on par with 3D convolutions
• With only 33% of the parameters and computation
■ Proposes TSMGAN
• Substantially improves model performance on free-form video inpainting
Figure 1 (from the paper): The model takes videos with free-form masks (first row) and fills in the missing areas with the proposed LGTSM to generate realistic completed results (second row), compared to the original videos (third row). It can be applied to video editing tasks such as video object removal.
Image inpainting
■ Gated convolution [Yu+, ICCV2019]
• Masked regions
• Inpainted regions
• Unmasked regions
• Learns the differences among these regions
■ Edge-Connect [Nazeri+, ICCV2019]
• Generates edges
• Restores the image conditioned on the edges
[Excerpt from Yu+ (gated convolution), partly truncated in the slide:] In vanilla convolutional layers the same filters are applied at all spatial locations, which suits image classification and object detection where all input pixels are valid, but for inpainting the input is composed of both valid pixels/features outside the mask and invalid (or synthesized) pixels/features inside the masked regions. This causes ambiguity during training and leads to artifacts such as color discrepancy and blurriness. Partial convolution addresses this with a rule-based masking and re-normalization step, whereas gated convolution learns a soft mask automatically from data:

Gating_{y,x} = ΣΣ W_g · I,  Feature_{y,x} = ΣΣ W_f · I

(the output combines the two as the activated feature gated by σ(Gating), cf. Eq. (3) later in this deck).

Figure 2 (Yu+): Illustration of partial convolution (left, rule-based binary-mask update) and gated convolution (right, learnable soft gating).
Figure 2 (Nazeri+): The incomplete grayscale image, edge map, and mask are the inputs of G1, which predicts the full edge map; the predicted edge map and the incomplete color image are passed to G2 to perform the inpainting task.
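To make the gating mechanism concrete, here is a minimal PyTorch sketch of a gated convolution layer (an illustration under assumed layer sizes and a LeakyReLU feature activation, not the authors' implementation):

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Minimal gated convolution: a feature branch and a gating branch,
    combined as activation(feature) * sigmoid(gating)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, dilation)
        self.gating = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, dilation)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        # The soft gate in (0, 1) decides how much of each feature passes through,
        # replacing the rule-based binary mask of partial convolution.
        return self.act(self.feature(x)) * torch.sigmoid(self.gating(x))

# Usage: 4-channel input (RGB + mask), as in the inpainting setting.
layer = GatedConv2d(4, 32)
out = layer(torch.randn(1, 4, 64, 64))  # -> (1, 32, 64, 64)
```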
Video inpainting
■ CombCN [Wang+, AAAI2019]
• Uses a 3D CNN to generate temporally consistent video
• Refines frame by frame with a 2D CNN
■ Proposed method
• Handles temporal information in an integrated way with 2D convolutions
• Proposes LGTSM
Figure 2 (Wang+): Network architecture of the 3D completion network (3DCN) and the 3D-2D combined completion network (CombCN). The 3DCN works in low resolution, producing an inpainted video as output. Its individual frames are further convolved and added into the first and last layers of the same size in CombCN. The input video for 3DCN and the input frame for CombCN are in 4-channel format, containing RGB and the mask indicating the holes to be filled.
[Excerpt from Wang+ (CombCN):] ... and produces a complete video V_out as output. The incomplete video is represented as an F × H × W volume, where F, H, W are the number of frames, height and width of V_in. M and V_out are of the same size as V_in. In the training stage, V_in is obtained by randomly generating holes on a complete video V_c, i.e. the target ground truth of the inpainting task.
To produce a spatio-temporally coherent inpainted video, the network consists of two sub-networks. One is a 3D completion network (3DCN), which uses a 3D CNN to infer the temporal structure globally from a down-sampled version (F × H/r × W/r) of V_in and M; a superscript d distinguishes the downsampled videos from their original versions, so 3DCN takes V^d_in and M^d as input and produces V^d_out as output. The other sub-network is a 3D-2D combined completion network (CombCN), which applies a 2D CNN to perform completion frame by frame. Its input includes one incomplete but high-resolution frame of V_in with its mask frame of M, and a low-resolution but complete frame I^{d,k}_out of V^d_out (k = 1, 2, ..., F). F, H, W and r are set to 32, 128, 128 and 2, respectively.
Taking the incomplete video and its mask as input, it first exploits 4 strided convolutional layers to encode it into a latent space, capturing its temporal-spatial structure. Then 3 dilated convolutional layers with rates 2, 4, 8 follow to capture spatio-temporal information over a larger perception field. Finally, video inpainting is achieved by 3 convolutional and 2 fractionally-strided convolutional layers in alternating order, yielding the result with the missing part filled. Rather than using max-pooling and upsampling layers to compute the feature maps, 3 × 3 convolution kernels with stride 2 are employed, which ensures that every pixel contributes. Meanwhile, considering that non-successive frames may have loose relations with the current one, and to avoid information loss across frames, stride and dilation are limited to take effect only within a frame rather than across frames; as a result, the feature map of each layer has a constant frame number F. Besides, all convolutional layers are followed by batch normalization (BN) and ReLU activation except the last one. Paddings make the input and output exactly the same size. The skip-connections as in U-Net (Ronneberger, Fischer, ...) ...
[Excerpt from the LGTSM supplementary material:] We use kernel size (5 × 5) for the first convolutional layer, kernel size (4 × 4) with stride 2 for the down-sampling layers, kernel size (3 × 3) with dilation of 2, 4, 8, 16 for the dilated layers, and kernel size (3 × 3) for the other convolutional layers in the generator. For non-linearity layers, we use LeakyReLU. We use nearest up-sampling and convolutions for the up-sampling layers. GTSM/LGTSM is added on all convolutional layers except for the first one, as it is the original RGB input.
Figure 1 (from the paper): Overall model architecture — a video inpainting generator built from LGTSM convolutions (with dilated LGTSM layers and a gated 2D convolution output layer) and a TSM patch discriminator that provides the adversarial loss.
!"#$%
n単純なシフトだとマスクされた領域がシフトされる可能性
n学習可能なシフトカーネル
Figure 2 (from the paper): Temporal shift as convolution — (a) features (C channels over T frames); (b) naive TSM shifts part of the channels backward/forward in time, with zero padding and truncation; (c) the same operation expressed as fixed backward/forward shifting kernels ([0,0,1] and [1,0,0]) convolved along time; (d) learnable kernels replace the fixed ones.
Figure 3 (from the paper): Module design. (a) Residual TSM [14] and (b) gated convolution [28] are integrated into (c) the Gated Temporal Shift Module (GTSM), and the learnable temporal shifting kernels (Fig. 2d) yield (d) the proposed Learnable Gated Temporal Shift Module (LGTSM).
Output_{t,x,y} = σ(Gating_{t,x,y}) ⊙ f(Features_{t,x,y})    (3)
where σ is the sigmoid function that transforms the gating into values between 0 (invalid) and 1 (valid), and f is the activation function of the convolution. Note that the TSM module can easily be modified to online settings (without peeking at future frames) for real-time applications.
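The learnable shifting kernels above can be read as a per-channel temporal 1D convolution whose weights are initialized to forward/backward/identity shifts and then trained, followed by the gating of Eq. (3). Below is a minimal PyTorch sketch of that reading; the channel split, kernel size and activations are assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableGatedTSM(nn.Module):
    """Sketch of a learnable gated temporal-shift block: a per-channel temporal
    kernel (initialized as past/future/identity shifts) mixes each channel across
    neighbouring frames, then a gated 2D convolution processes each frame."""
    def __init__(self, channels, temporal_kernel=3):
        super().__init__()
        kernel = torch.zeros(channels, 1, temporal_kernel)
        kernel[: channels // 4, 0, 0] = 1                      # init: take from the past
        kernel[channels // 4 : channels // 2, 0, -1] = 1        # init: take from the future
        kernel[channels // 2 :, 0, temporal_kernel // 2] = 1    # init: identity
        self.temporal_kernel = nn.Parameter(kernel)             # learnable shifting kernels
        self.feature = nn.Conv2d(channels, channels, 3, padding=1)
        self.gating = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):                       # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        # Temporal mixing: grouped conv1d along T, one learned kernel per channel.
        y = x.permute(0, 3, 4, 2, 1).reshape(n * h * w, c, t)
        y = F.conv1d(y, self.temporal_kernel,
                     padding=self.temporal_kernel.shape[-1] // 2, groups=c)
        y = y.reshape(n, h, w, c, t).permute(0, 4, 3, 1, 2)
        # Gated 2D convolution applied frame by frame, as in Eq. (3).
        y = y.reshape(n * t, c, h, w)
        out = torch.sigmoid(self.gating(y)) * F.leaky_relu(self.feature(y), 0.2)
        return out.reshape(n, t, c, h, w)

block = LearnableGatedTSM(channels=32)
print(block(torch.randn(2, 8, 32, 64, 64)).shape)  # torch.Size([2, 8, 32, 64, 64])
```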
#$%"&'
nCF6AMN
n各空間#時間特徴点について本物か偽物かの分類を学習
n高レベルな特徴
n高レベルの特徴の@311を下げることで効率的なモデルの改善
nモデル全体の@311
[Excerpt from the paper:] TSMGAN loss. All the aforementioned loss functions are for images only and do not take temporal consistency into consideration. Therefore, we develop TSMGAN to learn temporal consistency. We set up a generative adversarial network (GAN) with the Gated Temporal Shift Module integrated in both the generator and discriminator as stated in 3.1.
The TSMGAN discriminator is composed of six 2D convolutional layers with TSM. Also, we apply the recently proposed spectral normalization [16] to both the generator and discriminator, as in [17], to enhance training stability. The TSMGAN loss L_D for the discriminator to tell if the input video z is real or fake, and L_G for the generator to fool the discriminator, are defined as:

L_D = E_{x∼P_data(x)}[ReLU(1 + D(x))] + E_{z∼P_z(z)}[ReLU(1 − D(G(z)))]    (7)
L_G = E_{z∼P_z(z)}[D(G(z))]    (8)

As in [28], the kernel size is 5 × 5, the stride 2 × 2, and the shifting operation is applied to all convolutional layers of the TSMGAN discriminator, so the receptive field of each output feature point includes the whole video. It can be viewed as several GANs on different feature points, and a local-global discriminator structure [29] is thus not needed.
Besides, the TSMGAN learns to classify real or fake for each spatial-temporal feature point from the last convolutional layer of the discriminator, which mostly consists of high-level features. Since the l1 loss already focuses on low-level features, using TSMGAN improves the model in an efficient way.
Overall loss. The overall loss function to train the model is defined as:

L_total = λ_l1 L_l1 + λ_perc L_perc + λ_style L_style + λ_G L_G    (9)

where λ_l1, λ_perc, λ_style and λ_G are the weights for the l1 loss, perceptual loss, style loss and TSMGAN loss, respectively.
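A small sketch of how losses (7)–(9), as reconstructed above, could be computed in PyTorch; D is assumed to output one score per spatio-temporal feature point, and only the loss weights come from the paper:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    # Eq. (7): hinge-style loss averaged over every spatio-temporal feature point.
    return F.relu(1.0 + d_real).mean() + F.relu(1.0 - d_fake).mean()

def generator_adv_loss(d_fake):
    # Eq. (8): the generator pushes the discriminator scores on its outputs down.
    return d_fake.mean()

def total_loss(l1, perc, style, adv, w_l1=1.0, w_perc=1.0, w_style=10.0, w_adv=1.0):
    # Eq. (9): weighted sum of reconstruction, perceptual, style and GAN terms.
    return w_l1 * l1 + w_perc * perc + w_style * style + w_adv * adv

# Example with dummy per-feature-point discriminator outputs.
d_real = torch.randn(2, 1, 4, 8, 8)
d_fake = torch.randn(2, 1, 4, 8, 8)
print(discriminator_loss(d_real, d_fake), generator_adv_loss(d_fake))
```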
Figure 1 (from the paper): Overall model architecture. (a) The video inpainting generator takes the input video and masks and is built from LGTSM convolutions, dilated LGTSM layers and a gated 2D convolution output layer; its output video is compared against the ground truth by the reconstruction losses. (b) The TSM patch discriminator classifies each spatio-temporal feature point of the output video as real or fake, providing the adversarial loss.
Experimental setup
■ First train with the L1 loss, perceptual loss, and style loss
■ After convergence, add TSMGAN and fine-tune
■ Learning rate
• 0.0001
■ Optimizer
• Adam
■ Loss weights (see below)
λ_l1 = 1, λ_perc = 1, λ_style = 10, λ_G = 1
[Excerpt from the supplementary material:] 1.2 Training Procedure. Please see the Setups (Training and testing) section in the main paper. At the first stage of training, we found that applying only the l1 loss, perceptual loss and style loss enhances training speed and stability. After the first stage converges, we add the TSMGAN to fine-tune. We set λ_l1 = 1, λ_perc = 1, λ_style = 10 and λ_G = 1. We use the Adam optimizer [3] with learning rate 0.0001.
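A compact sketch of this two-stage schedule (stage 1: reconstruction-style losses only; stage 2: the TSMGAN term is added), using the weights and optimizer settings quoted above; the generator, data loader and individual loss callables are placeholders:

```python
import torch

# Hypothetical generator, loader and loss callables; only the schedule, weights
# and optimizer settings below are taken from the excerpt above.
def train_epoch(generator, losses, loader, stage2=False, lr=1e-4):
    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    weights = {"l1": 1.0, "perc": 1.0, "style": 10.0, "gan": 1.0}
    for video, mask, target in loader:
        out = generator(video, mask)
        total = (weights["l1"] * losses["l1"](out, target)
                 + weights["perc"] * losses["perc"](out, target)
                 + weights["style"] * losses["style"](out, target))
        if stage2:  # the TSMGAN term is only added after stage 1 has converged
            total = total + weights["gan"] * losses["gan"](out)
        opt.zero_grad()
        total.backward()
        opt.step()
```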
Experimental setup
■ Image quality (a metrics sketch follows after this list)
• Mean Squared Error (MSE)
• Learned Perceptual Image Patch Similarity (LPIPS)
■ Video quality and temporal consistency
• Fréchet Inception Distance (FID)
■ Baselines
• TCCDS [Huang+, ACMTG2016]
• Edge-Connect [Nazeri+]
• CombCN [Wang+, AAAI2019]
• 3DGated [Chang+, ICCV2019]
■ Datasets
• FaceForensics
• Free-form Video Inpainting (FVI)
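For reference, per-frame MSE and LPIPS can be computed roughly as follows (a sketch, not the authors' evaluation code; it assumes the third-party `lpips` package and frames scaled to [-1, 1]):

```python
import torch
import lpips  # pip install lpips (Zhang et al. perceptual similarity)

lpips_fn = lpips.LPIPS(net='alex')  # AlexNet-based LPIPS

def frame_metrics(pred, gt):
    """pred, gt: (T, 3, H, W) tensors scaled to [-1, 1]."""
    mse = ((pred - gt) ** 2).mean().item()
    with torch.no_grad():
        lp = lpips_fn(pred, gt).mean().item()  # averaged over frames
    return mse, lp

mse, lp = frame_metrics(torch.rand(8, 3, 128, 128) * 2 - 1,
                        torch.rand(8, 3, 128, 128) * 2 - 1)
print(f"MSE={mse:.4f}  LPIPS={lp:.4f}")
```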
Experimental results
■ Averaged over mask-to-frame ratios
• 0–10% up to 60–70%
• 7 levels
■ Comparable to the 3D model in LPIPS and FID
Dataset | Metric | Mask | TCCDS | EC | CombCN | 3DGated | LGTSM
FaceForensics | MSE↓ | Curve | 0.0031* | 0.0022 | 0.0012 | 0.0008 | 0.0012
FaceForensics | MSE↓ | Object | 0.0096* | 0.0074 | 0.0047 | 0.0048 | 0.0053
FaceForensics | MSE↓ | BBox | 0.0055 | 0.0019 | 0.0016 | 0.0018 | 0.0020
FaceForensics | LPIPS↓ | Curve | 0.0566* | 0.0562 | 0.0483 | 0.0276 | 0.0352
FaceForensics | LPIPS↓ | Object | 0.1240* | 0.0761 | 0.1353 | 0.0743 | 0.0770
FaceForensics | LPIPS↓ | BBox | 0.1260 | 0.0335 | 0.0708 | 0.0395 | 0.0432
FaceForensics | FID↓ | Curve | 1.281* | 0.848 | 0.704 | 0.472 | 0.601
FaceForensics | FID↓ | Object | 1.107* | 0.946 | 0.913 | 0.766 | 0.782
FaceForensics | FID↓ | BBox | 1.013 | 0.663 | 0.742 | 0.663 | 0.681
FVI | MSE↓ | Curve | 0.0220* | 0.0048 | 0.0021 | 0.0025 | 0.0028
FVI | MSE↓ | Object | 0.0110* | 0.0075 | 0.0048 | 0.0056 | 0.0065
FVI | LPIPS↓ | Curve | 0.2838* | 0.1204 | 0.0795 | 0.0522 | 0.0569
FVI | LPIPS↓ | Object | 0.2595* | 0.1398 | 0.2054 | 0.1078 | 0.1086
FVI | FID↓ | Curve | 2.1052* | 1.0334 | 0.7669 | 0.6096 | 0.6436
FVI | FID↓ | Object | 1.2979* | 1.0754 | 1.0908 | 0.9050 | 0.9323
Parameters (Generator/Discriminator)↓ | | | - | 20M/6M | 16M/- | 36M/18M | 12M/6M
Inference FPS↑ | | | 0.05# | 55 | 120 | 23 | 80
Table 1: Quantitative comparison with baseline models on the FaceForensics and FVI datasets based on [2]. The FVI results are averaged over seven mask-to-frame ratios; detailed results for each ratio can be found in the supplementary materials. *TCCDS failed on some cases and the results are averaged over the successful ones. #runs on CPU; others on GPU.
Experimental results
■ Restores videos with visual quality similar to 3DGated
Figure 4 (from the paper): Visual comparison with the baselines on the FVI testing set with object-like masks.
Experimental results
■ Replacing 3D convolutions with TSM
• Performance drops only slightly
• Greatly reduces the number of parameters
■ Adding learnable shifting
• Further improves accuracy
• Results almost identical to 3D
3D conv. | TSM | Learnable shifting | Gated conv. | GAN | LPIPS↓ | FID↓ | Param.↓
X | | | X | X | 0.1209 | 1.034 | 36M+18M
 | X | | | X | 0.2048 | 1.303 | 12M*+6M
 | X | | X | | 0.1660 | 1.198 | 12M
 | X | | X | X | 0.1256 | 1.091 | 12M+6M
 | X | X | X | X | 0.1213 | 1.039 | 12M+6M
Table 2: Ablation study on the FVI dataset with object-like masks.
Summary
■ Proposes LGTSM
■ Learns to shift channels to neighboring frames
■ Distinguishes regions with a gating filter
■ Temporal information can be processed with 2D convolutions
■ Only 33% of the computation of 3D-convolution methods, with high performance