Neural Machine Translation via Binary Code Prediction
Yusuke Oda (1)
Philip Arthur (1)
Graham Neubig (2, 1)
Koichiro Yoshino (1, 3)
Satoshi Nakamura (1)
(1) Nara Institute of Science and Technology
(2) Carnegie Mellon University
(3) Japan Science and Technology Agency
2017/08/01 Copyright (c) 2017 by Yusuke Oda. All Rights Reserved.
Motivation of This Work
● Neural machine translation models tend to be HEAVY
– due to the softmax, which requires an O(V) matrix multiplication.
● Goal: reduce the amount of computation in the output layer.
Requirements of Output Layers
● Memory: less memory allows the model to be loaded
on resource-restricted devices.
● Speed: less computation allows the model to run
without expensive processors.
● Parallelism: model parallelization is also important
to keep training fast.
Proposed: Binary Code Prediction
● Predicts the bits of the word ID number instead of a label.
– Softmax: O(V); outputs word scores, and arg max gives the word ID.
Loss: cross entropy (or an equivalent one).
– Binary code prediction: O(log V); sigmoid outputs directly give the
predicted bits of the word ID.
Loss: bit-wise errors (squared loss used in the experiments).
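As an illustration of the idea above, the mapping between word IDs and bit arrays, and the bit-wise squared loss, can be sketched in plain Python (the function names `id_to_bits`, `bits_to_id`, and `squared_bit_loss` are illustrative, not from the paper):

```python
import math

def id_to_bits(word_id, vocab_size):
    # A word ID in [0, V) needs only ceil(log2 V) bits instead of a
    # V-dimensional one-hot vector -- hence O(log V) output size.
    n_bits = math.ceil(math.log2(vocab_size))
    return [(word_id >> i) & 1 for i in range(n_bits)]

def bits_to_id(bits):
    # Threshold the sigmoid outputs at 0.5 and reassemble the ID.
    return sum(int(b >= 0.5) << i for i, b in enumerate(bits))

def squared_bit_loss(predicted, target_bits):
    # Bit-wise squared error, as used in the experiments.
    return sum((p - t) ** 2 for p, t in zip(predicted, target_bits))
```

For V = 65536 this gives 16-bit codes, matching the "Binary: 16 outputs" configuration in the experiments.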
Problems of Naïve Prediction
● Naïve binary code prediction models reduce accuracy.
● Cause 1: fragility of bit arrays
– A one-off bit error (even a single bit) yields a different word.
→ Requires more redundant bit representations.
● Cause 2: unbalanced word frequencies
– Frequent words crowd out rare words
because all words share the same parameters.
→ Separate the models for frequent and rare words.
Applying Error-correcting Codes (1)
● Introduces robustness into the bit arrays.
Original code:             1 0 1
Redundant code (encoded):  1 0 1 1 0 1 1 0 1
Received with errors:      1 0 1 0 0 1 1 1 1  (errors at two positions)
Restored code (decoded):   1 0 1  (no errors remain)
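The paper uses a convolutional code, but the slide's example can be reproduced with the simplest error-correcting code: repeat the codeword and decode each bit by majority vote. A minimal sketch (the actual encoder/decoder in the paper differ):

```python
def encode(bits, n=3):
    # Block repetition: repeat the whole codeword n times,
    # e.g. [1,0,1] -> [1,0,1, 1,0,1, 1,0,1].
    return bits * n

def decode(received, n=3):
    # Majority vote over the n copies of each original bit position;
    # up to (n-1)//2 errors per position are absorbed.
    k = len(received) // n
    decoded = []
    for i in range(k):
        copies = [received[i + j * k] for j in range(n)]
        decoded.append(1 if sum(copies) * 2 > n else 0)
    return decoded
```

Running this on the slide's example, `decode([1,0,1, 0,0,1, 1,1,1])` restores `[1, 0, 1]` despite the two flipped bits.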
Applying Error-correcting Codes (2)
● Introduces redundancy into the bit arrays.
– We used a type of convolutional code whose characteristics
suit our model well.
● Training: the ground-truth word is encoded into a redundant bit array
(increasing the number of bits), and the model's outputs are trained
against it.
● Test: the predicted bits are decoded back into a word;
decoding absorbs bit errors.
Softmax+Binary (Hybrid) Model
● Directly predicts the N most frequent words by softmax.
– N is set according to the corpus difficulty.
– Softmax layer = frequent words plus an "OTHER" class.
→ When "OTHER" is predicted, fall back to the binary (sigmoid) layer,
which outputs the bits of the rare word's ID.
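A minimal sketch of the hybrid decision, assuming the softmax layer scores the N frequent words plus one OTHER class, and rare-word IDs start right after the frequent vocabulary (all names here are hypothetical):

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def bits_to_id(bits):
    # Threshold the sigmoid outputs at 0.5 and reassemble the binary ID.
    return sum(int(b >= 0.5) << i for i, b in enumerate(bits))

def hybrid_predict(softmax_logits, bit_probs, n_frequent):
    # softmax_logits: scores over the N frequent words plus one OTHER class.
    # bit_probs: sigmoid outputs predicting the bits of a rare word's ID.
    word = int(np.argmax(softmax(softmax_logits)))
    if word < n_frequent:
        return word  # a frequent word, predicted directly
    # OTHER was predicted: fall back to the binary layer; rare-word IDs
    # are assumed here to start right after the frequent vocabulary.
    return n_frequent + bits_to_id(bit_probs)
```

At training time the two layers would be trained jointly (cross entropy on the softmax part, bit-wise loss on the binary part); this sketch only shows the test-time decision.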
Objective of Experiments
● Measures the characteristics of binary code prediction models
by comparing them with the softmax.
– Translation accuracy: BLEU
– Memory consumption: size of the output layers
– Processing speed / parallelism: training on GPUs;
testing on GPUs and CPUs (CPU: no multi-threading)
Results: Training Curves
[Figure: BLEU (0–35%) vs. number of trained minibatches (×1000)
for Softmax, Binary, Binary-EC, Hybrid-512, and Hybrid-512-EC]
● Languages: En→Ja
– Domain: scientific papers (ASPEC 2M)
● Output layer sizes:
– Softmax: 65536 outputs
– Binary: 16 outputs
– Binary-EC: 44 outputs
– Hybrid: 512+16 outputs
– Hybrid-EC: 512+44 outputs
● Naïve binary prediction performs worse than the others.
● Two additional methods improve accuracy:
– Hybrid
– Error-correcting codes
– Both combined
Results: Speed (ASPEC En→Ja)
[Figure: processing time [ms] (0–3000) for training (GPU; per minibatch),
testing (GPU; per sentence), and testing (CPU; per sentence),
comparing Softmax, Binary, Hybrid-512, Hybrid-2048, Binary-EC,
Hybrid-512-EC, and Hybrid-2048-EC]
● 20–30% faster than softmax on GPUs
→ No extra cost for parallel computation.
● About ×10 faster on CPUs
→ Can run fast even on low-power devices.
Summary
● Proposed method
– An NMT output layer based on binary code prediction
– Two model improvements:
● Hybrid model combining softmax and binary codes
● Applying error-correcting codes
● Results
– BLEU comparable to softmax
– Reduces the size of the output layers to about 1/10
– Speed-up (especially ×10 on CPUs)
● Future work
– More efficient raw bit arrays, losses, and error-correcting codes
– Analyzing the model (what changes when binary codes are introduced?)