Neural Machine Translation via Binary Code Prediction
Yusuke Oda (1)
Philip Arthur (1)
Graham Neubig (2, 1)
Koichiro Yoshino (1, 3)
Satoshi Nakamura (1)
(1) Nara Institute of Science and Technology
(2) Carnegie Mellon University
(3) Japan Science and Technology Agency
2017/08/01 Copyright (c) 2017 by Yusuke Oda. All Rights Reserved.
Motivation of This Work
● Neural machine translation models tend to be HEAVY
– due to the softmax, which requires an O(V) matrix multiplication.
● Goal: reduce the amount of computation in the output layer.
Requirements of Output Layers
● Memory: less memory allows the model to be loaded
on resource-restricted devices.
● Speed: less computation allows the model to run
without expensive processors.
● Parallelism: model parallelization is also important
to keep training fast.
Proposed: Binary Code Prediction
● Predicts the bits of the word ID number instead of a label.
– Softmax: O(V); outputs word scores, and arg max gives the word ID.
Loss: cross entropy (or an equivalent one).
– Binary code prediction: O(log V); sigmoid outputs directly give the
predicted bits of the word ID.
Loss: bit-wise errors (squared loss used in the experiments).
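As an illustration of the idea above, the mapping between word IDs and bit arrays, and the bit-wise squared loss, can be sketched in plain Python (the function names `id_to_bits`, `bits_to_id`, and `squared_bit_loss` are illustrative, not from the paper):

```python
import math

def id_to_bits(word_id, vocab_size):
    # A word ID in [0, V) needs only ceil(log2 V) bits instead of a
    # V-dimensional one-hot vector -- hence O(log V) output size.
    n_bits = math.ceil(math.log2(vocab_size))
    return [(word_id >> i) & 1 for i in range(n_bits)]

def bits_to_id(bits):
    # Threshold the sigmoid outputs at 0.5 and reassemble the ID.
    return sum(int(b >= 0.5) << i for i, b in enumerate(bits))

def squared_bit_loss(predicted, target_bits):
    # Bit-wise squared error, as used in the experiments.
    return sum((p - t) ** 2 for p, t in zip(predicted, target_bits))
```

For V = 65536 this gives 16-bit codes, matching the "Binary: 16 outputs" configuration in the experiments.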
Problems of Naïve Prediction
● Naïve binary code prediction models reduce accuracy.
● Cause 1: fragility of bit arrays
– A one-off bit error (even a single bit) yields a different word.
→ Requires more redundant bit representations.
● Cause 2: unbalanced word frequencies
– Frequent words crowd out rare words
because all words share the same parameters.
→ Separate the models for frequent and rare words.
Applying Error-correcting Codes (1)
● Introduces robustness into the bit arrays.
Original code:             1 0 1
Redundant code (encoded):  1 0 1 1 0 1 1 0 1
Received with errors:      1 0 1 0 0 1 1 1 1  (errors at two positions)
Restored code (decoded):   1 0 1  (no errors remain)
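The paper uses a convolutional code, but the slide's example can be reproduced with the simplest error-correcting code: repeat the codeword and decode each bit by majority vote. A minimal sketch (the actual encoder/decoder in the paper differ):

```python
def encode(bits, n=3):
    # Block repetition: repeat the whole codeword n times,
    # e.g. [1,0,1] -> [1,0,1, 1,0,1, 1,0,1].
    return bits * n

def decode(received, n=3):
    # Majority vote over the n copies of each original bit position;
    # up to (n-1)//2 errors per position are absorbed.
    k = len(received) // n
    decoded = []
    for i in range(k):
        copies = [received[i + j * k] for j in range(n)]
        decoded.append(1 if sum(copies) * 2 > n else 0)
    return decoded
```

Running this on the slide's example, `decode([1,0,1, 0,0,1, 1,1,1])` restores `[1, 0, 1]` despite the two flipped bits.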
Applying Error-correcting Codes (2)
● Introduces redundancy into the bit arrays.
– We used a type of convolutional code whose characteristics
suit our model well.
● Training: the ground-truth word is encoded into a redundant bit array
(increasing the number of bits), and the model's outputs are trained
against it.
● Test: the predicted bits are decoded back into a word;
decoding absorbs bit errors.
Softmax+Binary (Hybrid) Model
● Directly predicts the N most frequent words by softmax.
– N is set according to the corpus difficulty.
– Softmax layer = frequent words plus an "OTHER" class.
→ When "OTHER" is predicted, fall back to the binary (sigmoid) layer,
which outputs the bits of the rare word's ID.
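A minimal sketch of the hybrid decision, assuming the softmax layer scores the N frequent words plus one OTHER class, and rare-word IDs start right after the frequent vocabulary (all names here are hypothetical):

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def bits_to_id(bits):
    # Threshold the sigmoid outputs at 0.5 and reassemble the binary ID.
    return sum(int(b >= 0.5) << i for i, b in enumerate(bits))

def hybrid_predict(softmax_logits, bit_probs, n_frequent):
    # softmax_logits: scores over the N frequent words plus one OTHER class.
    # bit_probs: sigmoid outputs predicting the bits of a rare word's ID.
    word = int(np.argmax(softmax(softmax_logits)))
    if word < n_frequent:
        return word  # a frequent word, predicted directly
    # OTHER was predicted: fall back to the binary layer; rare-word IDs
    # are assumed here to start right after the frequent vocabulary.
    return n_frequent + bits_to_id(bit_probs)
```

At training time the two layers would be trained jointly (cross entropy on the softmax part, bit-wise loss on the binary part); this sketch only shows the test-time decision.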
Objective of Experiments
● Measures the characteristics of binary code prediction models
by comparing them with the softmax.
– Translation accuracy: BLEU
– Memory consumption: size of the output layers
– Processing speed / parallelism: training on GPUs;
testing on GPUs and CPUs (CPU: no multi-threading)
Results: Training Curves
[Figure: BLEU (0–35%) vs. number of trained minibatches (×1000)
for Softmax, Binary, Binary-EC, Hybrid-512, and Hybrid-512-EC]
● Languages: En→Ja
– Domain: scientific papers (ASPEC 2M)
● Output layer sizes:
– Softmax: 65536 outputs
– Binary: 16 outputs
– Binary-EC: 44 outputs
– Hybrid: 512+16 outputs
– Hybrid-EC: 512+44 outputs
● Naïve binary prediction performs worse than the others.
● Two additional methods improve accuracy:
– Hybrid
– Error-correcting codes
– Both combined
Results: Speed (ASPEC En→Ja)
[Figure: processing time [ms] (0–3000) for training (GPU; per minibatch),
testing (GPU; per sentence), and testing (CPU; per sentence),
comparing Softmax, Binary, Hybrid-512, Hybrid-2048, Binary-EC,
Hybrid-512-EC, and Hybrid-2048-EC]
● 20–30% faster than softmax on GPUs
→ No extra cost for parallel computation.
● About ×10 faster on CPUs
→ Can run fast even on low-power devices.
Summary
● Proposed method
– An NMT output layer based on binary code prediction
– Two model improvements:
● Hybrid model combining softmax and binary codes
● Applying error-correcting codes
● Results
– BLEU comparable to softmax
– Reduces the size of the output layers to about 1/10
– Speed-up (especially ×10 on CPUs)
● Future work
– More efficient raw bit arrays, losses, and error-correcting codes
– Analyzing the model (what changes when binary codes are introduced?)