Presentation slides from EMNLP 2013.
Paper : http://aclweb.org/anthology/D/D13/
Direct Link : http://aclweb.org/anthology/D/D13/D13-1023.pdf
Source Code : https://github.com/jnory/DALM
An Efficient Language Model Using Double-Array Structures
1. EMNLP 2013
An Efficient Language Model
Using Double-Array Structures
Makoto Yasuhara, Toru Tanaka
Jun-ya Norimatsu, Mikio Yamamoto
University of Tsukuba, Japan
2. Introduction (1)
Bigger and Bigger LMs
Have you ever encountered these problems?
LMs cannot be loaded into memory because of their size
Querying the LM becomes a bottleneck in your system
Store compactly, query fast!
3. Our System Overview
• LM implementation based on double-array structures
• Modified double-array structure to store backward suffix trees
• Two optimization methods to improve efficiency
We call our LM “DALM”
4. Double-Array Structures
(Aoe, 1989)
What is a double-array structure?
A fast and compact representation of a trie
A trie is represented by two arrays (BASE and CHECK)
[Figure: a small trie (ROOT with edges A and B) and its double-array representation]
5. 2D Array Implementation of a Trie
[Figure: a trie (nodes 1-7, edges labeled A, B, C) stored as a sparse 2D array indexed by node number and input character]
Sparse array: simple and fast, but consumes a lot of memory
6. Compact Representation of a Sparse 2D Array
[Figure: each row of the sparse 2D array is shifted until its occupied slots fall into the gaps left by the other rows, and the shifted rows are merged into a single NEXT array]
Information loss! The merged NEXT array alone no longer records which row each slot came from
The double-array structure modifies this scheme to include all information about the original trie
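The shift-and-merge idea above can be sketched as a greedy packing routine. This is illustrative only, not DALM's actual construction algorithm; the row/column encoding is an assumption for the example:

```python
def merge_rows(rows):
    """Greedily pack sparse rows into one array.

    rows: dict mapping row id -> dict {column: value}.
    Returns (next_array, shift) where shift[row] is the offset
    applied to that row's columns. Note that next_array alone
    cannot tell which row a slot came from -- the information
    loss the slide points out, which CHECK later repairs.
    """
    next_array = {}   # merged NEXT: index -> value
    shift = {}
    for row_id, cells in rows.items():
        offset = 0
        # find the smallest shift where no occupied slot collides
        while any(offset + col in next_array for col in cells):
            offset += 1
        shift[row_id] = offset
        for col, val in cells.items():
            next_array[offset + col] = val
    return next_array, shift

# toy rows from a sparse 2D table (columns are character codes)
rows = {1: {1: 2, 2: 3}, 2: {1: 4, 3: 5}, 3: {3: 6}}
merged, shifts = merge_rows(rows)
```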
7. Details of Double-Array Structures
(Aoe, 1989)
Definition: the child of node s on input character c is t = BASE[s] + c, and the transition is valid iff CHECK[t] = s
Example:
[Figure: a trie over the symbols A, B, C and the corresponding BASE and CHECK arrays]
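The definition above translates directly into a lookup routine. A minimal sketch; the toy BASE/CHECK arrays are illustrative, not taken from the paper:

```python
# Double-array trie lookup (Aoe, 1989): the child of node s on
# character code c is t = BASE[s] + c, valid only when CHECK[t] == s.
def transition(base, check, s, c):
    t = base[s] + c
    if 0 <= t < len(check) and check[t] == s:
        return t
    return None  # no such edge

def lookup(base, check, codes):
    """Follow a sequence of character codes from the root (node 0)."""
    s = 0
    for c in codes:
        s = transition(base, check, s, c)
        if s is None:
            return None
    return s

# Toy arrays (illustrative): root 0 has edge 1 -> node 1 and
# edge 2 -> node 2; node 1 has edge 2 -> node 4.
BASE  = [0, 2, 0, 0, 0]
CHECK = [-1, 0, 0, -1, 1]   # -1 marks unused slots
```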
8. Efficient Trie Representations for N-gram Models
Backward suffix trees (Bell et al., 1990; Stolcke, 2002; Germann et al., 2009)
History words are stored in reverse order
Target words are stored in separate lists
Efficient back-off: when a node (e.g. the B node) is not found, the nodes visited so far already correspond to the shorter, backed-off history
[Figure: a backward suffix tree over histories built from X, Y, Z, A, B, C]
9. Endmarker Symbols for Backward Suffix Trees
Endmarker symbols (Aoe, 1989) are placed after the history words
The target word follows the endmarker symbol
[Figure: a backward suffix tree in which each reversed history is followed by an endmarker (#) node, which in turn leads to the target words]
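Putting slides 8 and 9 together, the trie path for an n-gram can be sketched as follows. This is an illustrative key layout implied by the slides, not DALM's internal encoding:

```python
END = "#"  # the endmarker symbol, treated as an ordinary word

def trie_key(ngram):
    """Path for an n-gram (w1 ... wn) in the backward suffix tree:
    the history words in reverse order, then the endmarker, then
    the target word."""
    *history, target = ngram
    return list(reversed(history)) + [END, target]

trie_key(["X", "Y", "Z"])  # -> ["Y", "X", "#", "Z"]
```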
10. Double-array Representation of Backward Suffix Trees
Endmarker symbols are treated as words: a word ID is assigned to the endmarker symbol
[Figure: a backward suffix tree and its BASE and CHECK arrays, with the endmarker stored like any other word]
11. Double-array Language Model: Simple Structures
Introducing a VALUE array
The VALUE array contains the corresponding probabilities and back-off weights (BOW)
[Figure: BASE and CHECK arrays paired with a VALUE array that holds the probability/BOW for each stored n-gram]
12. Double-array Language Model: Embedding Structures (1)
Filling unused slots with values
The double-array contains unused slots; these empty slots are used to store values
[Figure: BASE and CHECK arrays with their unused slots highlighted as storage for values]
13. Double-array Language Model: Embedding Structures (2)
Using the BASE and CHECK arrays to store values
Values are quantized losslessly; a slot can also hold the index of a VALUE array entry, marked with a negative sign
[Figure: a BASE/CHECK slot holding -2 pointing into the VALUE array]
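The negative-index convention can be sketched as a small decoder. The slot layout here is illustrative, not DALM's exact format:

```python
def decode_slot(slot, value_array):
    """A slot holding a negative number -i is an indirection into
    the VALUE array (VALUE[i]); anything else is the losslessly
    quantized value stored inline in the slot itself."""
    if slot < 0:
        return value_array[-slot]  # follow the negative index
    return slot                    # value embedded in place

VALUE = [None, 0.25, 0.5]  # VALUE[1], VALUE[2] hold spilled entries
decode_slot(-2, VALUE)     # -> 0.5
decode_slot(7, VALUE)      # -> 7 (stored inline)
```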
14. Double-array Language Model: Ordering Method (1)
Tuning word IDs: we assign word IDs in order of unigram probability
Sort the words in order of descending probability:

Word ID | P(word) | Word
      1 |  0.0413 | B
      2 |  0.0300 | X
      3 |  0.0284 | A
      4 |  0.0201 | Y
      5 |  0.0101 | C
      6 |  0.0050 | Z
      7 |  0.0020 | D
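The ID assignment above is a straightforward sort. A minimal sketch reproducing the table (the probabilities are the slide's toy values):

```python
def assign_word_ids(unigram_probs):
    """Give small word IDs to frequent words: sort by descending
    unigram probability and number the words from 1."""
    ranked = sorted(unigram_probs.items(), key=lambda kv: -kv[1])
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

probs = {"A": 0.0284, "B": 0.0413, "C": 0.0101,
         "D": 0.0020, "X": 0.0300, "Y": 0.0201, "Z": 0.0050}
assign_word_ids(probs)
# -> {'B': 1, 'X': 2, 'A': 3, 'Y': 4, 'C': 5, 'Z': 6, 'D': 7}
```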
15. Double-array Language Model: Ordering Method (2)
Modifying the 2D array
[Figure: the node-by-word 2D array before and after ordering, with the columns renumbered according to the new word IDs]
16. Experiments: Datasets

Model      | Corpus size [words] | Unique types [words] | N-grams (unigrams to 5-grams)
100 Mwords | 100 M               | 195 K                | 31 M
5 Gwords   | 5 G                 | 2,140 K              | 936 M
Test set   | 100 M               | 198 K                | -

Data source: publication of unexamined Japanese patent applications,
distributed with the NTCIR 3, 4, 5, and 6 patent retrieval tasks
(Iwayama et al., 2003; Fujii et al., 2004; 2005; 2007)
18. Division Method
Building a large double-array structure takes a lot of time (Nakamura and Mochizuki, 2006)
It is impractical to wait for the 5-Gword model to be built
We divide the trie into several parts and build a double-array for each part
[Figure: one trie split at the root into several smaller tries, each with its own root]
21. Discussion
DALM is smaller and faster than KenLM Probing
The smallest LM is KenLM Trie
The differences between KenLM Probing and DALM are smaller for the 5-Gword model than for the 100-Mword model
Larger language models back off less often, so back-off time has less impact
22. Conclusion
We proposed an efficient language model using double-array structures
• Double-array structures are a fast and compact representation of tries
• We use double-array structures to represent backward suffix trees
We proposed two optimization methods: embedding and ordering
• Embedding: using empty slots in the double-array to store values
• Ordering: tuning word IDs to make LMs smaller and faster
In experiments, DALM achieved the best speed among the compared LMs while keeping model size modest.