Presentation slides from EMNLP 2013.
Paper : http://aclweb.org/anthology/D/D13/
Direct Link : http://aclweb.org/anthology/D/D13/D13-1023.pdf
Source Code : https://github.com/jnory/DALM
An Efficient Language Model Using Double-Array Structures
1. EMNLP 2013
An Efficient Language Model
Using Double-Array Structures
Makoto Yasuhara, Toru Tanaka
Jun-ya Norimatsu, Mikio Yamamoto
University of Tsukuba, Japan
2. Introduction (1)
Bigger and Bigger LMs
Have you ever encountered these problems?
LMs cannot be loaded into memory because of their size
Querying the LM becomes a bottleneck in your system
Store compactly, query fast!
3. Our System Overview
• LM implementation based on double-array structures
• Modified double-array structure to store backward suffix trees
• Two optimization methods to improve efficiency
We call our LM “DALM”
4. Double-Array Structures
(Aoe, 1989)
What is a double-array structure?
A fast and compact representation of a trie
A trie is represented by two arrays (BASE and CHECK)
[Figure: a small trie (ROOT with edges A and B) and its double-array representation]
5. 2D Array Implementation of a Trie
[Figure: a trie (nodes 1-7, edges labeled A, B, C) stored as a sparse 2D array indexed by node number and input character]
Sparse array: simple and fast, but consumes a lot of memory
6. Compact Representation of a Sparse 2D Array
[Figure: each row of the sparse 2D array is shifted until its occupied slots fall into the gaps left by the other rows, and the shifted rows are merged into a single NEXT array]
Information loss! The merged NEXT array alone no longer records which row each slot came from
The double-array structure modifies this scheme to include all information about the original trie
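The shift-and-merge idea above can be sketched as a greedy packing routine. This is illustrative only, not DALM's actual construction algorithm; the row/column encoding is an assumption for the example:

```python
def merge_rows(rows):
    """Greedily pack sparse rows into one array.

    rows: dict mapping row id -> dict {column: value}.
    Returns (next_array, shift) where shift[row] is the offset
    applied to that row's columns. Note that next_array alone
    cannot tell which row a slot came from -- the information
    loss the slide points out, which CHECK later repairs.
    """
    next_array = {}   # merged NEXT: index -> value
    shift = {}
    for row_id, cells in rows.items():
        offset = 0
        # find the smallest shift where no occupied slot collides
        while any(offset + col in next_array for col in cells):
            offset += 1
        shift[row_id] = offset
        for col, val in cells.items():
            next_array[offset + col] = val
    return next_array, shift

# toy rows from a sparse 2D table (columns are character codes)
rows = {1: {1: 2, 2: 3}, 2: {1: 4, 3: 5}, 3: {3: 6}}
merged, shifts = merge_rows(rows)
```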
7. Details of Double-Array Structures
(Aoe, 1989)
Definition: the child of node s on input character c is t = BASE[s] + c, and the transition is valid iff CHECK[t] = s
Example:
[Figure: a trie over the symbols A, B, C and the corresponding BASE and CHECK arrays]
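The definition above translates directly into a lookup routine. A minimal sketch; the toy BASE/CHECK arrays are illustrative, not taken from the paper:

```python
# Double-array trie lookup (Aoe, 1989): the child of node s on
# character code c is t = BASE[s] + c, valid only when CHECK[t] == s.
def transition(base, check, s, c):
    t = base[s] + c
    if 0 <= t < len(check) and check[t] == s:
        return t
    return None  # no such edge

def lookup(base, check, codes):
    """Follow a sequence of character codes from the root (node 0)."""
    s = 0
    for c in codes:
        s = transition(base, check, s, c)
        if s is None:
            return None
    return s

# Toy arrays (illustrative): root 0 has edge 1 -> node 1 and
# edge 2 -> node 2; node 1 has edge 2 -> node 4.
BASE  = [0, 2, 0, 0, 0]
CHECK = [-1, 0, 0, -1, 1]   # -1 marks unused slots
```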
8. Efficient Trie Representations for N-gram Models
Backward suffix trees (Bell et al., 1990; Stolcke, 2002; Germann et al., 2009)
History words are stored in reverse order
Target words are stored in separate lists
Efficient back-off: when a node (e.g. the B node) is not found, the nodes visited so far already correspond to the shorter, backed-off history
[Figure: a backward suffix tree over histories built from X, Y, Z, A, B, C]
9. Endmarker Symbols for Backward Suffix Trees
Endmarker symbols (Aoe, 1989) are placed after the history words
The target word follows the endmarker symbol
[Figure: a backward suffix tree in which each reversed history is followed by an endmarker (#) node, which in turn leads to the target words]
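Putting slides 8 and 9 together, the trie path for an n-gram can be sketched as follows. This is an illustrative key layout implied by the slides, not DALM's internal encoding:

```python
END = "#"  # the endmarker symbol, treated as an ordinary word

def trie_key(ngram):
    """Path for an n-gram (w1 ... wn) in the backward suffix tree:
    the history words in reverse order, then the endmarker, then
    the target word."""
    *history, target = ngram
    return list(reversed(history)) + [END, target]

trie_key(["X", "Y", "Z"])  # -> ["Y", "X", "#", "Z"]
```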
10. Double-array Representation of Backward Suffix Trees
Endmarker symbols are treated as words: a word ID is assigned to the endmarker symbol
[Figure: a backward suffix tree and its BASE and CHECK arrays, with the endmarker stored like any other word]
11. Double-array Language Model: Simple Structures
Introducing a VALUE array
The VALUE array contains the corresponding probabilities and back-off weights (BOW)
[Figure: BASE and CHECK arrays paired with a VALUE array that holds the probability/BOW for each stored n-gram]
12. Double-array Language Model: Embedding Structures (1)
Filling unused slots with values
The double-array contains unused slots; these empty slots are used to store values
[Figure: BASE and CHECK arrays with their unused slots highlighted as storage for values]
13. Double-array Language Model: Embedding Structures (2)
Using the BASE and CHECK arrays to store values
Values are quantized losslessly; a slot can also hold the index of a VALUE array entry, marked with a negative sign
[Figure: a BASE/CHECK slot holding -2 pointing into the VALUE array]
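The negative-index convention can be sketched as a small decoder. The slot layout here is illustrative, not DALM's exact format:

```python
def decode_slot(slot, value_array):
    """A slot holding a negative number -i is an indirection into
    the VALUE array (VALUE[i]); anything else is the losslessly
    quantized value stored inline in the slot itself."""
    if slot < 0:
        return value_array[-slot]  # follow the negative index
    return slot                    # value embedded in place

VALUE = [None, 0.25, 0.5]  # VALUE[1], VALUE[2] hold spilled entries
decode_slot(-2, VALUE)     # -> 0.5
decode_slot(7, VALUE)      # -> 7 (stored inline)
```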
14. Double-array Language Model: Ordering Method (1)
Tuning word IDs: we assign word IDs in order of unigram probability
Sort the words in order of descending probability:

Word ID | P(word) | Word
      1 |  0.0413 | B
      2 |  0.0300 | X
      3 |  0.0284 | A
      4 |  0.0201 | Y
      5 |  0.0101 | C
      6 |  0.0050 | Z
      7 |  0.0020 | D
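The ID assignment above is a straightforward sort. A minimal sketch reproducing the table (the probabilities are the slide's toy values):

```python
def assign_word_ids(unigram_probs):
    """Give small word IDs to frequent words: sort by descending
    unigram probability and number the words from 1."""
    ranked = sorted(unigram_probs.items(), key=lambda kv: -kv[1])
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

probs = {"A": 0.0284, "B": 0.0413, "C": 0.0101,
         "D": 0.0020, "X": 0.0300, "Y": 0.0201, "Z": 0.0050}
assign_word_ids(probs)
# -> {'B': 1, 'X': 2, 'A': 3, 'Y': 4, 'C': 5, 'Z': 6, 'D': 7}
```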
15. Double-array Language Model: Ordering Method (2)
Modifying the 2D array
[Figure: the node-by-word 2D array before and after ordering, with the columns renumbered according to the new word IDs]
16. Experiments: Datasets

Model      | Corpus size [words] | Unique types [words] | N-grams (unigrams to 5-grams)
100 Mwords | 100 M               | 195 K                | 31 M
5 Gwords   | 5 G                 | 2,140 K              | 936 M
Test set   | 100 M               | 198 K                | -

Data source: publication of unexamined Japanese patent applications,
distributed with the NTCIR 3, 4, 5, and 6 patent retrieval tasks
(Iwayama et al., 2003; Fujii et al., 2004; 2005; 2007)
18. Division Method
Building a large double-array structure takes a lot of time (Nakamura and Mochizuki, 2006)
It is impractical to wait for the 5-Gword model to be built
We divide the trie into several parts and build a double-array for each part
[Figure: one trie split at the root into several smaller tries, each with its own root]
21. Discussion
DALM is smaller and faster than KenLM Probing
The smallest LM is KenLM Trie
The differences between KenLM Probing and DALM are smaller for the 5-Gword model than for the 100-Mword model
Larger language models back off less often, so back-off time has less impact
22. Conclusion
We proposed an efficient language model using double-array structures
• Double-array structures are a fast and compact representation of tries
• We use double-array structures to represent backward suffix trees
We proposed two optimization methods: embedding and ordering
• Embedding: using empty slots in the double-array to store values
• Ordering: tuning word IDs to make LMs smaller and faster
In experiments, DALM achieved the best speed among the compared LMs while keeping model size modest.