EMNLP 2013

An Efficient Language Model
Using Double-Array Structures

Makoto Yasuhara, Toru Tanaka
Jun-ya Norimatsu, Mikio Yamamoto

University of Tsukuba, Japan
Introduction(1)
Bigger and Bigger LMs
Have you ever encountered these problems?
LMs cannot be loaded into memory because of their size
The query speed of LMs becomes a bottleneck of your system

Store compactly, query fast!
Our System Overview
• LM implementation based on double-array structures

• Modified double-array structure to store backward suffix trees

• Two optimization methods to improve efficiency

We call our LM “DALM”
Double-Array Structures
(Aoe, 1989)
What is a double-array structure?
A fast and compact representation of a trie
A trie is represented by two arrays (BASE and CHECK)
[Figure: abstract image of a trie with nodes ROOT, A and B, next to its double-array representation (BASE and CHECK)]
2D Array Implementation of a Trie
[Figure: a trie with nodes numbered 1 to 7 over the symbols A, B and C, next to its 2D array indexed by Node# and symbol]
Sparse array

Simple and fast but consumes a lot of memory
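
As a rough illustration of the 2D-array approach, the sketch below stores a trie as a table with one row per node and one column per symbol. The three-symbol alphabet, the example keys and the names `build_table` and `contains_path` are assumptions made for illustration; they are not taken from the slides.

```python
ALPHABET = ["A", "B", "C"]               # assumed toy alphabet
COL = {s: i for i, s in enumerate(ALPHABET)}
EMPTY = 0                                # 0 marks "no transition"; nodes are numbered from 1


def build_table(keys):
    """Build a dense (node by symbol) transition table for the given keys."""
    table = [None, [EMPTY] * len(ALPHABET)]   # index 0 unused, node 1 is ROOT
    for key in keys:
        node = 1
        for sym in key:
            nxt = table[node][COL[sym]]
            if nxt == EMPTY:
                # Allocate a new row for the new node.
                table.append([EMPTY] * len(ALPHABET))
                nxt = len(table) - 1
                table[node][COL[sym]] = nxt
            node = nxt
    return table


def contains_path(table, key):
    """Follow one table lookup per symbol; fail on the first empty cell."""
    node = 1
    for sym in key:
        node = table[node][COL[sym]]
        if node == EMPTY:
            return False
    return True


table = build_table(["AB", "AC", "BC", "CA"])
cells = (len(table) - 1) * len(ALPHABET)
used = sum(1 for row in table[1:] for cell in row if cell != EMPTY)
print(used, "of", cells, "cells used")
```

Each step is a single array access, but most cells stay empty, which is the memory cost the slide refers to.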
Compact Representation of a
Sparse 2D Array
[Figure: rows of the sparse 2D array (Node# by symbols A, B, C) are shifted by different offsets and merged into a single Merged-NEXT array]
Information loss!

Double-array structure modified
to include all information about the original trie
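
The shift-and-merge idea can be sketched as follows. This is a minimal first-fit sketch under stated assumptions, not the construction used by the authors: each row of a sparse table is shifted until its non-empty cells land on free positions of one merged array. The toy rows and the names `merge_rows`, `merged`, `owner` and `lookup` are hypothetical. Recording which row owns each merged cell is one way to keep the information that a plain merge loses.

```python
def merge_rows(rows):
    """First-fit row displacement.

    rows maps a row id to {column: value}. Returns (shift, merged, owner),
    where merged[shift[r] + c] holds rows[r][c] and owner records the row id.
    """
    merged, owner, shift = {}, {}, {}
    for r, cells in rows.items():
        s = 0
        while any((s + c) in merged for c in cells):
            s += 1                       # try the next offset until nothing collides
        shift[r] = s
        for c, v in cells.items():
            merged[s + c] = v
            owner[s + c] = r
    return shift, merged, owner


# Toy sparse table: row id -> {column index: next node}
rows = {1: {0: 2, 1: 3}, 2: {1: 4, 2: 5}, 3: {2: 6}}
shift, merged, owner = merge_rows(rows)


def lookup(r, c):
    """Return the cell for (row r, column c), or None if that cell is empty."""
    pos = shift[r] + c
    # Without the owner check, row 3 / column 0 would return a cell of row 2.
    return merged[pos] if owner.get(pos) == r else None
```

Here `owner` plays the role that the CHECK array plays in a double array: it confirms that a merged cell really belongs to the row that is asking for it.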
Details of Double-Array Structures
(Aoe, 1989)
Definition:
Example:
[Figure: example trie over the symbols A, B and C with its BASE and CHECK arrays]
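
The definition on this slide did not survive extraction. As the double array is commonly described, a transition from node s on symbol c leads to t = BASE[s] + c, and it is valid only if CHECK[t] = s. The sketch below applies that rule to a small hand-built example; the symbol codes, the array contents and the names `step` and `walk` are assumptions for illustration and are not the arrays shown in the figure.

```python
SYMBOL_CODE = {"A": 1, "B": 2, "C": 3}   # assumed symbol codes

# Hand-built arrays for the paths A, B, AB, AC and BC (ROOT is index 0).
BASE = [0, 1, 2, 0, 0, 0]
CHECK = [-1, 0, 0, 1, 1, 2]              # -1 marks a slot with no parent


def step(s, symbol):
    """One transition: t = BASE[s] + code, accepted only if CHECK[t] == s."""
    t = BASE[s] + SYMBOL_CODE[symbol]
    if 0 <= t < len(CHECK) and CHECK[t] == s:
        return t
    return None


def walk(symbols):
    """Follow a symbol sequence from ROOT; return the final index or None."""
    s = 0
    for symbol in symbols:
        s = step(s, symbol)
        if s is None:
            return None
    return s
```

With these arrays, `walk("AC")` ends at index 4 and `walk("BC")` at index 5, while `walk("CA")` fails at the first step because CHECK[3] names node 1, not ROOT, as the parent.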
Efficient Trie Representations for
N-gram Models
Backward suffix trees
(Bell et al., 1990; Stockle, 2002; Germann et al., 2009)
History words are stored in reverse order
Target words are stored in separate lists

Efficient back-off
[Figure: example lookup in a backward suffix tree in which the B node is not found]
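
A minimal sketch of how a backward suffix tree supports back-off, assuming standard back-off with stored back-off weights (a missing history counts as weight 1). The `Node` class, the function names and the probabilities are hypothetical and use plain dictionaries rather than a double array.

```python
class Node:
    def __init__(self):
        self.children = {}   # previous history word -> Node
        self.targets = {}    # target word -> probability
        self.bow = 1.0       # back-off weight of this history


def node_for(root, history):
    """Return (creating if needed) the node for a history, stored in reverse order."""
    node = root
    for word in reversed(history):       # most recent history word first
        node = node.children.setdefault(word, Node())
    return node


def probability(root, history, target):
    # Walk as deep as the reversed history allows, remembering the path.
    path = [root]
    for word in reversed(history):
        child = path[-1].children.get(word)
        if child is None:
            break                        # longer histories are not in the model
        path.append(child)
    # Try the longest matched history first, then back off to shorter ones.
    weight = 1.0
    for node in reversed(path):
        if target in node.targets:
            return weight * node.targets[target]
        weight *= node.bow
    return 0.0                           # target unseen even as a unigram


root = Node()
root.targets.update({"X": 0.2, "Y": 0.1})
a = node_for(root, ["A"])
a.targets["X"] = 0.5
a.bow = 0.4
ba = node_for(root, ["B", "A"])
ba.targets["Y"] = 0.7
ba.bow = 0.5
```

Because the most recent word sits nearest the root, every shorter history lies on the path to the longer one, so a single downward walk finds all the nodes needed for back-off.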
Endmarker Symbols for
Backward Suffix Trees
Endmarker symbols (Aoe, 1989) are placed after history words
Target word follows the endmarker symbol
[Figure: the backward suffix tree before and after endmarker symbols (#) are inserted]
Double-array Representation of
Backward Suffix Trees
Endmarker symbols are treated as words
A word ID is assigned to the endmarker symbol

[Figure: backward suffix tree with endmarker symbols and its BASE and CHECK arrays]
Double-array Language Model:
Simple Structures
Introducing a VALUE array
[Figure: example trie with its BASE, CHECK and VALUE arrays]
The VALUE array contains corresponding
probabilities and back-off weights (BOW)
Double-array Language Model:
Embedding structures (1)
Filling unused slots with values
[Figure: BASE and CHECK arrays of the example trie, with the unused slots marked]

These empty slots are used to store values
Double-array Language Model:
Embedding structures (2)
Using the BASE and CHECK arrays to store values
[Figure: BASE, CHECK and VALUE arrays of the example trie; one slot holds -2]
Lossless quantization
Index of the VALUE array with a negative sign
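
One possible reading of this slide, sketched under loud assumptions: "lossless quantization" is taken here to mean replacing each value by its index in a table of distinct values, and a negative number in a slot is taken to mean "this is a reference into the VALUE array". The slides do not give the exact encoding; the offset by one (so that index 0 is representable) and the names `quantize`, `encode_ref` and `decode_ref` are hypothetical.

```python
def quantize(values):
    """Map each value to the index of its entry in a table of distinct values."""
    table = sorted(set(values))
    index = {v: i for i, v in enumerate(table)}
    return table, [index[v] for v in values]


def encode_ref(i):
    """Store a VALUE index as a negative number (assumed offset by one)."""
    return -(i + 1)


def is_ref(slot):
    """Negative slots are VALUE references; non-negative slots are ordinary entries."""
    return slot < 0


def decode_ref(slot):
    return -slot - 1


log_probs = [-1.5, -0.3, -1.5, -2.0]           # made-up values
value_table, codes = quantize(log_probs)
slots = [encode_ref(c) for c in codes]
restored = [value_table[decode_ref(s)] for s in slots]
```

The round trip is exact because the table keeps every distinct value unchanged, which is the sense in which this kind of quantization is lossless.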
Double-array Language Model:
Ordering method (1)
Tuning for word IDs
We assign word IDs in order of unigram probability
Sort the words in order of descending probability

Word   Word ID   P(Word)
#      1         -
B      2         0.0413
X      3         0.0300
A      4         0.0284
Y      5         0.0201
C      6         0.0101
Z      7         0.0050
D      8         0.0020
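
The ID assignment in the table can be sketched as follows. The probabilities are those shown on the slide; the function name `assign_word_ids` and the choice to always give the endmarker ID 1 are assumptions based on the table, where # has ID 1 and no probability.

```python
unigram_prob = {
    "A": 0.0284, "B": 0.0413, "C": 0.0101, "D": 0.0020,
    "X": 0.0300, "Y": 0.0201, "Z": 0.0050,
}


def assign_word_ids(probs, endmarker="#"):
    """Give the endmarker ID 1, then number words by descending unigram probability."""
    ordered = sorted(probs, key=probs.get, reverse=True)
    ids = {endmarker: 1}
    for word_id, word in enumerate(ordered, start=2):
        ids[word] = word_id
    return ids


word_ids = assign_word_ids(unigram_prob)
```

Frequent words receive small IDs, which is the ordering the slide describes.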
Double-array Language Model:
Ordering method (2)
Modifying the 2D array
[Figure: the 2D array (Node# by word) before ordering and after ordering]
Experiments: Datasets
Model        Corpus size [words]   Unique types [words]   N-grams (unigrams to 5-grams)
100 Mwords   100 M                 195 K                  31 M
5 Gwords     5 G                   2,140 K                936 M
Test set     100 M                 198 K                  -

Data source
Publication of unexamined Japanese patent applications
Distributed with the NTCIR 3,4,5,6 patent retrieval task
(Iwayama et al., 2003; Fujii et al., 2004;2005;2007)
Comparison: Proposed Methods
Results for 100-Mword corpus
Division Method
Building a large double-array structure needs a lot of time
(Nakamura and Mochizuki, 2006)

It is impractical to wait for the 5-Gword model to be built

Dividing the trie into several parts
[Figure: the trie is divided into several parts, each with its own ROOT]
Experiments: Division Methods
Results for 100-Mword corpus
Experiments: Other Methods
Results for 100-Mword and 5-Gword corpora
Discussion
DALM is smaller and faster than KenLM Probing
The smallest LM is KenLM Trie
The differences between KenLM Probing and DALM are
smaller for the 5-Gword model than for the 100-Mword model

Large language models require shorter back-off time
Conclusion
We proposed an efficient language model using double-array structures
• Double-array structures are a fast and compact representation of tries
• We use double-array structures to represent backward suffix trees

We proposed two optimization methods: embedding and ordering
• Embedding: using empty slots in the double-array to store values
• Ordering: tuning word IDs to make LMs smaller and faster

In experiments, DALM achieved the best speed among the compared
LMs while keeping a modest model size.
Questions…
My English skills are limited.
Please speak slowly if you have any questions.
