Large-Scale Learning with Less RAM
via Randomization
[Golovin+ ICML13]
ICML Reading Group, 2013/07/09
Hidekazu Oiwa (@kisa12012)
Self-introduction
• Hidekazu Oiwa
• Second-year PhD student (D2) at the University of Tokyo, Nakagawa Lab
• Fields: machine learning and natural language processing
• Strengths: large-scale optimization and sparsification
• Twitter: @kisa12012
2
Paper covered
• Large-Scale Learning with Less RAM via Randomization (ICML13)
• D. Golovin, D. Sculley, H. B. McMahan, M. Young (Google)
• http://arxiv.org/abs/1303.4664 (paper)
• First appeared at a NIPS 2012 workshop
• Figures and tables below are taken from the paper
3
One-slide summary
• Reduces the memory footprint of the weight vector
• Automatically adjusts the number of bits so the model fits in GPU memory / L1 cache
• Proposes SGD-based algorithms
• Achieves the savings with almost no loss in accuracy
• Training time: 50% less memory; prediction time: 95% less
• Also comes with a regret-based theoretical guarantee
4
Ex.: β = (1.0, . . . , 1.52) as float (32 bits) → β̂ = (1.0, . . . , 1.50) in Q2.2 (5 bits)
Introduction
Background: big data!!
• Memory capacity becomes a critical constraint
• The full dataset no longer fits in memory
• Can the data still be processed on a GPU / in L1 cache?
• Matters at prediction time, not just at training time
• Affects the latency of search ads and mail filtering
• Goal: reduce the memory used by the weight vector
6
In practice, 32-bit floats are overkill
7
[Figure 1 from the paper: histogram of coefficients in a typical large-scale linear model trained from real data. Values are tightly grouped near zero; a large dynamic range is superfluous. Vertical axis: number of features.]
This is a histogram of the weight-vector values of a linear classifier.
Do we really need 32 bits of precision to store these values?
How do we cut the number of bits?
• Just pick a fixed bit length?
• If the optimal weight vector β* cannot be represented at that fixed bit length, the iterates never converge
• Idea
• During training, adapt the representation to the step size
8
Algorithm
Notation for the bit length
• Qn.m: notation for a fixed-bit-length (fixed-point) representation
• n: bits for the integer part
• m: bits for the fractional part
• Qn.m uses (n + m + 1) bits in total
• 1 bit is the sign
• ε: gap between adjacent representable points (ε = 2^-m)
Ex.: 1.5 × 2^-1
10
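To make the Qn.m notation concrete, here is a minimal Python sketch (my own illustration, not code from the paper or the slides; the clamping range and the use of nearest rounding here are assumptions). A Qn.m value is just an integer index on a grid of gap ε = 2^-m, so the summary slide's Q2.2 example stores 1.52 as 1.50.

import math

def qnm_encode(value, n, m):
    # 1 sign bit + n integer bits + m fractional bits; grid gap eps = 2**-m
    eps = 2.0 ** -m
    max_mag = 2.0 ** n - eps                      # largest representable magnitude (assumed convention)
    clamped = max(-max_mag, min(value, max_mag))
    return int(round(clamped / eps))              # the stored (n+m+1)-bit signed integer

def qnm_decode(code, m):
    # back to float: a single integer-float multiplication by 2**-m
    return code * (2.0 ** -m)

# Q2.2 uses 5 bits and has gap eps = 0.25, so 1.52 is stored as 1.50 (cf. the Ex. above)
print(qnm_decode(qnm_encode(1.52, n=2, m=2), m=2))   # -> 1.5

During training the paper rounds randomly rather than to the nearest grid point (next slides); this sketch only shows the grid itself.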
Algorithm
11
Algorithm 1 OGD-Rand-1d (from the paper)
input: feasible set F = [−R, R], learning rate schedule η_t, resolution schedule ε_t
define fun Project(β) = max(−R, min(β, R))
Initialize β̂_1 = 0
for t = 1, . . . , T do
  Play the point β̂_t, observe g_t
  β_{t+1} = Project(β̂_t − η_t g_t)
  β̂_{t+1} ← RandomRound(β_{t+1}, ε_t)
function RandomRound(β, ε)
  a ← ε⌊β/ε⌋ ; b ← ε⌈β/ε⌉
  return b with prob. (β − a)/ε, a otherwise
The update itself is plain (projected) SGD; RandomRound then snaps the iterate onto the Qn.m grid.
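Below is a short runnable Python sketch of Algorithm 1 as I read it off the slide. The η_t = α/√t schedule, the toy quadratic loss, and all names are my own choices for illustration, not the authors' implementation.

import math, random

def random_round(beta, eps):
    # snap beta to a multiple of eps so that the expectation equals beta
    a = eps * math.floor(beta / eps)        # grid point just below
    b = eps * math.ceil(beta / eps)         # grid point just above
    if a == b:
        return beta                         # already on the grid
    return b if random.random() < (beta - a) / eps else a

def ogd_rand_1d(grad_fn, T, R=1.0, alpha=0.1, gamma=0.5):
    # One-dimensional OGD with randomized rounding, following the listing above.
    beta_hat = 0.0
    for t in range(1, T + 1):
        g = grad_fn(beta_hat, t)                        # play beta_hat, observe g_t
        eta = alpha / math.sqrt(t)                      # learning-rate schedule eta_t (assumed)
        eps = gamma * eta                               # resolution schedule eps_t = gamma * eta_t
        beta = max(-R, min(beta_hat - eta * g, R))      # Project(beta_hat - eta_t * g_t)
        beta_hat = random_round(beta, eps)              # coarsen to the current grid
    return beta_hat

# Toy usage: gradients of (beta - 0.3)^2; the iterate settles near 0.3 on an ever finer grid.
print(ogd_rand_1d(lambda b, t: 2.0 * (b - 0.3), T=2000))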
RandomRound illustrated
12
Example: β sits between two grid points. If its distances to the two neighbors are in the ratio a : b, it is rounded to each neighbor with probability proportional to the distance from the other one, so the expectation equals β. With distances in ratio 1 : 4, β̂ is rounded to the near grid point with probability 80% and to the far one with probability 20%.
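A quick numerical check of this picture (my own snippet): on a grid of gap 0.25, the point 1.05 has distances 0.05 and 0.20 to its neighbors 1.0 and 1.25, i.e. the 1 : 4 ratio above, so it should round down about 80% of the time while its average stays 1.05.

import math, random

def random_round(beta, eps):                 # same helper as in the previous sketch
    a = eps * math.floor(beta / eps)
    b = eps * math.ceil(beta / eps)
    return b if random.random() < (beta - a) / eps else a

samples = [random_round(1.05, 0.25) for _ in range(100000)]
print(sum(s == 1.25 for s in samples) / len(samples))   # ~0.20 (rounded up to the far point)
print(sum(samples) / len(samples))                      # ~1.05 (unbiased)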
How to choose the gap ε_t
• Given the SGD step size η_t and an arbitrary constant γ > 0,
• set the bit length so that ε_t ≤ γ η_t
• then the expected regret is bounded by O(√T)
13
Thm. 3.1 (from the paper). Consider running Algorithm 1 with an adaptive non-increasing learning-rate schedule η_t and a discretization schedule ε_t such that ε_t ≤ γ η_t for a constant γ > 0. Then, against any sequence of gradients g_1, . . . , g_T (possibly selected by an adaptive adversary) with |g_t| ≤ G, and against any comparator point β* ∈ [−R, R],
E[Regret(β*)] ≤ (2R)^2 / (2η_T) + ½ (G^2 + γ^2) η_{1:T} + γ R √T.
The paper also notes that the added CPU cost is negligible: converting back to floating point is a single integer-float multiplication by 2^-m, and the randomized rounding needs one pseudo-random number (roughly 18-20 flops), while most large-scale learners are I/O bound anyway.
As γ → 0, this recovers the regret bound of the float (unrounded) version.
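As a worked check (mine, not on the slide) of why this is O(√T): with the usual schedule η_t = α/√t we have η_{1:T} = Σ_t α/√t ≤ 2α√T, so the theorem's bound becomes

\[
E[\mathrm{Regret}(\beta^{*})]
\;\le\; \frac{(2R)^{2}}{2\eta_{T}} + \tfrac{1}{2}\,(G^{2}+\gamma^{2})\,\eta_{1:T} + \gamma R\sqrt{T}
\;\le\; \Bigl(\tfrac{2R^{2}}{\alpha} + \alpha\,(G^{2}+\gamma^{2}) + \gamma R\Bigr)\sqrt{T},
\]

which matches the unrounded O(√T) rate up to the γ terms, and those vanish as γ → 0.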
Per-coordinate learning rates (a.k.a. AdaGrad) [Duchi+ COLT10]
• Use a different step size for each feature
• Features that appear frequently have their step size decay quickly
• Features that appear only rarely keep a relatively large step size
• A technique for converging to the optimum very quickly
• But it requires storing an occurrence count for every feature
• normally a 32-bit int
• approximated with the Morris algorithm [Morris+ 78] -> 8 bits
14
Morris Algorithm
• Turn the frequency counter into a random variable
• Each time the feature appears, do the following:
• with probability p(C) = b^-C, increment C by 1
• return τ̃(C) = (b^C − b) / (b − 1) as the frequency
• τ̃(C) is an unbiased estimator of the true count (a small counter sketch follows the paper excerpt below)
15
(Excerpt from the paper) Per-coordinate learning rates require storing a unique count τ_i for each coordinate, where τ_i is the number of times coordinate i has appeared with a non-zero gradient so far. Significant space is saved by using an 8-bit randomized counting scheme rather than a 32-bit (or 64-bit) integer to store the d total counts. We use a variant of Morris' probabilistic counting algorithm (1978) analyzed by Flajolet (1985). Specifically, we initialize a counter C = 1, and on each increment operation, we increment C with probability p(C) = b^-C, where base b is a parameter. We estimate the count as τ̃(C) = (b^C − b)/(b − 1), which is an unbiased estimator of the true count. We then use learning rates η_{t,i} = α/√(τ̃_{t,i} + 1), which ensures that even when τ̃_{t,i} = 0 we don't divide by zero. We compute high-probability bounds on this counter in Lemma A.1. Using these bounds for η_{t,i} in conjunction with Theorem 3.1, we obtain the following result (proof deferred to the appendix).
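A minimal Python sketch of this Morris-style counter (my own; the base b = 1.1 and the class interface are illustrative choices, not from the paper):

import random

class MorrisCounter:
    # C starts at 1, is incremented with probability b**(-C),
    # and tau(C) = (b**C - b) / (b - 1) is an unbiased estimate of the true count.
    def __init__(self, base=1.1):
        self.b = base
        self.C = 1

    def increment(self):
        if random.random() < self.b ** (-self.C):
            self.C += 1

    def estimate(self):
        return (self.b ** self.C - self.b) / (self.b - 1.0)

# Usage: after 10,000 increments, C is still well below 256 (so it fits in 8 bits),
# while estimate() is around 10,000 on average.
counter = MorrisCounter()
for _ in range(10000):
    counter.increment()
print(counter.C, counter.estimate())

The per-coordinate learning rate then becomes η_{t,i} = α/√(τ̃_{t,i} + 1), as in the excerpt above.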
Per-coordinate version of the algorithm
16
Algorithm 2 OGD-Rand (from the paper)
input: feasible set F = [−R, R]^d, parameters α, γ > 0
Initialize β̂_1 = 0 ∈ R^d; ∀i, τ_i = 0
for t = 1, . . . , T do
  Play the point β̂_t, observe loss function f_t
  for i = 1, . . . , d do
    let g_{t,i} = ∇f_t(x_t)_i
    if g_{t,i} = 0 then continue
    τ_i ← τ_i + 1
    let η_{t,i} = α/√τ_i and ε_{t,i} = γ η_{t,i}
    β_{t+1,i} ← Project(β̂_{t,i} − η_{t,i} g_{t,i})
    β̂_{t+1,i} ← RandomRound(β_{t+1,i}, ε_{t,i})
Counting the frequencies: Morris' algorithm brings this down to 8 bits.
The frequency information determines the per-coordinate step size.
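And a per-coordinate Python sketch in the same spirit as the Algorithm 2 listing (my own reading; it keeps exact counts τ_i as in the listing, and swapping in the MorrisCounter sketch above would give the 8-bit variant):

import math, random

def random_round(beta, eps):
    a = eps * math.floor(beta / eps)
    b = eps * math.ceil(beta / eps)
    return b if random.random() < (beta - a) / eps else a

def ogd_rand(grad_fn, T, d, R=1.0, alpha=0.1, gamma=0.5):
    # grad_fn(beta_hat, t) returns the gradient vector of the loss f_t at the played point.
    beta_hat = [0.0] * d
    tau = [0] * d                               # per-coordinate non-zero-gradient counts
    for t in range(1, T + 1):
        g = grad_fn(beta_hat, t)
        for i in range(d):
            if g[i] == 0.0:
                continue                        # skip coordinates that did not appear
            tau[i] += 1
            eta = alpha / math.sqrt(tau[i])     # per-coordinate learning rate
            eps = gamma * eta                   # per-coordinate resolution
            beta = max(-R, min(beta_hat[i] - eta * g[i], R))
            beta_hat[i] = random_round(beta, eps)
    return beta_hat

# Toy usage: coordinate 1 appears only ~10% of the time, so it keeps a larger step size.
target = [0.3, -0.5]
def toy_grad(beta, t):
    active = [True, random.random() < 0.1]
    return [2.0 * (beta[i] - target[i]) if active[i] else 0.0 for i in range(2)]
print(ogd_rand(toy_grad, T=5000, d=2))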
Even coarser approximation at prediction time
• At prediction time, as long as the effect on predictions stays small, the bit count can be cut much more aggressively
• Lemma 4.1, 4.2, Theorem 4.3: analysis of the error that can arise for logistic loss, as a function of how coarse the approximation is
• With additional compression, the memory can be pushed down to the information-theoretic lower bound
17
[Figure 1 from the paper again: coefficient values are tightly grouped near zero, so a large dynamic range is superfluous.]
However, this lower bound is not all that small.
Experiments
RCV1 Dataset
19
RCV1: Train 20,242 / Test 677,399 / Features 47,236
CTR Dataset
A proprietary (non-public) search-ad click-log dataset
20
CTR: 30M examples, 20M features
[Table 2 from the paper, Rounding at Training Time: the fixed q2.13 encoding is 50% smaller than the control with no loss. Per-coordinate learning rates significantly improve predictions but use 64 bits per value; randomized counting reduces this to ..., and adaptive or fixed precision reduces memory use further, to 24 total bits per value or less. The benefit of adaptive precision is seen more on the larger CTR data.]
The paper also reports a CTR dataset with billions of examples and billions of features, and says the results there are essentially the same.
Approximation quality of the prediction model
21
Table 1 (from the paper). Rounding at Prediction Time for CTR Data. Fixed-point encodings are compared to a 32-bit floating-point control model. Added loss is negligible even when using only 1.5 bits per value with optimal encoding.
Encoding | AucLoss | Opt. Bits/Val
q2.3 | +5.72% | 0.1
q2.5 | +0.44% | 0.5
q2.7 | +0.03% | 1.5
q2.9 | +0.00% | 3.3
(Excerpt) Values may be encoded with fewer bits. The theoretical bound for a whole model with d coefficients is (1/d) Σ_{i=1}^d −log p(β_i) bits per value, where p(v) is the probability of occurrence of v across all d dimensions.
"Opt. Bits/Val" is the size when the model is compressed down to this information-theoretic lower bound.
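The "Opt. Bits/Val" column is exactly this per-value bound; a small Python sketch (mine) of how it could be computed from an already rounded coefficient vector:

import math
from collections import Counter

def optimal_bits_per_value(coefficients):
    # Average of -log2 p(v) over the model, where p(v) is the empirical frequency
    # of each rounded coefficient value: the bound (1/d) * sum_i -log p(beta_i) quoted above.
    d = len(coefficients)
    counts = Counter(coefficients)
    return sum(-math.log2(counts[v] / d) for v in coefficients) / d

# Example: a model whose rounded coefficients are mostly exactly zero needs well under 1 bit/value.
coeffs = [0.0] * 900 + [0.125] * 50 + [-0.125] * 50
print(optimal_bits_per_value(coeffs))   # ~0.57 bits per value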
Summary
• Reducing the memory footprint of the weight vector
• Randomized Rounding
• Training and prediction with a model that fits in GPU memory / L1 cache
• The extension to FOBOS and similar methods is also "straightforward"
• ...or so the authors write, with a proof sketch in a footnote
• whether it really holds is something each reader should check for themselves
22
Ex.: β = (1.0, . . . , 1.52) as float (32 bits) → β̂ = (1.0, . . . , 1.50) in Q2.2 (5 bits)