3. The paper we read
•Large-Scale Learning with Less RAM via Randomization (ICML 2013)
•D. Golovin, D. Sculley, H. B. McMahan, M. Young (Google)
•http://arxiv.org/abs/1303.4664 (paper)
•First presented at a NIPS 2012 workshop
•Figures and tables are quoted from the paper
7. In practice, float is overkill
[Figure 1 from the paper] Histogram of coefficients in a typical large-scale linear model trained from real data. Values are tightly grouped near zero; a large dynamic range is superfluous.
Do we really need to keep 32 bits of precision per value?
Shown: a histogram of the weight-vector values of a linear classifier; the counts are numbers of distinct features.
11. Algorithm
Algorithm 1 OGD-Rand-1d
input: feasible set F = [−R, R], learning rate schedule η_t, resolution schedule ε_t
define fun Project(β) = max(−R, min(β, R))
Initialize β̂_1 = 0
for t = 1, ..., T do
  Play the point β̂_t, observe g_t
  β_{t+1} = Project(β̂_t − η_t g_t)
  β̂_{t+1} ← RandomRound(β_{t+1}, ε_t)

function RandomRound(β, ε)
  a ← ε⌊β/ε⌋ ; b ← ε⌈β/ε⌉
  return b with prob. (β − a)/ε, a otherwise
The plain SGD update.
RandomRound then drops the stored value down to a Qn.m fixed-point bit width. (A Python sketch follows below.)
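To make the update concrete, here is a minimal Python sketch of OGD-Rand-1d (my own reimplementation; the function and variable names are mine, not the paper's).

```python
import math
import random

def random_round(beta, eps):
    """Unbiased randomized rounding of beta onto a grid of resolution eps."""
    a = eps * math.floor(beta / eps)   # nearest gridpoint below
    b = eps * math.ceil(beta / eps)    # nearest gridpoint above
    if a == b:                         # beta already lies on the grid
        return beta
    return b if random.random() < (beta - a) / eps else a

def ogd_rand_1d(gradients, R, eta, eps):
    """One-dimensional OGD with randomized rounding (sketch of Algorithm 1).

    gradients : iterable of observed gradients g_t
    R         : the feasible set is [-R, R]
    eta, eps  : learning-rate and resolution schedules, called as eta(t), eps(t)
    """
    beta_hat = 0.0
    for t, g in enumerate(gradients, start=1):
        beta = max(-R, min(beta_hat - eta(t) * g, R))   # gradient step, then Project
        beta_hat = random_round(beta, eps(t))           # keep only the coarse value
    return beta_hat
```

For example, eta = lambda t: 0.5 / math.sqrt(t) and eps = lambda t: 0.1 * eta(t) realize an ε_t ≤ γη_t schedule with γ = 0.1.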
13. How to choose the gap
•From the SGD step size η_t > 0 and an arbitrary constant γ > 0,
•set the bit length so that ε_t ≤ γη_t is satisfied
•The regret is then bounded above by O(√T)
[…] grid of resolution ε_t on round t, and an adaptive learning rate η_t. We then run one copy of this algorithm for each coordinate of the original convex problem, implying that we can choose the η_t and ε_t schedules appropriately for each coordinate. For simplicity, we assume the ε_t resolutions are chosen so that −R and +R are always gridpoints. Algorithm 1 gives the one-dimensional version, which is run independently on each coordinate (with a different learning rate and discretization schedule) in Algorithm 2. The core result is a regret bound for Algorithm 1 (omitted proofs can be found in the Appendix):

Theorem 3.1. Consider running Algorithm 1 with adaptive non-increasing learning-rate schedule η_t, and discretization schedule ε_t such that ε_t ≤ γη_t for a constant γ > 0. Then, against any sequence of gradients g_1, ..., g_T (possibly selected by an adaptive adversary) with |g_t| ≤ G, against any comparator point β* ∈ [−R, R], we have

E[Regret(β*)] ≤ (2R)² / (2η_T) + ½ (G² + γ²) η_{1:T} + γR√T.

By choosing γ sufficiently small, we obtain an expected regret bound that is indistinguishable from the non-rounded version (which is obtained by taking γ = 0).
Thm. 3.1 (where η_{1:T} denotes Σ_{t=1}^T η_t).
Converting back to a floating point representation requires a single integer-float multiplication (by 2^-m). Randomized rounding requires a call to a pseudo-random number generator, which may be done in 18-20 flops. Overall, the added CPU overhead is negligible, especially as many large-scale learning methods are I/O bound reading from disk or network.
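As a small illustration of this point (my own sketch, not code from the paper), a coefficient stored as an integer count of 2^-m-sized steps decodes with a single multiply; m = 13 below is only meant to mimic the q2.13 encoding mentioned in the experiments.

```python
import math
import random

M = 13                # fractional bits, mimicking a q2.13-style encoding (assumption)
STEP = 2.0 ** -M      # grid resolution 2^-m

def encode_fixed(value):
    """Store value as an integer number of grid steps, with unbiased randomized rounding."""
    x = value / STEP
    lo = math.floor(x)
    return int(lo) + (1 if random.random() < x - lo else 0)

def decode_fixed(q):
    """Convert back to float: one integer-float multiplication by 2^-m."""
    return q * STEP
```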
As γ → 0, this approaches the regret bound of the (unrounded) float version.
15. Morris Algorithm
•Turn the frequency counter into a random variable
•Each time a feature appears, perform the following operation:
•increment C by 1 with probability p(C) = b^-C
•Return τ̃(C) = (b^C − b)/(b − 1) as the frequency
•τ̃(C) is an unbiased estimator of the true frequency count
[…] learning rate that decreases over time, e.g., setting η_t proportional to 1/√t. Per-coordinate learning rates require storing a unique count τ_i for each coordinate, where τ_i is the number of times coordinate i has appeared with a non-zero gradient so far. Significant space is saved by using an 8-bit randomized counting scheme rather than a 32-bit (or 64-bit) integer to store the d total counts. We use a variant of Morris' probabilistic counting algorithm (1978) analyzed by Flajolet (1985). Specifically, we initialize a counter C = 1, and on each increment operation, we increment C with probability p(C) = b^-C, where base b is a parameter. We estimate the count as τ̃(C) = (b^C − b)/(b − 1), which is an unbiased estimator of the true count. We then use learning rates η_{t,i} = α/√(τ̃_{t,i} + 1), which ensures that even when τ̃_{t,i} = 0 we don't divide by zero. We compute high-probability bounds on this counter in Lemma A.1. Using these bounds for η_{t,i} in conjunction with Theorem 3.1, we obtain the following result (proof deferred to the appendix).
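A minimal Python sketch of this counter (my own code; the class name and the default base are illustrative choices, not from the paper):

```python
import random

class MorrisCounter:
    """Morris (1978) / Flajolet (1985) style probabilistic counter.

    Only the small integer C is stored, so it fits in 8 bits in practice."""

    def __init__(self, base=1.08):   # base b is a tunable parameter
        self.b = base
        self.C = 1                   # initialized to 1, as described above

    def increment(self):
        # increment C with probability p(C) = b**(-C)
        if random.random() < self.b ** (-self.C):
            self.C += 1

    def estimate(self):
        # unbiased estimate of the true count: (b**C - b) / (b - 1)
        return (self.b ** self.C - self.b) / (self.b - 1)
```

The per-coordinate learning rate from the text is then alpha / math.sqrt(counter.estimate() + 1).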
16. Per-coordinate version of the algorithm
Algorithm 2 OGD-Rand
input: feasible set F = [−R, R]^d, parameters α, γ > 0
Initialize β̂_1 = 0 ∈ R^d; ∀i, τ_i = 0
for t = 1, ..., T do
  Play the point β̂_t, observe loss function f_t
  for i = 1, ..., d do
    let g_{t,i} = ∇f_t(x_t)_i
    if g_{t,i} = 0 then continue
    τ_i ← τ_i + 1
    let η_{t,i} = α/√τ_i and ε_{t,i} = γη_{t,i}
    β_{t+1,i} ← Project(β̂_{t,i} − η_{t,i} g_{t,i})
    β̂_{t+1,i} ← RandomRound(β_{t+1,i}, ε_{t,i})
Frequency counting: Morris' algorithm lets these counts be stored in 8 bits.
The step size is then set from this frequency information. (A combined sketch follows below.)
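Putting the pieces together, here is a rough Python sketch of one round of the per-coordinate algorithm, reusing random_round and MorrisCounter from the earlier sketches. Note that the paper's Algorithm 2 keeps exact counts τ_i; using the approximate counter instead follows the 8-bit scheme of the previous slide.

```python
import math

def ogd_rand_round(beta_hat, grads, counters, R, alpha, gamma):
    """One round t of per-coordinate OGD with randomized rounding (sketch).

    beta_hat : list of current (rounded) coefficients, one per coordinate
    grads    : gradient of the round-t loss, one entry per coordinate
    counters : one MorrisCounter per coordinate (approximate tau_i)
    """
    for i, g in enumerate(grads):
        if g == 0.0:
            continue                                             # coordinate did not appear
        counters[i].increment()
        eta = alpha / math.sqrt(counters[i].estimate() + 1.0)    # per-coordinate step size
        eps = gamma * eta                                        # per-coordinate resolution
        beta = max(-R, min(beta_hat[i] - eta * g, R))            # gradient step, then Project
        beta_hat[i] = random_round(beta, eps)                    # store only the coarse value
    return beta_hat
```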
17. Further approximation is possible at prediction time
• At prediction time, the bit width can be cut quite aggressively as long as the effect on predictions stays small (see the note below)
• Lemmas 4.1 and 4.2 and Theorem 4.3: an analysis of the error that can arise, as a function of the degree of approximation, for the logistic loss
• Using compression on top of this, memory can be reduced down to the information-theoretic lower bound
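Why aggressive rounding is safe at prediction time, informally (my own note; the precise statements are the paper's Lemmas 4.1, 4.2 and Theorem 4.3): rounding each coefficient independently and without bias makes the predicted log-odds for a binary feature vector unbiased, and its error concentrates because it is a sum of independent terms, each confined to an interval of length ε:

$$\mathbb{E}[\hat\beta_i] = \beta_i \;\Longrightarrow\; \mathbb{E}[\hat\beta \cdot x] = \beta \cdot x \quad \text{for } x \in \{0,1\}^d,$$
$$\Pr\bigl[\,|\hat\beta \cdot x - \beta \cdot x| \ge \delta\,\bigr] \;\le\; 2\exp\!\left(-\frac{2\delta^2}{\epsilon^2 \lVert x\rVert_0}\right) \quad \text{(Hoeffding)}.$$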
However, that lower bound is not all that small.
20. CTR Dataset
A non-public search-ads click-log dataset
       Data   Features
CTR    30M    20M
[Table 2 from the paper] Rounding at Training Time. The fixed q2.13 encoding is 50% smaller than control with no loss. Per-coordinate learning rates significantly improve predictions but use 64 bits per value. Randomized counting reduces this to […]; adaptive or fixed precision reduces memory use further, to 24 total bits per value or less. The benefit of adaptive precision is seen more on the larger CTR data.
       Data       Features
CTR    Billions   Billions

The authors report essentially the same results on data at this (billion-scale) size.
21. Approximation quality of the prediction model
Table 1. Rounding at Prediction Time for CTR Data. Fixed-point encodings are compared to a 32-bit floating point control model. Added loss is negligible even when using only 1.5 bits per value with optimal encoding.

Encoding   AucLoss   Opt. Bits/Val
q2.3       +5.72%    0.1
q2.5       +0.44%    0.5
q2.7       +0.03%    1.5
q2.9       +0.00%    3.3
[…] values may be encoded with fewer bits. The theoretical bound for a whole model with d coefficients is (1/d) Σ_{i=1}^d −log₂ p(β_i) bits per value, where p(v) is the probability of occurrence of v across all dimensions d.
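A small Python sketch of this bound (my own code): given the already-quantized coefficients, it averages −log₂ of each value's empirical frequency, i.e. the bits per value an ideal entropy coder could reach (the "Opt. Bits/Val" column above).

```python
import math
from collections import Counter

def optimal_bits_per_value(coefficients):
    """Entropy bound: mean of -log2 p(v) over all d stored values, where p(v)
    is the empirical frequency of value v across the d coefficients."""
    d = len(coefficients)
    counts = Counter(coefficients)
    return sum(-math.log2(counts[v] / d) for v in coefficients) / d
```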
The size per value when shrunk down to the information-theoretic lower bound (the Opt. Bits/Val column).