3. The paper we read
•Large-Scale Learning with Less RAM via Randomization (ICML 2013)
•D. Golovin, D. Sculley, H. B. McMahan, M. Young (Google)
•http://arxiv.org/abs/1303.4664 (paper)
•First presented at a NIPS 2012 workshop
•Figures and tables are quoted from the paper
7. In practice, float is overkill
[Figure 1 from the paper] Histogram of coefficients in a typical large-scale linear model trained from real data. Values are tightly grouped near zero; a large dynamic range is superfluous.
Do we really need to keep 32 bits of precision per value?
Shown: a histogram of the weight-vector values of a linear classifier; the counts are numbers of distinct features.
11. Algorithm
Algorithm 1 OGD-Rand-1d
input: feasible set F = [−R, R], learning rate schedule η_t, resolution schedule ε_t
define fun Project(β) = max(−R, min(β, R))
Initialize β̂_1 = 0
for t = 1, ..., T do
  Play the point β̂_t, observe g_t
  β_{t+1} = Project(β̂_t − η_t g_t)
  β̂_{t+1} ← RandomRound(β_{t+1}, ε_t)

function RandomRound(β, ε)
  a ← ε⌊β/ε⌋ ; b ← ε⌈β/ε⌉
  return b with prob. (β − a)/ε, a otherwise
The plain SGD update.
RandomRound then drops the stored value down to a Qn.m fixed-point bit width. (A Python sketch follows below.)
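To make the update concrete, here is a minimal Python sketch of OGD-Rand-1d (my own reimplementation; the function and variable names are mine, not the paper's).

```python
import math
import random

def random_round(beta, eps):
    """Unbiased randomized rounding of beta onto a grid of resolution eps."""
    a = eps * math.floor(beta / eps)   # nearest gridpoint below
    b = eps * math.ceil(beta / eps)    # nearest gridpoint above
    if a == b:                         # beta already lies on the grid
        return beta
    return b if random.random() < (beta - a) / eps else a

def ogd_rand_1d(gradients, R, eta, eps):
    """One-dimensional OGD with randomized rounding (sketch of Algorithm 1).

    gradients : iterable of observed gradients g_t
    R         : the feasible set is [-R, R]
    eta, eps  : learning-rate and resolution schedules, called as eta(t), eps(t)
    """
    beta_hat = 0.0
    for t, g in enumerate(gradients, start=1):
        beta = max(-R, min(beta_hat - eta(t) * g, R))   # gradient step, then Project
        beta_hat = random_round(beta, eps(t))           # keep only the coarse value
    return beta_hat
```

For example, eta = lambda t: 0.5 / math.sqrt(t) and eps = lambda t: 0.1 * eta(t) realize an ε_t ≤ γη_t schedule with γ = 0.1.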
13. How to choose the gap
•From the SGD step size η_t > 0 and an arbitrary constant γ > 0,
•set the bit length so that ε_t ≤ γη_t is satisfied
•The regret is then bounded above by O(√T)
[…] grid of resolution ε_t on round t, and an adaptive learning rate η_t. We then run one copy of this algorithm for each coordinate of the original convex problem, implying that we can choose the η_t and ε_t schedules appropriately for each coordinate. For simplicity, we assume the ε_t resolutions are chosen so that −R and +R are always gridpoints. Algorithm 1 gives the one-dimensional version, which is run independently on each coordinate (with a different learning rate and discretization schedule) in Algorithm 2. The core result is a regret bound for Algorithm 1 (omitted proofs can be found in the Appendix):

Theorem 3.1. Consider running Algorithm 1 with adaptive non-increasing learning-rate schedule η_t, and discretization schedule ε_t such that ε_t ≤ γη_t for a constant γ > 0. Then, against any sequence of gradients g_1, ..., g_T (possibly selected by an adaptive adversary) with |g_t| ≤ G, against any comparator point β* ∈ [−R, R], we have

E[Regret(β*)] ≤ (2R)² / (2η_T) + ½ (G² + γ²) η_{1:T} + γR√T.

By choosing γ sufficiently small, we obtain an expected regret bound that is indistinguishable from the non-rounded version (which is obtained by taking γ = 0).
Thm. 3.1 (where η_{1:T} denotes Σ_{t=1}^T η_t).
Converting back to a floating point representation requires a single integer-float multiplication (by 2^-m). Randomized rounding requires a call to a pseudo-random number generator, which may be done in 18-20 flops. Overall, the added CPU overhead is negligible, especially as many large-scale learning methods are I/O bound reading from disk or network.
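As a small illustration of this point (my own sketch, not code from the paper), a coefficient stored as an integer count of 2^-m-sized steps decodes with a single multiply; m = 13 below is only meant to mimic the q2.13 encoding mentioned in the experiments.

```python
import math
import random

M = 13                # fractional bits, mimicking a q2.13-style encoding (assumption)
STEP = 2.0 ** -M      # grid resolution 2^-m

def encode_fixed(value):
    """Store value as an integer number of grid steps, with unbiased randomized rounding."""
    x = value / STEP
    lo = math.floor(x)
    return int(lo) + (1 if random.random() < x - lo else 0)

def decode_fixed(q):
    """Convert back to float: one integer-float multiplication by 2^-m."""
    return q * STEP
```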
As γ → 0, this approaches the regret bound of the (unrounded) float version.
15. Morris Algorithm
•Turn the frequency counter into a random variable
•Each time a feature appears, perform the following operation:
•increment C by 1 with probability p(C) = b^-C
•Return τ̃(C) = (b^C − b)/(b − 1) as the frequency
•τ̃(C) is an unbiased estimator of the true frequency count
[…] learning rate that decreases over time, e.g., setting η_t proportional to 1/√t. Per-coordinate learning rates require storing a unique count τ_i for each coordinate, where τ_i is the number of times coordinate i has appeared with a non-zero gradient so far. Significant space is saved by using an 8-bit randomized counting scheme rather than a 32-bit (or 64-bit) integer to store the d total counts. We use a variant of Morris' probabilistic counting algorithm (1978) analyzed by Flajolet (1985). Specifically, we initialize a counter C = 1, and on each increment operation, we increment C with probability p(C) = b^-C, where base b is a parameter. We estimate the count as τ̃(C) = (b^C − b)/(b − 1), which is an unbiased estimator of the true count. We then use learning rates η_{t,i} = α/√(τ̃_{t,i} + 1), which ensures that even when τ̃_{t,i} = 0 we don't divide by zero. We compute high-probability bounds on this counter in Lemma A.1. Using these bounds for η_{t,i} in conjunction with Theorem 3.1, we obtain the following result (proof deferred to the appendix).
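A minimal Python sketch of this counter (my own code; the class name and the default base are illustrative choices, not from the paper):

```python
import random

class MorrisCounter:
    """Morris (1978) / Flajolet (1985) style probabilistic counter.

    Only the small integer C is stored, so it fits in 8 bits in practice."""

    def __init__(self, base=1.08):   # base b is a tunable parameter
        self.b = base
        self.C = 1                   # initialized to 1, as described above

    def increment(self):
        # increment C with probability p(C) = b**(-C)
        if random.random() < self.b ** (-self.C):
            self.C += 1

    def estimate(self):
        # unbiased estimate of the true count: (b**C - b) / (b - 1)
        return (self.b ** self.C - self.b) / (self.b - 1)
```

The per-coordinate learning rate from the text is then alpha / math.sqrt(counter.estimate() + 1).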
16. Per-coordinate version of the algorithm
Algorithm 2 OGD-Rand
input: feasible set F = [−R, R]^d, parameters α, γ > 0
Initialize β̂_1 = 0 ∈ R^d; ∀i, τ_i = 0
for t = 1, ..., T do
  Play the point β̂_t, observe loss function f_t
  for i = 1, ..., d do
    let g_{t,i} = ∇f_t(x_t)_i
    if g_{t,i} = 0 then continue
    τ_i ← τ_i + 1
    let η_{t,i} = α/√τ_i and ε_{t,i} = γη_{t,i}
    β_{t+1,i} ← Project(β̂_{t,i} − η_{t,i} g_{t,i})
    β̂_{t+1,i} ← RandomRound(β_{t+1,i}, ε_{t,i})
Frequency counting: Morris' algorithm lets these counts be stored in 8 bits.
The step size is then set from this frequency information. (A combined sketch follows below.)
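Putting the pieces together, here is a rough Python sketch of one round of the per-coordinate algorithm, reusing random_round and MorrisCounter from the earlier sketches. Note that the paper's Algorithm 2 keeps exact counts τ_i; using the approximate counter instead follows the 8-bit scheme of the previous slide.

```python
import math

def ogd_rand_round(beta_hat, grads, counters, R, alpha, gamma):
    """One round t of per-coordinate OGD with randomized rounding (sketch).

    beta_hat : list of current (rounded) coefficients, one per coordinate
    grads    : gradient of the round-t loss, one entry per coordinate
    counters : one MorrisCounter per coordinate (approximate tau_i)
    """
    for i, g in enumerate(grads):
        if g == 0.0:
            continue                                             # coordinate did not appear
        counters[i].increment()
        eta = alpha / math.sqrt(counters[i].estimate() + 1.0)    # per-coordinate step size
        eps = gamma * eta                                        # per-coordinate resolution
        beta = max(-R, min(beta_hat[i] - eta * g, R))            # gradient step, then Project
        beta_hat[i] = random_round(beta, eps)                    # store only the coarse value
    return beta_hat
```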
17. Further approximation is possible at prediction time
• At prediction time, the bit width can be cut quite aggressively as long as the effect on predictions stays small (see the note below)
• Lemmas 4.1 and 4.2 and Theorem 4.3: an analysis of the error that can arise, as a function of the degree of approximation, for the logistic loss
• Using compression on top of this, memory can be reduced down to the information-theoretic lower bound
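Why aggressive rounding is safe at prediction time, informally (my own note; the precise statements are the paper's Lemmas 4.1, 4.2 and Theorem 4.3): rounding each coefficient independently and without bias makes the predicted log-odds for a binary feature vector unbiased, and its error concentrates because it is a sum of independent terms, each confined to an interval of length ε:

$$\mathbb{E}[\hat\beta_i] = \beta_i \;\Longrightarrow\; \mathbb{E}[\hat\beta \cdot x] = \beta \cdot x \quad \text{for } x \in \{0,1\}^d,$$
$$\Pr\bigl[\,|\hat\beta \cdot x - \beta \cdot x| \ge \delta\,\bigr] \;\le\; 2\exp\!\left(-\frac{2\delta^2}{\epsilon^2 \lVert x\rVert_0}\right) \quad \text{(Hoeffding)}.$$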
However, that lower bound is not all that small.
20. CTR Dataset
A non-public search-ads click-log dataset
       Data   Features
CTR    30M    20M
[Table 2 from the paper] Rounding at Training Time. The fixed q2.13 encoding is 50% smaller than control with no loss. Per-coordinate learning rates significantly improve predictions but use 64 bits per value. Randomized counting reduces this to […]; adaptive or fixed precision reduces memory use further, to 24 total bits per value or less. The benefit of adaptive precision is seen more on the larger CTR data.
       Data       Features
CTR    Billions   Billions

The authors report essentially the same results on data at this (billion-scale) size.
21. Approximation quality of the prediction model
Table 1. Rounding at Prediction Time for CTR Data. Fixed-point encodings are compared to a 32-bit floating point control model. Added loss is negligible even when using only 1.5 bits per value with optimal encoding.

Encoding   AucLoss   Opt. Bits/Val
q2.3       +5.72%    0.1
q2.5       +0.44%    0.5
q2.7       +0.03%    1.5
q2.9       +0.00%    3.3
[…] values may be encoded with fewer bits. The theoretical bound for a whole model with d coefficients is (1/d) Σ_{i=1}^d −log₂ p(β_i) bits per value, where p(v) is the probability of occurrence of v across all dimensions d.
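A small Python sketch of this bound (my own code): given the already-quantized coefficients, it averages −log₂ of each value's empirical frequency, i.e. the bits per value an ideal entropy coder could reach (the "Opt. Bits/Val" column above).

```python
import math
from collections import Counter

def optimal_bits_per_value(coefficients):
    """Entropy bound: mean of -log2 p(v) over all d stored values, where p(v)
    is the empirical frequency of value v across the d coefficients."""
    d = len(coefficients)
    counts = Counter(coefficients)
    return sum(-math.log2(counts[v] / d) for v in coefficients) / d
```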
The size per value when shrunk down to the information-theoretic lower bound (the Opt. Bits/Val column).