1. The document discusses variational methods for interpreting and explaining machine learning models.
2. It describes replacing point estimates in a model with samples from a distribution, and regularizing that distribution instead of the point estimate.
3. Variational word embeddings are proposed: each word is represented as a distribution rather than a point, learning a mean and a variance per word and regularizing the distribution toward a prior.
7–20. Model: Doc vector
“ITEM_92 think fabric wonderful rayon spandex like lace embroidery accents”
Co-occurrence modeling
[Figure: the document vector c is paired with each word w in the document in turn; every (c, w) pair increments X[c, w] += 1, and after a full pass over the text, X[c, w] = count.]
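A minimal sketch of this counting loop (the whitespace tokenizer and dictionary-of-counts layout are assumptions for illustration, not the talk's actual pipeline):

from collections import defaultdict

def cooccurrence_counts(docs):
    # docs: list of (doc_id, text) pairs
    X = defaultdict(int)
    for c, text in docs:
        for w in text.split():   # hypothetical whitespace tokenizer
            X[c, w] += 1         # the update from the slides
    return X

X = cooccurrence_counts([("ITEM_92",
    "think fabric wonderful rayon spandex like lace embroidery accents")])
# X[("ITEM_92", "fabric")] == 1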
49–50. Variational Methods
Practical reasons to go variational:
1. Alternative regularization.
2. Measure what your model doesn't know.
3. Help explain your data.
4. Short & fits in a tweet!
52. Variational Word Vectors
log(X[c, w]) = r[c] + r[w] + c · w
Let's make this variational:
1. Replace point estimates with samples from a distribution.
2. Instead of regularizing a point, regularize the distribution.
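Before going variational, a minimal PyTorch sketch of the point-estimate model above (the bias embedding r_bias and the squared-error loss are assumptions for illustration, not necessarily the talk's exact objective; n_words, n_dim, c_index, w_index, log_count as elsewhere in the deck):

import torch
import torch.nn as nn

embeddings = nn.Embedding(n_words, n_dim)  # one point vector per word/doc
r_bias = nn.Embedding(n_words, 1)          # the r[c] and r[w] bias terms

def predict_log_count(c_index, w_index):
    c, w = embeddings(c_index), embeddings(w_index)
    # log(X[c, w]) ≈ r[c] + r[w] + c · w
    return (r_bias(c_index) + r_bias(w_index)).squeeze(-1) + (c * w).sum(-1)

loss = (predict_log_count(c_index, w_index) - log_count).pow(2).mean()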
53. Replace point estimates with samples from a distribution.
Without variational (#1):
embeddings = nn.Embedding(n_words, n_dim)
...
c_vector = embeddings(c_index)
54. Replace point estimates with samples from a distribution.
With variational (#1):
embeddings_mu = nn.Embedding(n_words, n_dim)   # mean
embeddings_lv = nn.Embedding(n_words, n_dim)   # log-variance
...
vector_mu = embeddings_mu(c_index)
vector_lv = embeddings_lv(c_index)
c_vector = normal_sample(vector_mu, vector_lv)
55–59. Replace point estimates with samples from a distribution.
With variational (#1):
embeddings_mu = nn.Embedding(n_words, n_dim)
embeddings_lv = nn.Embedding(n_words, n_dim)
...
vector_mu = embeddings_mu(c_index)
vector_lv = embeddings_lv(c_index)
c_vector = normal_sample(vector_mu, vector_lv)
[Figure: each forward pass draws a fresh sample, e.g. c_vector ≈ (+0.32, +0.49, −0.21, +0.03, …).]
60. Replace point estimates with samples from a distribution.
With variational (#1):
embeddings_mu = nn.Embedding(n_words, n_dim)
embeddings_lv = nn.Embedding(n_words, n_dim)
...
vector_mu = embeddings_mu(c_index)
vector_lv = embeddings_lv(c_index)
c_vector = normal_sample(vector_mu, vector_lv)

def normal_sample(mu, lv):
    # lv is a log-variance, so exp(0.5 * lv) = sqrt(exp(lv)) is the std dev
    std = exp(0.5 * lv)
    return mu + N(0, 1) * std
61. Replace regularizing a point with regularizing the distribution.
Without variational (#2):
loss += c_vector.pow(2.0).sum()
62–65. Replace regularizing a point with regularizing the distribution.
With variational (#2):
loss += kl_divergence(vector_mu, vector_lv)
[Figure: the KL term pulls each word's distribution toward the prior N(μ, σ).]
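The slide doesn't expand kl_divergence; a minimal sketch, assuming a standard-normal prior N(0, 1) and using the closed form for the KL between two Gaussians:

def kl_divergence(mu, lv):
    # KL( N(mu, exp(lv)) || N(0, 1) ), summed over embedding dimensions
    return -0.5 * (1 + lv - mu.pow(2) - lv.exp()).sum()

loss += kl_divergence(vector_mu, vector_lv)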
66–68. Replace point estimates with samples from a distribution.
With variational:
embeddings_mu = nn.Embedding(n_words, n_dim)
embeddings_lv = nn.Embedding(n_words, n_dim)
...
vector_mu = embeddings_mu(c_index)
vector_lv = embeddings_lv(c_index)

def normal(mu, lv):
    # reparameterization: draw eps ~ N(0, 1), scale by std = exp(0.5 * lv)
    random = torch.randn(mu.size())
    return mu + random * torch.exp(0.5 * lv)

c_vector = normal(vector_mu, vector_lv)
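Note on the exp(0.5 * lv) factor: storing the log-variance lv lets the embedding take any real value while the implied variance exp(lv) stays positive, and exp(0.5 * lv) = sqrt(exp(lv)) is exactly the standard deviation needed for the reparameterized sample.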
75. Linear Regression (with 2nd-order interactions)
Sums over all pairs of features (known and observed), with one coefficient for each feature (unknown, to be estimated).
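The slide's equation was lost in extraction; a plausible reconstruction in standard notation (an assumption, not the slide's exact formula), alongside the factorized form that the variational factorization machines on the next slide build on:

y(x) = w0 + Σ_i w_i · x_i + Σ_i Σ_{j>i} w_ij · x_i x_j        (full pairwise coefficients)
y(x) = w0 + Σ_i w_i · x_i + Σ_i Σ_{j>i} ⟨v_i, v_j⟩ · x_i x_j  (factorized: w_ij ≈ ⟨v_i, v_j⟩)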
80. Regression with Variational Factorized Interactions (Variational Factorization Machines)
Can write out the uncertainty of a prediction!
https://github.com/cemoody/vfm
97–98. SNE
q_{j|i} = exp(−||y_i − y_j||²) / Σ_{k≠i} exp(−||y_i − y_k||²)
Using a Gaussian to convert distances into probabilities…
Bad for outliers!
(As we match high-D with low-D, we get lots of outliers.)
99. t-SNE
q_{ij} = (1 + ||y_i − y_j||²)⁻¹ / Σ_{k≠l} (1 + ||y_k − y_l||²)⁻¹
Using a Gaussian to convert distances into probabilities is bad for outliers!
(As we match high-D with low-D, we get lots of outliers.)
Use Student's t-distribution (a heavy-tailed distribution) instead: t-SNE.
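A small numpy sketch of the two kernels (the function names are mine; this illustrates the heavy-tail point, it is not the talk's code):

import numpy as np

def gaussian_q(d2):
    # Gaussian kernel: similarity dies off exponentially with squared distance
    w = np.exp(-d2)
    return w / w.sum()

def student_t_q(d2):
    # Student-t kernel (1 degree of freedom): heavy tails keep
    # moderately-far points from getting vanishingly small probability
    w = 1.0 / (1.0 + d2)
    return w / w.sum()

d2 = np.array([0.1, 1.0, 25.0])   # squared distances, one outlier
print(gaussian_q(d2))   # outlier's probability underflows toward 0
print(student_t_q(d2))  # outlier keeps non-negligible probability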
121. symbolic variable
x = t.vector('x')
y = t.vector('y')
loss = x + y
In [47]: loss
Out[47]: theano.tensor.var.TensorVariable

symbolic + numeric variable
x = Variable(np.ones(10))
y = Variable(np.ones(10))
loss = x + y
In [47]: loss.data
Out[47]: array([ 2., 2., 2., 2., 2., 2.])
125–126. …and then something goes wrong.
…chainer computes everything at run time… so debug & investigate!
In [47]: z.data
Out[47]: array([ 2., 2., 2., nan, 2., 2.])
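Because a chainer Variable holds a plain numpy array in .data (on CPU), you can hunt the nan down directly; a minimal sketch (the name z is from the slide, the rest is an assumed debugging pattern):

import numpy as np

bad = np.isnan(z.data)   # boolean mask of nan entries
print(bad.nonzero())     # indices of the offending elements
print(z.data[~bad])      # the values that are still finite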
127. @chrisemoody
Multithreaded, Stitch Fix
1. Use SVD instead of w2v.
2. Use t-SNE for interpreting your model.
3. Use k-SVD for interpreting your model.
4. Add sparsity to your models (e.g., as in lda2vec).