1. The document discusses variational methods for interpreting and explaining machine learning models.
2. It describes replacing point estimates in a model with samples from a distribution, and regularizing that distribution instead of the point estimate.
3. Variational word embeddings are proposed: each word is represented as a distribution rather than a point, learning a mean and a variance per word and regularizing the distribution toward a prior.
7–20. Model: Doc vector
“ITEM_92 think fabric wonderful rayon spandex like lace embroidery accents”
Co-occurrence modeling
[Figure: the document vector c is paired with each word w in the document in turn; every (c, w) pair increments X[c, w] += 1, and after a full pass over the text, X[c, w] = count.]
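A minimal sketch of this counting loop (the whitespace tokenizer and dictionary-of-counts layout are assumptions for illustration, not the talk's actual pipeline):

from collections import defaultdict

def cooccurrence_counts(docs):
    # docs: list of (doc_id, text) pairs
    X = defaultdict(int)
    for c, text in docs:
        for w in text.split():   # hypothetical whitespace tokenizer
            X[c, w] += 1         # the update from the slides
    return X

X = cooccurrence_counts([("ITEM_92",
    "think fabric wonderful rayon spandex like lace embroidery accents")])
# X[("ITEM_92", "fabric")] == 1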
49–50. Variational Methods
Practical reasons to go variational:
1. Alternative regularization.
2. Measure what your model doesn't know.
3. Help explain your data.
4. Short & fits in a tweet!
52. Variational Word Vectors
log(X[c, w]) = r[c] + r[w] + c · w
Let's make this variational:
1. Replace point estimates with samples from a distribution.
2. Instead of regularizing a point, regularize the distribution.
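Before going variational, a minimal PyTorch sketch of the point-estimate model above (the bias embedding r_bias and the squared-error loss are assumptions for illustration, not necessarily the talk's exact objective; n_words, n_dim, c_index, w_index, log_count as elsewhere in the deck):

import torch
import torch.nn as nn

embeddings = nn.Embedding(n_words, n_dim)  # one point vector per word/doc
r_bias = nn.Embedding(n_words, 1)          # the r[c] and r[w] bias terms

def predict_log_count(c_index, w_index):
    c, w = embeddings(c_index), embeddings(w_index)
    # log(X[c, w]) ≈ r[c] + r[w] + c · w
    return (r_bias(c_index) + r_bias(w_index)).squeeze(-1) + (c * w).sum(-1)

loss = (predict_log_count(c_index, w_index) - log_count).pow(2).mean()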
53. Replace point estimates with samples from a distribution.
Without variational (#1):
embeddings = nn.Embedding(n_words, n_dim)
...
c_vector = embeddings(c_index)
54. Replace point estimates with samples from a distribution.
With variational (#1):
embeddings_mu = nn.Embedding(n_words, n_dim)   # mean
embeddings_lv = nn.Embedding(n_words, n_dim)   # log-variance
...
vector_mu = embeddings_mu(c_index)
vector_lv = embeddings_lv(c_index)
c_vector = normal_sample(vector_mu, vector_lv)
55–59. Replace point estimates with samples from a distribution.
With variational (#1):
embeddings_mu = nn.Embedding(n_words, n_dim)
embeddings_lv = nn.Embedding(n_words, n_dim)
...
vector_mu = embeddings_mu(c_index)
vector_lv = embeddings_lv(c_index)
c_vector = normal_sample(vector_mu, vector_lv)
[Figure: each forward pass draws a fresh sample, e.g. c_vector ≈ (+0.32, +0.49, −0.21, +0.03, …).]
60. Replace point estimates with samples from a distribution.
With variational (#1):
embeddings_mu = nn.Embedding(n_words, n_dim)
embeddings_lv = nn.Embedding(n_words, n_dim)
...
vector_mu = embeddings_mu(c_index)
vector_lv = embeddings_lv(c_index)
c_vector = normal_sample(vector_mu, vector_lv)

def normal_sample(mu, lv):
    # lv is a log-variance, so exp(0.5 * lv) = sqrt(exp(lv)) is the std dev
    std = exp(0.5 * lv)
    return mu + N(0, 1) * std
61. Replace regularizing a point with regularizing the distribution.
Without variational (#2):
loss += c_vector.pow(2.0).sum()
62–65. Replace regularizing a point with regularizing the distribution.
With variational (#2):
loss += kl_divergence(vector_mu, vector_lv)
[Figure: the KL term pulls each word's distribution toward the prior N(μ, σ).]
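The slide doesn't expand kl_divergence; a minimal sketch, assuming a standard-normal prior N(0, 1) and using the closed form for the KL between two Gaussians:

def kl_divergence(mu, lv):
    # KL( N(mu, exp(lv)) || N(0, 1) ), summed over embedding dimensions
    return -0.5 * (1 + lv - mu.pow(2) - lv.exp()).sum()

loss += kl_divergence(vector_mu, vector_lv)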
66–68. Replace point estimates with samples from a distribution.
With variational:
embeddings_mu = nn.Embedding(n_words, n_dim)
embeddings_lv = nn.Embedding(n_words, n_dim)
...
vector_mu = embeddings_mu(c_index)
vector_lv = embeddings_lv(c_index)

def normal(mu, lv):
    # reparameterization: draw eps ~ N(0, 1), scale by std = exp(0.5 * lv)
    random = torch.randn(mu.size())
    return mu + random * torch.exp(0.5 * lv)

c_vector = normal(vector_mu, vector_lv)
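Note on the exp(0.5 * lv) factor: storing the log-variance lv lets the embedding take any real value while the implied variance exp(lv) stays positive, and exp(0.5 * lv) = sqrt(exp(lv)) is exactly the standard deviation needed for the reparameterized sample.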
75. Linear Regression (with 2nd-order interactions)
Sums over all pairs of features (known and observed), with one coefficient for each feature (unknown, to be estimated).
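The slide's equation was lost in extraction; a plausible reconstruction in standard notation (an assumption, not the slide's exact formula), alongside the factorized form that the variational factorization machines on the next slide build on:

y(x) = w0 + Σ_i w_i · x_i + Σ_i Σ_{j>i} w_ij · x_i x_j        (full pairwise coefficients)
y(x) = w0 + Σ_i w_i · x_i + Σ_i Σ_{j>i} ⟨v_i, v_j⟩ · x_i x_j  (factorized: w_ij ≈ ⟨v_i, v_j⟩)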
80. Regression with Variational Factorized Interactions (Variational Factorization Machines)
Can write out the uncertainty of a prediction!
https://github.com/cemoody/vfm
97–98. SNE
q_{j|i} = exp(−||y_i − y_j||²) / Σ_{k≠i} exp(−||y_i − y_k||²)
Using a Gaussian to convert distances into probabilities…
Bad for outliers!
(As we match high-D with low-D, we get lots of outliers.)
99. t-SNE
q_{ij} = (1 + ||y_i − y_j||²)⁻¹ / Σ_{k≠l} (1 + ||y_k − y_l||²)⁻¹
Using a Gaussian to convert distances into probabilities is bad for outliers!
(As we match high-D with low-D, we get lots of outliers.)
Use Student's t-distribution (a heavy-tailed distribution) instead: t-SNE.
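A small numpy sketch of the two kernels (the function names are mine; this illustrates the heavy-tail point, it is not the talk's code):

import numpy as np

def gaussian_q(d2):
    # Gaussian kernel: similarity dies off exponentially with squared distance
    w = np.exp(-d2)
    return w / w.sum()

def student_t_q(d2):
    # Student-t kernel (1 degree of freedom): heavy tails keep
    # moderately-far points from getting vanishingly small probability
    w = 1.0 / (1.0 + d2)
    return w / w.sum()

d2 = np.array([0.1, 1.0, 25.0])   # squared distances, one outlier
print(gaussian_q(d2))   # outlier's probability underflows toward 0
print(student_t_q(d2))  # outlier keeps non-negligible probability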
121. symbolic variable
x = t.vector('x')
y = t.vector('y')
loss = x + y
In [47]: loss
Out[47]: theano.tensor.var.TensorVariable

symbolic + numeric variable
x = Variable(np.ones(10))
y = Variable(np.ones(10))
loss = x + y
In [47]: loss.data
Out[47]: array([ 2., 2., 2., 2., 2., 2.])
125–126. …and then something goes wrong.
…chainer computes everything at run time… so debug & investigate!
In [47]: z.data
Out[47]: array([ 2., 2., 2., nan, 2., 2.])
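Because a chainer Variable holds a plain numpy array in .data (on CPU), you can hunt the nan down directly; a minimal sketch (the name z is from the slide, the rest is an assumed debugging pattern):

import numpy as np

bad = np.isnan(z.data)   # boolean mask of nan entries
print(bad.nonzero())     # indices of the offending elements
print(z.data[~bad])      # the values that are still finite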
127. @chrisemoody
Multithreaded, Stitch Fix
1. Use SVD instead of w2v.
2. Use t-SNE for interpreting your model.
3. Use k-SVD for interpreting your model.
4. Add sparsity to your models (e.g., as in lda2vec).