Deep Learning for New User Interactions (Gestures, Speech and Emotions)

Olivia Klose, Software Development Engineer, Microsoft
Dr. Marcel Tilly, Program Manager, Microsoft

https://www.technologyreview.com/lists/technologies/2013/

Deep Neural Networks
… are inspired by the neural network in the brain
# of neurons in the brain (~100 billion)
≈ # of trees in the Amazon rainforest (~300 billion)
# of synapses (~100 to 1,000 trillion)
≈ # of leaves in the Amazon rainforest

https://www.youtube.com/watch?v=V1eYniJ0Rnk

Scale in Compute
Scale in Data
Better Algorithms
More Investment

[Chart: speech recognition word error rate (WER %) by year, 1993 to 2012, on a 0% to 100% axis. Annotation: improving domain knowledge]

[Chart: WER % by year, 1993 to 2012, on a 0% to 100% axis. Annotation: stuck]

[Chart: WER % by year, 1993 to 2012, on a 0% to 100% axis. Annotation: deep learning + big data + scalable tools]
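The charts track word error rate (WER), the usual speech recognition metric: the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. A minimal sketch of that calculation follows; the function name and the example sentences are illustrative assumptions, not taken from the talk's evaluation setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative example: one inserted word ("hum") and one substitution.
wer = word_error_rate("this is big", "this is hum pig")
print(f"WER = {wer:.1%}")  # WER = 66.7%
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.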

http://arxiv.org/abs/1609.03528
http://blogs.microsoft.com/next/2016/10/18/historic-achievement-microsoft-researchers-reach-human-parity-conversational-speech-recognition

Speech Recognition Breakthrough for the Spoken, Translated Word

Skype Translator
Components: Skype Translator Bots, Skype Service, Automatic Speech Recognition, Speech Correction, Translation, Text To Speech

Skype Translator: Skype Translator Bots
• Software “robots”
• Separate and manage audio streams

Skype Translator: Automatic Speech Recognition
• Machine Learning
• Deep Neural Network
• New language = new training
Recognized text: “this is hum pig”

Skype Translator: Speech Correction
• Punctuation
• Capitalization
• Disfluency removal
• Lattice Rescoring
“this is hum pig” → “this is hum pig.” → “This is hum pig.” → “This is pig.” → “This is big.”
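The four correction steps on this slide (punctuation, capitalization, disfluency removal, lattice rescoring) can be mimicked with a toy pipeline. This is only a sketch under loud assumptions: the filler-word list and the rescoring table with its scores are made up for illustration, whereas the production service presumably relies on trained models and a real recognition lattice.

```python
# Assumed filler words; a real system would learn these from data.
DISFLUENCIES = {"um", "uh", "hum", "er"}
# Assumed stand-in for lattice rescoring: for each confusable word,
# candidate replacements with made-up language-model scores.
ALTERNATIVES = {"pig": {"pig": 0.2, "big": 0.8}}
PUNCT = ".,?!"


def punctuate(text: str) -> str:
    """Append a full stop if the utterance has no final punctuation."""
    return text if text.endswith((".", "?", "!")) else text + "."


def capitalize(text: str) -> str:
    """Upper-case only the first character of the utterance."""
    return text[:1].upper() + text[1:]


def remove_disfluencies(text: str) -> str:
    """Drop filler words, ignoring case and attached punctuation."""
    kept = [w for w in text.split() if w.strip(PUNCT).lower() not in DISFLUENCIES]
    return " ".join(kept)


def rescore(text: str) -> str:
    """Swap each confusable word for its highest-scoring candidate."""
    out = []
    for word in text.split():
        core = word.rstrip(PUNCT)
        tail = word[len(core):]
        candidates = ALTERNATIVES.get(core.lower())
        if candidates:
            best = max(candidates, key=candidates.get)
            core = best.capitalize() if core[:1].isupper() else best
        out.append(core + tail)
    return " ".join(out)


def correct(text: str) -> list:
    """Return the text after each step, in the order shown on the slide."""
    stages = []
    for step in (punctuate, capitalize, remove_disfluencies, rescore):
        text = step(text)
        stages.append(text)
    return stages


for stage in correct("this is hum pig"):
    print(stage)
```

Run on the slide's example, the stages are "this is hum pig.", "This is hum pig.", "This is pig." and "This is big.". The toy version ignores edge cases such as a filler word at the start or end of the utterance.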


Skype Translator: Translation
• Microsoft Translator core API
• Statistical Machine Translation
• 45 supported languages
“This is big.” → “C’est grand.”

Skype Translator: Text To Speech
• Microsoft Translator TTS API
Spoken output: “C’est grand.”


[Figure: input depth image and inferred body parts, shown in front view, side view and top view (no tracking or smoothing)]
https://www.microsoft.com/en-us/research/video/real-time-human-pose-recognition-in-parts-from-single-depth-images-2/

https://www.microsoft.com/en-us/research/video/handpose-fully-articulated-hand-tracking/

[Figure: semantic segmentation examples with region labels: bicycle, road, building, cat, car, grass, water, cow]
https://www.microsoft.com/en-us/research/publication/semantic-segmentation-as-image-representation-for-scene-recognition/

ImageNet Classification top-5 error (%)

Entry                        Top-5 error (%)
ILSVRC 2010, NEC America     28.2
ILSVRC 2011, Xerox           25.8
ILSVRC 2012, AlexNet         16.4
ILSVRC 2013, Clarifai        11.7
ILSVRC 2014, VGG             7.3
ILSVRC 2014, GoogLeNet       6.7
Human Performance            5.1
ILSVRC 2015, ResNet          3.5

Microsoft researchers win ImageNet computer vision challenge
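Top-5 error counts an image as wrong only when the true class is not among the model's five highest-scoring classes. The sketch below shows the calculation on made-up scores; the function name and the toy data are assumptions for illustration and have nothing to do with the ILSVRC evaluation code.

```python
def top5_error(scores, labels):
    """Fraction of samples whose true label is not among the 5 top-scoring classes.

    scores: one list of per-class scores per image
    labels: the true class index for each image
    """
    misses = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:5]:
            misses += 1
    return misses / len(labels)


# Toy data: 4 images, 10 classes, class c always gets score c,
# so the top-5 classes are 9, 8, 7, 6 and 5 for every image.
scores = [[float(c) for c in range(10)] for _ in range(4)]
labels = [9, 5, 4, 0]  # two labels inside the top 5, two outside

error = top5_error(scores, labels)
print(f"top-5 error = {error:.1%}")  # top-5 error = 50.0%
```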

AlexNet, 8 layers (ILSVRC 2012)
11x11 conv, 96, /4, pool/2
5x5 conv, 256, pool/2
3x3 conv, 384
3x3 conv, 384
3x3 conv, 256, pool/2
fc, 4096
fc, 4096
fc, 1000
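VGG, 19 layers (ILSVRC 2014)
3x3 conv, 64
3x3 conv, 64, pool/2
3x3 conv, 128
3x3 conv, 128, pool/2
3x3 conv, 256
3x3 conv, 256
3x3 conv, 256
3x3 conv, 256, pool/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512, pool/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512, pool/2
fc, 4096
fc, 4096
fc, 1000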
GoogLeNet, 22 layers (ILSVRC 2014)
[Diagram: input, Conv 7x7 (stride 2), MaxPool 3x3 (stride 2), LocalRespNorm, Conv 1x1, Conv 3x3, LocalRespNorm, MaxPool 3x3 (stride 2), followed by nine Inception modules. Each module runs 1x1, 3x3 and 5x5 convolutions and a 3x3 max pooling branch in parallel and joins them with DepthConcat; MaxPool 3x3 (stride 2) layers follow the second and seventh modules. Two auxiliary classifiers (AveragePool 5x5, Conv 1x1, FC, FC, softmax0 and softmax1) branch off mid-network. The main path ends with AveragePool 7x7, FC and softmax2.]
ResNet, 152 layers (ILSVRC 2015)
7x7 conv, 64, /2, pool/2
3 blocks of: 1x1 conv, 64 | 3x3 conv, 64 | 1x1 conv, 256
8 blocks of: 1x1 conv, 128 | 3x3 conv, 128 | 1x1 conv, 512 (first 1x1 conv: /2)
36 blocks of: 1x1 conv, 256 | 3x3 conv, 256 | 1x1 conv, 1024 (first 1x1 conv: /2)
3 blocks of: 1x1 conv, 512 | 3x3 conv, 512 | 1x1 conv, 2048 (first 1x1 conv: /2)
ave pool, fc 1000
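The figure of 152 layers can be checked by counting the weighted layers in the ResNet listing: one initial 7x7 convolution, three convolutions in each bottleneck block, and the final fully connected layer. A small arithmetic sketch (variable names are illustrative):

```python
# Bottleneck blocks per stage, counted from the ResNet layer listing
# (output widths 256, 512, 1024 and 2048).
blocks_per_stage = [3, 8, 36, 3]
convs_per_block = 3  # 1x1 conv, 3x3 conv, 1x1 conv
stem_conv = 1        # 7x7 conv, 64, /2
final_fc = 1         # fc 1000

depth = stem_conv + convs_per_block * sum(blocks_per_stage) + final_fc
print(depth)  # 152
```

Pooling layers carry no weights and are not counted in the depth.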

Open-source, cross-platform toolkit for learning and evaluating
deep neural networks.
Expresses (nearly) arbitrary neural networks by composing simple
building blocks into complex computational networks
Production-ready: State-of-the-art accuracy, efficient, and scales to
multi-GPU/multi-server. http://cntk.ai

[Diagram: network with one hidden layer. Input X; hidden layer: P(1) = W(1) X + b(1), S(1) = Sigmoid(P(1)); output layer: P(2) = W(2) S(1) + b(2), O = Softmax(P(2))]

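# SDim (input), HDim (hidden) and LDim (label) dimensions are assumed to be defined elsewhere.
# Hidden layer: S1 = Sigmoid(W1 * X + B1)
B1=Parameter(HDim)
W1=Parameter(HDim, SDim)
X=Input(SDim)
labels=Input(LDim)
T1=Times(W1, X)
P1=Plus(T1, B1)
S1=Sigmoid(P1)
# Output layer: P2 = W2 * S1 + B2 (uses B2, not B1)
B2=Parameter(LDim, 1)
W2=Parameter(LDim, HDim)
T2=Times(W2, S1)
P2=Plus(T2, B2)
# Training criterion and evaluation metric
CrossEntropy=CrossEntropyWithSoftmax(labels, P2)
ErrPredict=ErrorPrediction(labels, P2)
# Node groups used by the trainer
FeatureNodes=(X)
LabelNodes=(labels)
CriteriaNodes=(CrossEntropy)
EvalNodes=(ErrPredict)
OutputNodes=(P2)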
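To make the network description concrete, here is a plain-Python sketch of the same forward pass: a sigmoid hidden layer, a linear output layer, softmax cross-entropy as the criterion and a 0/1 prediction error. It is not CNTK code. The toy dimensions, random weights and helper names are assumptions chosen only so that the example runs on its own.

```python
import math
import random


def matvec(W, x):
    """Matrix-vector product, the role of Times(W, x) in the description."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(a, b):
    """Element-wise sum, the role of Plus(a, b)."""
    return [ai + bi for ai, bi in zip(a, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-z)) for z in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in v]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, W1, b1, W2, b2):
    s1 = sigmoid(add(matvec(W1, x), b1))  # S1 = Sigmoid(W1 X + B1)
    p2 = add(matvec(W2, s1), b2)          # P2 = W2 S1 + B2
    return softmax(p2)


# Assumed toy sizes: 4 input features, 3 hidden units, 2 classes.
SDim, HDim, LDim = 4, 3, 2
rng = random.Random(0)
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(SDim)] for _ in range(HDim)]
b1 = [0.0] * HDim
W2 = [[rng.uniform(-0.5, 0.5) for _ in range(HDim)] for _ in range(LDim)]
b2 = [0.0] * LDim

x = [0.2, -0.1, 0.7, 0.05]  # made-up feature vector
label = 1                   # made-up true class index

probs = forward(x, W1, b1, W2, b2)
loss = -math.log(probs[label])                       # softmax cross-entropy
predicted = max(range(LDim), key=probs.__getitem__)  # arg max class
error = int(predicted != label)                      # 0/1 prediction error
print(probs, loss, error)
```

Training would adjust W1, b1, W2 and b2 to reduce the loss; the sketch covers evaluation only.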

https://github.com/azure/ObjectDetectionUsingCntk

Computer Vision API
Content of Image:
Categories v0: [{ "name": "animal", "score": 0.9765625 }]
V1: [
  { "name": "grass", "confidence": 0.9999992847442627 },
  { "name": "outdoor", "confidence": 0.9999072551727295 },
  { "name": "cow", "confidence": 0.99954754114151 },
  { "name": "field", "confidence": 0.9976195693016052 },
  { "name": "brown", "confidence": 0.988935649394989 },
  { "name": "animal", "confidence": 0.97904372215271 },
  { "name": "standing", "confidence": 0.9632768630981445 },
  { "name": "mammal", "confidence": 0.9366017580032349, "hint": "animal" },
  { "name": "wire", "confidence": 0.8946959376335144 },
  { "name": "green", "confidence": 0.8844101428985596 },
  { "name": "pasture", "confidence": 0.8332059383392334 },
  { "name": "bovine", "confidence": 0.5618471503257751, "hint": "animal" },
  { "name": "grassy", "confidence": 0.48627158999443054 },
  { "name": "lush", "confidence": 0.1874018907546997 },
  { "name": "staring", "confidence": 0.165890634059906 }
]
Describe
0.975 "a brown cow standing on top of a lush green field"
0.974 "a cow standing on top of a lush green field"
0.965 "a large brown cow standing on top of a lush green field"
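A client would typically keep only the tags whose confidence clears some threshold. The sketch below parses a subset of the tag list shown above and filters it. The 0.9 threshold and the helper name are assumptions, and no call to the actual service is made.

```python
import json

# A subset of the tag list from the slide, as a JSON string.
raw = """
[
  {"name": "grass", "confidence": 0.9999992847442627},
  {"name": "cow", "confidence": 0.99954754114151},
  {"name": "mammal", "confidence": 0.9366017580032349, "hint": "animal"},
  {"name": "bovine", "confidence": 0.5618471503257751, "hint": "animal"},
  {"name": "lush", "confidence": 0.1874018907546997}
]
"""


def confident_tags(tags, threshold=0.9):
    """Return the names of tags at or above the (assumed) confidence threshold."""
    return [t["name"] for t in tags if t["confidence"] >= threshold]


tags = json.loads(raw)
names = confident_tags(tags)
print(names)  # ['grass', 'cow', 'mammal']
```

Lowering the threshold to 0.5 would also admit "bovine", which shows how the cut-off trades coverage against precision.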

https://www.youtube.com/watch?v=R2mC-NUAmMk

marcel.tilly@microsoft.com olivia.klose@microsoft.com
