李宏毅 (Hung-yi Lee) / When Speech Processing Meets Deep Learning


He is currently an assistant professor in the Department of Electrical Engineering at National Taiwan University. His research direction and interest is using machine learning to let machines recognize and understand the content of speech signals. Building on deep learning, he works on forward-looking research in spoken content retrieval, automatic organization of spoken content, and extraction of key information from spoken content. These techniques have many applications, for example human-machine interaction, question-answering systems, and intelligent online tutoring platforms. He has taught the NTU course on deep learning, "Machine Learning and Having It Deep and Structured" (機器學習及其深層與結構化).


  1. Deep Learning and its Application on Speech Processing. Hung-yi Lee
  2. Speech Recognition: Spoken Content → [Speech Recognition] → Recognition Output. How do we do speech recognition with deep learning?
  3. People imagine …… a DNN that maps the audio directly to "大家好 我今天 …." ("Hello everyone, today I ...."). This is not true! A DNN can only take fixed-length vectors as input and output, but here the input and output are sequences with different lengths.
  4. Recurrent Neural Network: how about a Recurrent Neural Network (RNN)? An RNN maps an input sequence x1, x2, x3, …… to an output sequence y1, y2, y3, ……, reusing the same input weights Wi, recurrent weights Wh, and output weights Wo at every time step.
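The recurrence on the slide above can be sketched in a few lines. Wi, Wh, and Wo are the input, recurrent, and output weight matrices named on the slide; the matrix sizes and values here are illustrative only, not from the talk.

```python
import math

def rnn_forward(xs, Wi, Wh, Wo):
    """Run a simple (Elman-style) RNN over the input sequence xs.

    The same three weight matrices Wi, Wh, Wo are reused at every
    time step, so the output sequence always has exactly as many
    elements as the input sequence -- one y per x.
    """
    def matvec(W, v):
        return [sum(w * x for w, x in zip(row, v)) for row in W]

    h = [0.0] * len(Wh)   # initial hidden state (all zeros)
    ys = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(Wi, x), matvec(Wh, h))]
        h = [math.tanh(p) for p in pre]   # new hidden state
        ys.append(matvec(Wo, h))          # one output per input frame
    return ys

# Three input frames in -> three output vectors out.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Wi = [[0.5, -0.3], [0.1, 0.8]]
Wh = [[0.2, 0.0], [0.0, 0.2]]
Wo = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ys = rnn_forward(xs, Wi, Wh, Wo)
```

This makes the slide's limitation concrete: the vanilla RNN emits one output per input frame, which is exactly why the trimming problem on the next slide arises.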
  5. Recurrent Neural Network: how about a Recurrent Neural Network (RNN)? Input: vector sequence (one acoustic feature vector per 0.01 s); output: character sequence. The frame-level outputs 好 好 好 棒 棒 棒 棒 棒 are trimmed to "好棒". Problem? Why can't it be "好棒棒"?
  6. Recurrent Neural Network: Connectionist Temporal Classification (CTC) [Alex Graves, ICML'06][Alex Graves, ICML'14][Haşim Sak, Interspeech'15][Jie Li, Interspeech'15][Andrew Senior, ASRU'15]. Add an extra symbol "φ" representing "null": 好 φ φ 棒 φ φ φ φ gives "好棒", while 好 φ φ 棒 φ 棒 φ φ gives "好棒棒".
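The φ-based decoding rule on the slide (merge consecutive duplicates, then drop φ) can be written out directly. This is a sketch of CTC's collapsing step only, not the full CTC training loss.

```python
def ctc_collapse(frames, blank="φ"):
    """Collapse a frame-level CTC output into a label sequence:
    keep a symbol only when it differs from the previous frame
    (merging repeats), and never keep the blank symbol."""
    out = []
    prev = None
    for sym in frames:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)

# The two examples from the slide:
a = ctc_collapse(["好", "φ", "φ", "棒", "φ", "φ", "φ", "φ"])  # "好棒"
b = ctc_collapse(["好", "φ", "φ", "棒", "φ", "棒", "φ", "φ"])  # "好棒棒"
```

The blank between the two 棒 frames in the second example is what lets the repeated character survive, which answers the "why can't it be 好棒棒" question from the previous slide.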
  7. Sequence-to-sequence Learning: both input and output are sequences with different lengths, e.g. acoustic feature sequence → character sequence "機器學習". The encoder's final state contains all the information about the input utterance.
  8. Sequence-to-sequence Learning: the decoder outputs 機 器 學 習 …… and then keeps going (慣 性 ……); it doesn't know when to stop.
  9. Sequence-to-sequence Learning: add a symbol "。" (period) that lets the decoder end the sequence by itself [Ilya Sutskever, NIPS'14][Dzmitry Bahdanau, arXiv'15].
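The stop-symbol trick can be sketched as a decoding loop that runs until the model emits the period token. `step_fn` here is a hypothetical stand-in for the trained decoder, not the model from the talk.

```python
def greedy_decode(step_fn, start_state, eos="。", max_len=50):
    """Repeatedly ask the decoder for the next character and stop
    when it emits the end-of-sequence symbol '。' (max_len is a
    safety net for decoders that never emit it)."""
    state, out = start_state, []
    for _ in range(max_len):
        char, state = step_fn(state)
        if char == eos:
            break
        out.append(char)
    return "".join(out)

# Toy "decoder": replays a fixed script and would babble forever
# (慣 性 ...); only the eos symbol makes decoding terminate.
script = ["機", "器", "學", "習", "。", "慣", "性"]
def toy_step(i):
    return script[i % len(script)], i + 1

result = greedy_decode(toy_step, 0)  # "機器學習"
```

Without the "。" check, the loop would happily keep emitting characters past the intended word, which is precisely the slide-8 failure mode.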
  10. Spoken Content Retrieval: Spoken Content → [Speech Recognition] → Recognition Output → [Retrieval] → Retrieval Result.
  11. People think ……: transcribe the spoken content into text by speech recognition, then use a text retrieval approach to search the transcriptions. Spoken Content → [Speech Recognition Models] → Text → [Text Retrieval] → Retrieval Result, with the learner's query going into text retrieval. The whole pipeline is treated as a black box.
  12. People think ……: Spoken Content Retrieval = Speech Recognition + Text Retrieval.
  13. Problem? Good spoken content retrieval needs a good speech recognition system, but in real applications such high-quality recognition models are not available (e.g. YouTube: different languages/accents, different recording environments). Hope for spoken content retrieval: don't completely rely on accurate speech recognition; achieve accurate spoken content retrieval even under poor speech recognition.
  14. Beyond Cascading? Is the cascading of speech recognition and text retrieval the only solution for spoken content retrieval?
  15. Beyond Cascading Speech Recognition and Text Retrieval: 5 directions: modified speech recognition for retrieval purposes; exploiting information not present in ASR outputs; directly matching on the acoustic level without ASR; semantic retrieval of spoken content; interactive retrieval and efficient presentation of retrieved objects. Overview paper: "Spoken Content Retrieval — Beyond Cascading Speech Recognition with Text Retrieval", http://speech.ee.ntu.edu.tw/~tlkagk/paper/Overview.pdf
  16. Our Point: Spoken Content Retrieval ≠ Speech Recognition + Text Retrieval.
  17. Interact with Humans: add interaction with the user to the pipeline (Spoken Content → Speech Recognition → Retrieval → Retrieval Result ↔ user).
  18. Semantic Analysis: add semantic analysis of the recognition output to the pipeline.
  19. Unsupervised Learning: the machine reads lots of text on the Internet, e.g. "蔡英文 520宣誓就職" and "馬英九 520宣誓就職" ("… sworn into office on May 20"), so 蔡英文 and 馬英九 are something very similar. "You shall know a word by the company it keeps."
  20. Semantic Analysis: let the machine read lots of documents; each word is represented as a vector, so dog, cat, and rabbit end up near each other, as do jump and run, and flower and tree.
  21. Semantic Analysis: even the distances between the vectors have some meaning. Source: http://www.slideshare.net/hustwj/cikm-keynotenov2014
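The "distances have meaning" point can be illustrated with cosine similarity on toy 2-D vectors. These embeddings are made up for illustration; real word vectors have hundreds of dimensions and are learned from text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Toy embeddings: animal words point one way, action words another.
vec = {
    "dog":  (0.9, 0.1),
    "cat":  (0.8, 0.2),
    "jump": (0.1, 0.9),
    "run":  (0.2, 0.8),
}

sim_dog_cat = cosine(vec["dog"], vec["cat"])    # semantically close
sim_dog_jump = cosine(vec["dog"], vec["jump"])  # semantically far
```

With vectors like these, "dog is more similar to cat than to jump" becomes a plain numeric comparison, which is the property the retrieval and QA systems later in the talk rely on.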
  22. Key Term Extraction: add key term extraction to the pipeline. [Interspeech 2015] (with 沈昇勳)
  23. Summarization: add summarization to the pipeline.
  24. Speech Summarization: select the most informative segments of a retrieved audio file to form a compact version (extractive summaries), e.g. 1 hour long → 10 minutes. Ref: http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015/Structured%20Lecture/Summarization%20Hidden_2.ecm.mp4/index.html
  25. Speech Summarization: write the summary in one's own words (abstractive summaries). The machine learns to do abstractive summarization from 2,000,000 training examples; human and machine summaries are compared. 台大電機系 盧柏儒、徐翊祥, 台大資工系 葉正杰、周儒杰 (TA: 余朗祺)
  26. Question Answering: add question answering to the pipeline (the user asks a question, the system returns an answer).
  27. Without Speech Recognition? Can the same pipeline work without speech recognition?
  28. Outline: Very Brief Introduction of Deep Learning; Towards Machine Comprehension of Spoken Content: Overview; Example I: Speech Question Answering; Example II: Interactive Spoken Content Retrieval; Example III: What can machine learn from audio without any supervision.
  29. Speech Question Answering: the machine answers questions based on the information in spoken content, e.g. "What is a possible origin of Venus' clouds?" → answer.
  30. Speech Question Answering: TOEFL Listening Comprehension Test by Machine. Example question: "What is a possible origin of Venus' clouds?" Audio story (the original story is 5 min long). Choices: (A) gases released as a result of volcanic activity; (B) chemical reactions caused by high surface temperatures; (C) bursts of radio energy from the planet's surface; (D) strong winds that blow dust into the atmosphere.
  31. Simple Baselines: accuracy (%) of naive approaches (1)-(7), e.g. random guessing, (4) select the choice most semantically similar to the other choices, (2) select the shortest choice as the answer. Experimental setup: 717 questions for training, 124 for validation, 122 for testing.
  32. Supervised Learning: Memory Network (proposed by the FB AI group): 39.2% accuracy, above the naive approaches. Interspeech 2016 (with 曾柏翔)
  33. Model Architecture: the question ("what is a possible origin of Venus …") goes through semantic analysis to get the question semantics; the audio story goes through speech recognition and semantic analysis; attention (highlighting) picks out the relevant parts of the processed story text ("…… It be quite possible that this be due to volcanic eruption because volcanic eruption often emit gas. If that be the case volcanism could very well be the root cause of Venus 's thick cloud cover. And also we have observe burst of radio energy from the planet 's surface. These burst be similar to what we see when volcano erupt on earth ……"); an answer representation is produced, and the choice most similar to the answer is selected. Similar to a Memory Network.
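The attention ("highlighting") step in the architecture above can be sketched as a softmax-weighted sum of story-sentence vectors, scored against the question vector. This is a generic memory-network-style attention on toy vectors, not the exact model from the paper.

```python
import math

def attend(question, memories):
    """Score each memory vector against the question (dot product),
    softmax the scores into weights, and return the weighted sum --
    the 'highlighted' portion of the story."""
    scores = [sum(q * m for q, m in zip(question, mem)) for mem in memories]
    mx = max(scores)                       # subtract max for stability
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(question)
    pooled = [sum(w * mem[i] for w, mem in zip(weights, memories))
              for i in range(dim)]
    return weights, pooled

q = [1.0, 0.0]                       # toy question-semantics vector
sents = [[0.9, 0.1], [0.0, 1.0]]     # two toy story-sentence vectors
weights, pooled = attend(q, sents)   # first sentence matches better
```

The pooled vector plays the role of the "answer" representation on the slide, which is then compared against each multiple-choice option.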
  34. Model Architecture: word-based attention.
  35. Model Architecture: sentence-based attention.
  36. (Figure: attention weights over the story for choices (A) and (B).)
  37. Supervised Learning: accuracy (%). Memory Network (proposed by the FB AI group): 39.2%; word-based attention: 48.3%; both above the naive approaches (1)-(7). Interspeech 2016 (with 曾柏翔)
  38. Outline: Very Brief Introduction of Deep Learning; Towards Machine Comprehension of Spoken Content: Overview; Example I: Speech Question Answering; Example II: Interactive Spoken Content Retrieval; Example III: What can machine learn from audio without any supervision.
  39. Interact with Users: interactive retrieval is helpful. User: "深度學習" ("deep learning"). System: 和機器學習有關的"深度學習"嗎? 還是和教育有關的"深度學習"呢? ("Do you mean 'deep learning' related to machine learning, or 'deep learning' related to education?")
  40. Audio is hard to browse: when the system returns the retrieval results, the user doesn't know what he/she got at first glance.
  41. Interactive retrieval of spoken content: the user sends a query, the system searches the spoken content and returns results. Directly showing the retrieval results is probably not a good idea.
  42. Interactive retrieval of spoken content: instead, the system may respond with "Give me an example.", "Is it relevant to XXX?", "Can you give me another query?", or "Show the results." Given the current situation, which action should be taken?
  43. Interactive retrieval of spoken content: state estimation → action decision. The state is the degree of clarity estimated from the retrieval results; the policy π(s) is a function whose input is a state s and whose output is an action a; the actions are decided by this intrinsic policy π(s). [Interspeech 2012][ICASSP 2013]
  44. Interactive retrieval of spoken content: a DNN takes the features, performs state estimation and action decision together, and outputs one score per action ("Is it relevant to XXX?", "Give me an example.", "Show the results."); the action with the max score is taken.
  45. Interactive retrieval of spoken content: the DNN is learned from historical interaction. Goal: maximizing the return (Retrieval Quality − User labor).
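The DNN on the slides above scores every candidate action from the current features and the system takes the max. A sketch of that argmax policy, with a hypothetical stand-in scorer in place of the trained network (the real features and network are not specified in the slides):

```python
ACTIONS = ["Is it relevant to XXX?",
           "Give me an example.",
           "Show the results."]

def choose_action(features, score_fn):
    """Score each candidate action with the (trained) scoring
    function and take the argmax -- the 'Max' box on the slide."""
    scores = [score_fn(features, a) for a in ACTIONS]
    best = max(range(len(ACTIONS)), key=scores.__getitem__)
    return ACTIONS[best], scores

# Stand-in scorer: if the retrieval result already looks clear
# (high clarity feature), showing the results scores highest;
# otherwise asking a clarifying question does.
def toy_score(features, action):
    clarity = features[0]
    return clarity if action == "Show the results." else 1.0 - clarity

action, scores = choose_action([0.9], toy_score)  # "Show the results."
```

In the actual system the scorer is a DNN trained by reinforcement learning to maximize the return (retrieval quality minus user labor), rather than a hand-written rule like `toy_score`.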
  46. Deep Reinforcement Learning
  47. Experimental Results: broadcast news, semantic retrieval. Metric: retrieval quality (MAP); optimization target: Retrieval Quality − User labor. Compared: hand-crafted vs. deep learning, and the previous method (state + decision). Submitted to Interspeech 2016 (with 吳彥諶、林子翔)
  48. Experimental Results
  49. Outline: Very Brief Introduction of Deep Learning; Towards Machine Comprehension of Spoken Content: Overview; Example I: Speech Question Answering; Example II: Interactive Spoken Content Retrieval; Example III: What can machine learn from audio without any supervision.
  50. Unsupervised Learning: the machine listens to lots of audio books. (TA: ) "Audio Word2Vec: Unsupervised Learning of Audio Segment Representations using Sequence-to-sequence Autoencoder" (accepted by Interspeech 2016).
  51. Audio Word to Vector: consider an audio segment corresponding to an unknown word, e.g. "Deep", "Learning", "with". (TA: 沈家豪)
  52. Audio Word to Vector: the audio segments corresponding to words with similar pronunciations are close to each other.
  53. Audio Word to Vector: the audio segments corresponding to words with similar pronunciations are close to each other, e.g. ever/never and dog/dogs cluster together.
  54. Sequence Auto-encoder
  55. How to evaluate: compare the cosine similarity between the learned vectors (e.g. for "never" and "ever") with the phoneme-sequence edit distance.
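The evaluation pairs cosine similarity of the learned audio vectors with phoneme-sequence edit distance. The edit distance itself is standard Levenshtein distance, sketched here on the slide's example words; the phoneme strings are illustrative ARPAbet-style transcriptions, not taken from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences, computed
    row by row over the dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

# "never" vs "ever" differ by one leading phoneme (distance 1), so a
# good audio embedding should give them high cosine similarity; a
# distant pair like "never" vs "deep" should score much lower.
d_close = edit_distance(["N", "EH", "V", "ER"], ["EH", "V", "ER"])  # 1
d_far = edit_distance(["N", "EH", "V", "ER"], ["D", "IY", "P"])     # 4
```

Plotting cosine similarity against this distance is how the next slide's "more similar pronunciation, larger cosine similarity" trend is measured.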
  56. Experimental Results: more similar pronunciation goes with larger cosine similarity.
  57. Interesting Observation: projecting the embedding vectors to 2-D, e.g. day, days, say, says.
  58. Spoken Content Retrieval without Speech Recognition: the user speaks a query (e.g. "US President") into a handheld device, and the similarity between spoken queries and audio files is computed on the signal level. [Hazen, ASRU 09][Zhang & Glass, ASRU 09][Chan & Lee, Interspeech 10][Zhang & Glass, ICASSP 11][Gupta, Interspeech 11][Zhang & Glass, Interspeech 11][Huijbregts, ICASSP 11][Chan & Lee, Interspeech 11]
  59. Spoken Content Retrieval without Speech Recognition: why? There are lots of audio files in different languages on the Internet; most languages have little annotated data for training speech recognition systems; some audio files are produced in several different languages; some languages do not even have a written form.
  60. Spoken Content Retrieval without Speech Recognition
  61. Retrieval Performance
  62. Concluding Remarks: Very Brief Introduction of Deep Learning; Towards Machine Comprehension of Spoken Content: Overview; Example I: Speech Question Answering; Example II: Interactive Spoken Content Retrieval; Example III: What can machine learn from audio without any supervision.
  63. Thank You for Your Attention
