Audio-Visual Speech Processing presentation at the IEEE conference on Advanced Technologies for Signal and Image Processing in Sousse, Tunisia, March 18th, 2014
1. Audio-Visual Speech Processing
Gérard Chollet
with Meriem Bendris, Hervé Bredin, Thomas Hueber,
Walid Karam, Rémi Landais, Patrick Perrot,
Eduardo Sanchez-Soto, Leila Zouari
ATSIP, Sousse, March 18th 2014
2. Page 2 ATSIP, Sousse, May 18th, 2014
Some motivations,…
■ A talking face is more intelligible, expressive,
recognisable, and attractive than acoustic speech
alone.
■ The combined use of facial and speech
information improves identity verification and
robustness to forgeries.
■ Multi-stream models of the synchrony of visual
and acoustic information have applications in
the analysis, coding, recognition and synthesis
of talking faces.
■ SmartPhones, VisioPhones, WebPhones,
SecurePhones, Visio Conferences, Virtual
Reality worlds are gaining popularity.
3.
Some topics under study,…
■ Audio-visual speech recognition
– Automatic ‘lip-reading’
■ Audio-visual speaker verification
– Detection of forgeries
■ Speech driven animation of the face
– Could we look and sound like somebody else?
■ Speaker indexing
– ‘Who is talking in a video sequence?’
■ OUISPER : a silent speech interface
– Corpus based synthesis from tongue and lips
4.
Audio Visual Speech Recognition
[Block diagram: feature extraction → decoder, using acoustic models, a dictionary and a grammar]
5.
Video Mike (IBM, 2004)
7.
Video processing
■ Video extraction
■ Lip localisation
■ Image interpolation
(same frequency as speech)
■ Feature extraction
• DCT and DCT2 (DCT+LDA)
• Projections: PRO and PRO2
(PRO+LDA)
■ Recognition experiments
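The DCT features listed above can be sketched in a few lines of numpy. This is a minimal illustration, not the system used in the experiments: the ROI size, the number of retained coefficients and the function names are all assumptions.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix (n x n)."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] /= np.sqrt(2)
    return m * np.sqrt(2.0 / n)

def lip_dct_features(roi, keep=6):
    """2D DCT of a grayscale lip ROI; keep the top-left
    (low-frequency) keep x keep block as the feature vector."""
    h, w = roi.shape
    coeffs = dct_matrix(h) @ roi @ dct_matrix(w).T
    return coeffs[:keep, :keep].ravel()

# toy 16x16 "lip image" (illustrative data only)
roi = np.outer(np.hanning(16), np.hanning(16))
feat = lip_dct_features(roi, keep=4)
```

The DCT2 variant of the slides would then pass such vectors through LDA for class-discriminant dimensionality reduction.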
8.
Fusion techniques
■ Feature (parameter) fusion:
• Concatenation
• Dimensionality reduction: Linear Discriminant Analysis (LDA)
• Modelling: classical HMM with one stream
■ Score fusion: multi-stream HMM
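In a multi-stream HMM, each state scores the audio and video observations separately and combines the per-stream log-likelihoods with stream weights (exponents). A minimal sketch of that emission score, with illustrative weights and single-Gaussian states (the real system's values are not given in the talk):

```python
import numpy as np

def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of x under a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def multistream_score(audio_obs, video_obs, state, w_audio=0.7, w_video=0.3):
    """Multi-stream emission score: weighted sum of per-stream
    log-likelihoods (stream weights sum to 1)."""
    la = diag_gauss_loglik(audio_obs, state["a_mean"], state["a_var"])
    lv = diag_gauss_loglik(video_obs, state["v_mean"], state["v_var"])
    return w_audio * la + w_video * lv

# hypothetical state with 13 MFCCs and 6 visual features
state = {"a_mean": np.zeros(13), "a_var": np.ones(13),
         "v_mean": np.zeros(6), "v_var": np.ones(6)}
s = multistream_score(np.zeros(13), np.zeros(6), state)
```

In practice the weights are tuned to the noise level, which is why score fusion helps most at low SNR.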
9.
Experimental results:
parameter fusion
[Plot: accuracy (0–100 %) vs. S/N (−15 to +10 dB) for speech only, video only (PRO2, DCT2) and AV fusion (PRO2, DCT2)]
10.
Experimental results:
score fusion at −5 dB
[Bar chart, accuracy from 42 to 52 %: speech only, AV: PRO, AV: PRO2, AV: DCT, AV: DCT2]
11.
Audiovisual identity verification
■ Fusion of face and speech for identity verification
■ Detection of possible forgeries
■ Compulsory? For:
– Homeland/firms security: restricted access,…
– Secured computer login
– Secured on-line signature of contracts
12.
Talking-face and
2D face sequence database
■ Data: video sequences (.avi) in which a short phrase in English is
pronounced; duration ≈ 10 s (actual speech duration ≈ 2 s)
■ Audio-video data used for talking faces evaluations
■ Same sequences used for 2D face from video sequences evaluations
■ 430 subjects each pronounced 4 phrases:
– from a set of 430 English phrases
– 2 indoor video files acquired during the first session
– 2 outdoor video files acquired during the second session
– realistic forgeries created a posteriori
13.
Audio-Visual Speech Features
■ Visual features: raw pixel values, DCT transform, shape-related, many others…
■ Audio features: raw amplitude, « classical » MFCC coefficients, many others…
14.
Audio-Visual Subspaces
■ Reduced audiovisual subspace: Principal Component & Linear Discriminant Analysis
■ Correlated audio & visual subspaces: Co-inertia & Canonical Correlation Analysis
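The correlated-subspace idea can be sketched with a small numpy CCA: whiten the two feature streams, take the SVD of their cross-covariance, and read the canonical correlations off the singular values. The toy data (a latent variable shared by both streams) is purely illustrative.

```python
import numpy as np

def cca(X, Y, k=2, eps=1e-8):
    """Canonical Correlation Analysis via SVD of the whitened
    cross-covariance. X: (n, dx) audio features, Y: (n, dy) visual
    features. Returns projection bases and canonical correlations."""
    X = X - X.mean(0); Y = Y - Y.mean(0)
    n = len(X)
    Cxx = X.T @ X / n + eps * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + eps * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # inverse square root via eigendecomposition
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(M)
    A = inv_sqrt(Cxx) @ U[:, :k]     # audio projection basis
    B = inv_sqrt(Cyy) @ Vt[:k].T     # visual projection basis
    return A, B, s[:k]

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))                  # shared latent "speech" source
X = np.hstack([z, rng.normal(size=(500, 3))])  # audio stream
Y = np.hstack([z, rng.normal(size=(500, 2))])  # visual stream
A, B, corr = cca(X, Y, k=1)
```

Co-inertia analysis follows the same pattern but maximises covariance rather than correlation, which makes it more robust when the per-stream covariances are poorly estimated.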
16.
Application to indexing
■ High-level requests
– “Find videos where John Doe is speaking”
– “Find dialogues between Mr X and Mrs Y”
– “Locate the singer in this music video”
[Diagram: correlation between raw audio energy and raw pixel values]
17.
Who is speaking?
■ Face tracking
■ Correlation between
– pixels of each face
– raw audio energy
■ Find maximum synchrony
(green: current speaker)
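The "who is speaking" decision above can be sketched as picking the face whose motion correlates best with the audio energy. The helper name and the toy signals are assumptions; a real system would use tracked face regions and frame differences.

```python
import numpy as np

def speaking_face(audio_energy, face_pixel_series):
    """Pick the face whose pixel-variation signal correlates best
    with the audio energy envelope.
    audio_energy: (T,); face_pixel_series: list of (T,) arrays,
    e.g. mean absolute frame difference in each face region."""
    def corr(a, b):
        a = a - a.mean(); b = b - b.mean()
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    scores = [corr(audio_energy, f) for f in face_pixel_series]
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(1)
T = 200
energy = np.abs(rng.normal(size=T))
speaker = energy + 0.3 * rng.normal(size=T)   # mouth moves with the audio
silent = np.abs(rng.normal(size=T))           # unrelated motion
idx, scores = speaking_face(energy, [silent, speaker])
```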
18.
How to Perform “Talking-Face” Authentication?
Face recognition + speaker verification → score fusion
What if…? Both modalities say OK under a deliberate imposture.
19.
Biometrics
■ Identity Verification with Talking Faces
– Speaker Verification
– Face Recognition
■ What if? Face: OK, voice: OK, yet the correct decision is NO.
20.
Identity Verification
■ Enrolment of client λ → model for client λ (Co-Inertia Analysis)
■ Person ε pretending to be client λ is accepted if the score against the model exceeds a threshold, rejected otherwise
■ Equal Error Rate: 30 %
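The Equal Error Rate quoted here is the operating point where the false-rejection and false-acceptance rates coincide. A minimal sketch of how it is computed from client and impostor score distributions (the score values below are synthetic, not from the experiment):

```python
import numpy as np

def equal_error_rate(client_scores, impostor_scores):
    """Sweep the decision threshold and return the point where
    false-rejection and false-acceptance rates are (nearly) equal."""
    thresholds = np.sort(np.concatenate([client_scores, impostor_scores]))
    best_gap, best_eer = 1.0, 1.0
    for t in thresholds:
        frr = np.mean(client_scores < t)     # genuine users rejected
        far = np.mean(impostor_scores >= t)  # impostors accepted
        if abs(frr - far) < best_gap:
            best_gap, best_eer = abs(frr - far), (frr + far) / 2
    return best_eer

rng = np.random.default_rng(2)
clients = rng.normal(1.0, 1.0, 1000)    # genuine-access scores
impostors = rng.normal(-1.0, 1.0, 1000) # impostor-access scores
eer = equal_error_rate(clients, impostors)
```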
21.
Replay Attack Detection
■ Training: build a synchrony model (Co-IA, CCA)
■ Test: accepted if the test sequence matches the synchrony model, rejected otherwise
22.
Replay Attack Detection
■ Genuine synchronized video vs. audio replay attack: the lips do not match the audio perfectly
■ Equal Error Rate: 14 %
23.
Example of Replay attacks
24.
[Plot: correlation vs. audio/video delay (−5 … 0 … +5); alignment by maximum correlation]
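The alignment-by-maximum-correlation step can be sketched directly: slide one feature stream against the other over a small lag window and keep the lag with the highest normalised correlation. Function names and the toy signals are assumptions.

```python
import numpy as np

def best_lag(audio_feat, video_feat, max_lag=5):
    """Find the audio/video delay (in frames) maximising the
    normalised correlation between the two feature streams."""
    def ncorr(a, b):
        a = a - a.mean(); b = b - b.mean()
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    lags = list(range(-max_lag, max_lag + 1))
    scores = []
    for lag in lags:
        if lag >= 0:
            s = ncorr(audio_feat[lag:], video_feat[:len(video_feat) - lag])
        else:
            s = ncorr(audio_feat[:lag], video_feat[-lag:])
        scores.append(s)
    return lags[int(np.argmax(scores))]

rng = np.random.default_rng(3)
v = rng.normal(size=300)   # toy video feature stream
a = np.roll(v, 3)          # audio stream delayed by 3 frames
lag = best_lag(a, v)
```

A genuine recording peaks sharply near zero lag; a replay attack tends to produce a flat, low correlation profile at every lag.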
25.
Audiovisual identity verification
■ Available features (from video)
– face features (lips, eyes) → face modality
– speech → speech modality
– speech/lip synchrony → synchrony modality
26.
Audiovisual identity verification
■ Face modality
– Detection:
• Generative models (MPT toolbox)
• Temporal median filtering
• Eye detection within faces
– Normalization: geometry + illumination
27.
Audiovisual identity verification
■ Face Modality:
– Two verification strategies within a single comparison framework
• Global = eigenfaces:
– compute a set of directions (eigenfaces) defining a projection space
– two faces are compared through their projections onto the eigenface space
– training data: BIOMET (130 pers.) + BANCA (30 pers.)
29.
Audiovisual identity verification
■ Face Modality:
• SVD-based matching method:
– compares two videos V1 and V2
– exclusive principle: one-to-one correspondences between faces (global) and descriptors (local)
– principle: compute a proximity matrix between faces or descriptors, then extract good pairings (made easy by SVD computation)
– scores: one matching score between global representations, one between local representations
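One classical way to extract exclusive pairings from a proximity matrix with an SVD (in the spirit of Scott & Longuet-Higgins) is sketched below; the kernel width and descriptor sizes are assumptions, and the deck does not specify which variant was used.

```python
import numpy as np

def svd_pairings(desc1, desc2, sigma=1.0):
    """One-to-one pairings between two descriptor sets.
    G[i, j] = exp(-||d1_i - d2_j||^2 / (2 sigma^2)); the SVD of G with
    its singular values replaced by ones yields an orientation matrix P
    whose mutual row/column maxima give exclusive correspondences."""
    d2 = ((desc1[:, None, :] - desc2[None, :, :]) ** 2).sum(-1)
    G = np.exp(-d2 / (2 * sigma ** 2))
    U, s, Vt = np.linalg.svd(G)
    k = min(G.shape)
    P = U[:, :k] @ Vt[:k]
    pairs = []
    for i in range(P.shape[0]):
        j = int(np.argmax(P[i]))
        if int(np.argmax(P[:, j])) == i:   # mutual maximum -> exclusive pair
            pairs.append((i, j))
    return pairs

rng = np.random.default_rng(5)
desc1 = rng.normal(size=(5, 8))
perm = np.array([2, 0, 1, 4, 3])
desc2 = desc1[perm] + 0.01 * rng.normal(size=(5, 8))  # permuted + noise
pairs = svd_pairings(desc1, desc2)
```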
31.
Audiovisual identity verification
■ Speech Modality:
– GMM-based approach:
• One world model
• Each speaker model is derived from the
World Model by MAP adaptation
• Speech verification score: derived from
likelihood ratio
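The GMM-UBM recipe above (world model, MAP-adapted client means, likelihood-ratio score) can be sketched with a tiny diagonal-covariance model. All model values and data below are toy illustrations, not the system's actual parameters.

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Frame-average log-likelihood under a diagonal-covariance GMM."""
    ll = [np.log(w) - 0.5 * np.sum(np.log(2 * np.pi * v))
          - 0.5 * np.sum((X - m) ** 2 / v, axis=1)
          for w, m, v in zip(weights, means, variances)]
    return np.mean(np.logaddexp.reduce(ll, axis=0))

def map_adapt_means(X, weights, means, variances, r=16.0):
    """MAP adaptation of world-model means toward client data
    (relevance factor r); weights and variances are kept fixed."""
    logp = np.array([np.log(w) - 0.5 * np.sum(np.log(2 * np.pi * v))
                     - 0.5 * np.sum((X - m) ** 2 / v, axis=1)
                     for w, m, v in zip(weights, means, variances)])
    post = np.exp(logp - np.logaddexp.reduce(logp, axis=0))
    n = post.sum(axis=1)                 # soft counts per component
    ex = post @ X / n[:, None]           # component-wise data means
    alpha = (n / (n + r))[:, None]       # adaptation coefficients
    return alpha * ex + (1 - alpha) * means

# toy 2-component world model and client data
weights = np.array([0.5, 0.5])
means = np.array([[-1.0, 0.0], [1.0, 0.0]])
variances = np.ones((2, 2))
rng = np.random.default_rng(6)
client = rng.normal([2.0, 0.5], 0.5, size=(200, 2))
client_means = map_adapt_means(client, weights, means, variances)
test_utt = rng.normal([2.0, 0.5], 0.5, size=(50, 2))
llr = gmm_loglik(test_utt, weights, client_means, variances) \
    - gmm_loglik(test_utt, weights, means, variances)
```

The verification decision then compares `llr` with a threshold tuned on development data.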
32.
Audiovisual identity verification
■ Synchrony Modality:
– Principle: synchrony between lips and
speech carries identity information
– Process:
• Computation of a synchrony model (CoIA
analysis) for each person based on DCT
(visual signal) and MFCC (speech signal)
• Comparison of the test sample with the
synchrony model
33.
Audiovisual identity verification
■ Experiments:
– BANCA database:
• 52 persons divided into two groups (G1 and G2)
• 3 recording conditions
• 1 person → 8 recordings (4 client accesses, 4 impostor accesses)
• evaluation based on the P protocol: 234 client accesses and 312 impostor accesses
– Scores:
• 4 scores per access (PCA face, SIFT face, speech, synchrony)
• score fusion based on an RBF-SVM: hyperplane learned on G1, tested on G2 (and conversely)
35.
SecurePhone
■ Technical solution that improves security
■ Biometric recognition
– Makes use of VOICE, FACE and SIGNATURE
■ Electronic signature used to secure information exchange
36.
Biometrics in SecurePhone
■ Operation
[Diagram: face, voice and written signature are each pre-processed and modelled; the three scores are fused → access granted or access denied]
37.
The BioSecure Multimodal Evaluation Campaign
■ Launched in April 2007
■ Many modalities including ‘Video sequences’ and
‘Talking Faces’
■ Development data and reference systems available
■ Evaluations on the sequestered BioSecure database
(1000 clients)
■ Debriefing workshop
■ More info on :
http://www.int-evry.fr/biometrics/BMEC2007/index.php
38.
Audio-visual forgery scenarios
■ Low-effort
– “Paparazzi” scenario
• The impostor owns a picture of the face and a recording of the voice of the target
– “Big Brother” scenario
• The impostor owns a video of the face and a recording of the voice of the target
■ High-effort
– “Imitator” scenario
• The impostor owns a recording of the voice of the target and transforms his own voice to sound like the target
– “Playback” scenario
• The impostor owns a picture of the face of the target and animates it according to his own face motion
– “Ventriloquist” scenario
• Combines the two previous ones
39.
Detection of imposture
Face modality:
ACCEPTED!
Voice modality:
ACCEPTED!
Synchronisation:
DENIED!
40.
Talking-Face forgeries @ BMEC
■ Audio replay attack: assumptions
– the forger has recorded speech data from the genuine user in outdoor (test) conditions
– the forger replays the audio and uses his own face in front of the sensor
[Images: stolen wave; audio replay + forger face; audio replay + “random” face]
41.
Talking-Face forgeries @ BMEC
■ Replay attack with face animation + TTS (CrazyTalk): assumptions
– the forger has stolen a picture
– the forger uses face-animation software and TTS (male or female)
– the forger plays back the animation to the sensor
[Images: stolen picture → contour detection → generated .avi]
42.
Talking-Face forgeries @ BMEC
■ Replay attack with picture presentation + TTS: assumptions
– the forger has stolen and printed a picture
– the forger presents the picture to the sensor and uses TTS (same wave as for the face-animation forgery)
[Images: stolen picture; presented picture]
43.
Systems with fusion of (face, speech) scores
[Diagram: the video sequence yields frames → face verification → face score, and a speech signal → speaker verification → speech score; the two scores are combined into a fusion score]
44.
Voice Conversion methods
■ GMM conversion
– Training of a joint Gaussian model
• parallel corpus of aligned sentences of both source and target voices
• MFCC on HNM (Harmonic plus Noise Model) parameterization
– Speech synthesis from the Gaussian model
• inversion of the MFCC
• pitch correction
■ ALISP conversion
– Very low bit-rate speech compression (500 bps) method
• originally developed by TELECOM-ParisTech
– Indexed dictionary of segments of the target voice
– HNM parameterization
45.
Voice conversion techniques
Definition: the process of making one person's voice (« source ») sound like another person's voice (« target »)
[Example: source utterance “My name is John” → voice conversion → target-sounding “My name is John”]
46.
Principle of ALISP
[Coder: input speech → spectral analysis and prosodic analysis → selection of segmental units from a dictionary of representative segments → segment index + prosodic parameters. Decoder: concatenative HNM synthesis from the same dictionary → output speech]
47.
Details of Encoding
[Diagram: speech → spectral and prosodic analysis → HMM recognition against a dictionary of HMM models of ALISP classes → index of ALISP class; within the recognised class (e.g. HMM A with representative units A1…A8), the unit is selected by DTW → index of synth. unit; prosodic encoding → pitch, energy, duration]
48.
Details of decoding
[Diagram: ALISP class index + synth-unit index within the class + prosodic parameters → loading of the synthesis unit (A1…A8) → concatenative synthesis → output speech]
49.
Principle of ALISP conversion
Learning step: one hour of target voice
- parametric analysis: MFCC
- segmentation based on temporal decomposition and vector quantization
- stochastic modelling based on HMM
- creation of representative units
Conversion step
- parametric analysis: MFCC
- HMM recognition
- selection of the representative segment → DTW
Synthesis step
- concatenation of representative units
- HNM synthesis
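The DTW selection step above can be sketched with a textbook dynamic-time-warping distance: among the representative units of the recognised class, keep the one whose warped distance to the input segment is smallest. The toy segments below are illustrative, not ALISP units.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two feature sequences
    (frames x dims), with the standard three-way recursion."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def select_representative(segment, candidates):
    """Index of the candidate unit closest to the recognised segment."""
    return int(np.argmin([dtw_distance(segment, c) for c in candidates]))

t20 = np.linspace(0, 1, 20)[:, None]
t30 = np.linspace(0, 1, 30)[:, None]
seg = np.sin(2 * np.pi * t20)        # recognised segment
cands = [np.cos(2 * np.pi * t20),    # unit 0: different shape
         np.sin(2 * np.pi * t30)]    # unit 1: same shape, other tempo
best = select_representative(seg, cands)
```

DTW absorbs the tempo difference, so the same-shaped unit wins even though its length differs.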
50.
Voice conversion using ALISP: results
[Audio examples on the NIST and BREF databases: source, converted result and target voices (female and male speakers)]
51.
Demonstration of Voice Conversion
[Audio examples: impostor voice; converted voice with GMM; converted voice with ALISP; converted voice with ALISP+GMM; target voice]
52.
3D reconstruction
• 3D face modeling from a front and a profile shot
• Animated face
• https://picoforge.int-evry.fr/cgi-bin/twiki/view/Myblog3d/Web/Demos
53.
Face Transformation
– Control point selection (Figure 1: control point selection)
– Image segmentation (Figure 2: division of an image)
– Linear transformation between source and target image
– Blending step (source → target)
54.
Face Transformation
■ Pipeline: localisation of control points → warping → blending
■ Warping: X′ = f(X)
■ Blending: p = αp + (1 − α)p′
(source → target)
55.
Face transformation (IBM)
56.
Ouisper1 - Silent Speech Interface
■ Sensor-based system allowing speech communication via
standard articulators, but without glottal activity
■ Two distinct types of application
– alternative to tracheo-oesophageal speech (TES) for persons
having undergone a tracheotomy
– a "silent telephone" for use in situations where quiet must be
maintained, or for communication in very noisy environments
■ Speech Synthesis from ultrasound
and optical imagery of the tongue and lips
1) Oral Ultrasound synthetIc SPEech souRce
57.
Ouisper - System Overview
[System overview. TRAINING: ultrasound video of the vocal tract, optical video of the speaker's lips and recorded audio are aligned with the text; visual feature extraction builds an audio-visual speech corpus. TEST: visual data → visual speech recognizer → N-best phonetic or ALISP targets → visual unit selection → audio unit concatenation]
58.
Ouisper - Training Data
59.
Ouisper - Video Stream Coding
T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, M. Stone, “Eigentongue Feature Extraction For An Ultrasound-based Silent Speech Interface,” IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, 2007.
■ Build a subset of typical frames
■ Perform PCA
■ Code new frames with their projections onto the set of eigenvectors
60.
Ouisper - Audio Stream Coding
■ ALISP segmentation
– detection of quasi-stationary parts in the parametric representation of speech
– assignment of segments to classes using unsupervised classification techniques
■ Phonetic segmentation
– forced alignment of speech with the text
– needs a relevant and correct phonetic transcription of the uttered signal
■ Corpus-based synthesis
– needs a preliminary segmental description of the signal
61.
Audiovisual dictionary building
■ Visual and acoustic data are synchronously recorded
■ Audio segmentation is used to bootstrap the visual speech recognizer
■ Train an HMM model for each phonetic class (e.g. /e-r/, /a-j/, /u-th/)
→ Audiovisual dictionary
62.
Visuo-acoustic decoding
■ Visual speech recognition
– Train HMM model for each visual class
• Use multistream-based learning techniques
– Perform a « visuo-phonetic » decoding step
• Use N-Best list
• Introduce linguistic constraints
– Language model
– Dictionary
– Multigrams
■ Corpus-based speech synthesis
– Combine probabilistic and data-driven approaches in the
audiovisual unit-selection step.
63.
Speech recognition from
video-only data
Ref: ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh (“Open your book to the first page”)
Rec: ax w ih y uh r b uh k sh uw dh ax v er s p ey jh (“A wear your book shoe the verse page”)
Corpus-based synthesis driven by the predicted phonetic lattice is currently under study
64.
Ouisper - Conclusion
■ More information on
– http://www.neurones.espci.fr/ouisper/
■ Contacts
– gerard.chollet@enst.fr
– denby@ieee.org
– hueber@ieee.org
65.
Audio-Visual Speech Processing
Conclusions and Perspectives
■ A talking face is more intelligible, expressive,
recognisable, and attractive than acoustic speech
alone.
■ The combined use of facial and speech
information improves identity verification and
robustness to forgeries.
■ Multi-stream models of the synchrony of visual
and acoustic information have applications in
the analysis, coding, recognition and synthesis
of talking faces.