Audio-Visual Speech Processing
Gérard Chollet
with Meriem Bendris, Hervé Bredin, Thomas Hueber,
Walid Karam, Rémi Landais, Patrick Perrot,
Eduardo Sanchez-Soto, Leila Zouari
ATSIP, Sousse, March 18th 2014
Page 2 ATSIP, Sousse, May 18th, 2014
Some motivations…
■  A talking face is more intelligible, expressive,
recognisable and attractive than acoustic speech
alone.
■  The combined use of facial and speech
information improves identity verification and
robustness to forgeries.
■  Multi-stream models of the synchrony of visual
and acoustic information have applications in
the analysis, coding, recognition and synthesis
of talking faces.
■  SmartPhones, VisioPhones, WebPhones,
SecurePhones, Visio Conferences, Virtual
Reality worlds are gaining popularity.
Page 3
Some topics under study…
■  Audio-visual speech recognition
–  Automatic ‘lip-reading’
■  Audio-visual speaker verification
–  Detection of forgeries
■  Speech driven animation of the face
–  Could we look and sound like somebody else?
■  Speaker indexing
–  ‘Who is talking in a video sequence?’
■  OUISPER : a silent speech interface
–  Corpus based synthesis from tongue and lips
Page 4
Audio Visual Speech Recognition
[Diagram: feature extraction → decoder, with acoustic models, dictionary and grammar]
Page 5
Video Mike (IBM, 2004)
Page 6
Audio processing
■  Feature extraction
■  Digit detection
■  Digit recognition:
•  Acoustic parameters: MFCC
•  Context-independent HMMs
•  Decoding: time-synchronous algorithm
■  Sound effect
–  Noise: babble
■  Recognition experiments
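The MFCC front-end listed above can be sketched as follows. This is an illustrative sketch only: the frame size, hop, FFT length and filter counts below are common defaults, not necessarily the values used in these experiments.

```python
import numpy as np
from scipy.fftpack import dct

def mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def inv_mel(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, frame=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    # Frame the signal, window it, take the power spectrum
    n_frames = 1 + (len(signal) - frame) // hop
    idx = np.arange(frame)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Log mel-filterbank energies, then DCT -> cepstral coefficients
    logmel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_ceps]
```

A 1 s signal at 16 kHz yields 98 frames of 13 coefficients with these settings.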
Page 7
Video processing
■  Video extraction
■  Lip localisation
■  Image interpolation (same frequency as speech)
■  Feature extraction
•  DCT and DCT2 (DCT + LDA)
•  Projections: PRO and PRO2 (PRO + LDA)
■  Recognition experiments
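A minimal sketch of the DCT visual features (the step before the LDA of DCT2): take the 2-D DCT of the lip region and keep the low-frequency block. The ROI size and the number of retained coefficients are assumptions for illustration.

```python
import numpy as np
from scipy.fftpack import dct

def dct_features(roi, k=6):
    """2-D DCT of a grayscale lip region of interest; keep the k x k
    low-frequency block as the visual feature vector."""
    d = dct(dct(roi, axis=0, norm="ortho"), axis=1, norm="ortho")
    return d[:k, :k].ravel()
```

With k = 6 this gives, e.g., a 36-dimensional vector per video frame.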
Page 8
Fusion techniques
q  Parameters fusion :
• Concatenation
•  Dimension decrease : Linear Discriminant Analysis (LDA)
•  Modelisation : classical HMM with one stream
q  Scores fusion : Multi-stream HMM
Page 9
Experimental results :
feature fusion
[Figure: word accuracy (%) versus S/N (dB, from -15 to +10) for speech only, video only (PRO2, DCT2) and audio-visual fusion (PRO2, DCT2)]
Page 10
Experimental results :
score fusion at -5 dB
[Figure: accuracy (%, range 42 to 52) for speech only and audio-visual multi-stream fusion with PRO, PRO2, DCT and DCT2 visual features]
Page 11
Audiovisual identity verification
■  Fusion of face and speech for identity verification
■  Detection of possible forgeries
■  Compulsory? For:
–  Homeland/corporate security: restricted access, …
–  Secured computer login
–  Secured on-line signature of contracts
Page 12
Talking-face and
2D face sequence database
■  Data: video sequences (.avi) in which a short phrase in English is
pronounced; duration ≈ 10 s (actual speech duration ≈ 2 s)
■  Audio-video data used for talking faces evaluations
■  Same sequences used for 2D face from video sequences evaluations
■  430 subjects pronounced 4 phrases:
–  from a set of 430 English phrases
–  2 indoor video files acquired during the first session
–  2 outdoor video files acquired during the second session
–  realistic forgeries created a posteriori
Page 13
Audio-Visual Speech Features
■  Visual features: raw pixel values, DCT transform, shape-related, many others …
■  Audio features: raw amplitude, « classical » MFCC coefficients, many others …
Page 14
Audio-Visual Subspaces
■  Reduced audiovisual subspace: Principal Component & Linear Discriminant Analysis
■  Correlated audio & visual subspaces: Co-inertia & Canonical Correlation Analysis
Page 15
Correspondence Measures
■  Audiovisual subspace: Gaussian Mixture Models, Neural Networks, Coupled HMM
■  Correlated subspaces: Correlation, Mutual Information
Page 16
Application to indexing
■  High-level requests
–  “Find videos where John Doe is speaking”
–  “Find dialogues between Mr X and Mrs Y”
–  “Locate the singer in this music video”
■  Correlation between raw audio energy and raw pixel values
Page 17
Who is speaking?
■  Face tracking
■  Correlation
–  Pixel of each face
–  Raw audio energy
■  Find maximum synchrony
Green: current speaker
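A minimal sketch of this synchrony test: correlate each tracked face's pixel variation with the audio energy envelope and pick the maximum. The features here (frame-to-frame pixel change, 1-D energy envelope) are illustrative stand-ins for the system's actual features.

```python
import numpy as np

def active_speaker(audio_energy, face_pixels):
    """Return the index of the face whose pixel-intensity variation is
    most correlated with the audio energy envelope."""
    scores = []
    for pixels in face_pixels:                    # pixels: (frames, h*w)
        motion = np.abs(np.diff(pixels, axis=0)).mean(axis=1)
        scores.append(np.corrcoef(motion, audio_energy[1:])[0, 1])
    return int(np.argmax(scores))
```

The face with maximum audio-visual synchrony is labelled as the current speaker (shown in green on the slide).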
Page 18
How to Perform “Talking-Face” Authentication?
■  Face recognition and speaker verification scores are fused.
■  What if a deliberate imposture makes both modalities answer “OK”?
Page 19
Biometrics
■  Identity Verification with Talking Faces
–  Speaker Verification
–  Face Recognition
■  What if?
–  Face: OK, Voice: OK … yet the claimed identity is false
Page 20
Identity Verification
■  Enrolment of client λ: a synchrony model is trained by Co-Inertia Analysis
■  Test: a person ε pretending to be client λ is accepted if the score exceeds a threshold, rejected otherwise
■  Equal Error Rate: 30 %
Page 21
Replay Attacks Detection
■  Training: a synchrony model is learned (Co-IA, CCA)
■  Test: the access is accepted if the synchrony score exceeds a threshold, rejected otherwise
Page 22
Replay Attacks Detection
Genuine synchronized video Audio replay attack
Lips do not match audio perfectly
Equal Error Rate: 14 %
Page 23
Example of Replay attacks
Page 24
[Figure: correlation as a function of audio-visual offset, from -5 (delayed video) to +5 (delayed audio) frames; alignment by maximum correlation]
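Alignment by maximum correlation can be sketched as a search over a small window of frame offsets; the ±5 range matches the figure, and the two 1-D feature streams are hypothetical envelopes.

```python
import numpy as np

def best_lag(audio_feat, video_feat, max_lag=5):
    """Frame offset in [-max_lag, +max_lag] that maximises the
    correlation between two 1-D feature streams (0 = in sync)."""
    def corr_at(k):
        a = audio_feat[max(k, 0):len(audio_feat) + min(k, 0)]
        v = video_feat[max(-k, 0):len(video_feat) + min(-k, 0)]
        return np.corrcoef(a, v)[0, 1]
    return max(range(-max_lag, max_lag + 1), key=corr_at)
```

A non-zero best lag on a test recording is evidence of a replay or dubbing attack.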
Page 25
Audiovisual identity verification
■  Available features
–  Face: facial features (lips, eyes) → face modality
–  Speech → speech modality
–  Speech synchrony → synchrony modality
Page 26
Audiovisual identity verification
■  Face modality
–  Detection:
•  Generative models (MPT toolbox)
•  Temporal median filtering
•  Eye detection within faces
–  Normalization: geometry + illumination
Page 27
Audiovisual identity verification
■  Face Modality:
–  Two verification strategies and one single
comparison framework
•  Global = Eigenfaces:
–  Calculation of a set of directions (eigenfaces) defining a projection space
–  Two faces are compared via their projections onto the eigenface space
–  Learning data: BIOMET (130 pers.) + BANCA (30 pers.)
Page 28
Audiovisual identity verification
■  Face Modality:
•  SIFT descriptors:
–  Keypoint extraction
–  Keypoint representation: a 128-dimensional SIFT descriptor (gradient orientation histograms, …) + a 4-dimensional position vector (x, y, scale, orientation)
Page 29
Audiovisual identity verification
■  Face Modality:
•  SVD-based matching method:
–  Compare two videos V1 and V2
–  Exclusive principle: One-to-one correspondences
between
»  Faces (global)
»  Descriptors (local)
–  Principle:
»  Proximity matrix computation between faces or
descriptors
»  Extraction of good pairings (made easy by SVD
computation)
–  Scores:
»  One matching score between global
representations
»  One matching score between local representations
Page 30
Variability!
Page 31
Audiovisual identity verification
■  Speech Modality:
–  GMM-based approach:
•  One world model (universal background model)
•  Each speaker model is derived from the world model by MAP adaptation
•  Speech verification score derived from the likelihood ratio
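A simplified sketch of this GMM-UBM recipe with mean-only MAP adaptation, using scikit-learn. The component count, relevance factor and synthetic data are illustrative assumptions, not the system's actual configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# World (universal background) model trained on pooled background frames
world = GaussianMixture(n_components=4, random_state=0).fit(
    rng.normal(size=(500, 13)))

def map_adapt_means(ubm, data, r=16.0):
    """Mean-only MAP adaptation of the UBM towards client data
    (relevance factor r), as in the classical GMM-UBM recipe."""
    post = ubm.predict_proba(data)                    # (frames, components)
    n = post.sum(axis=0)
    ex = (post.T @ data) / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + r))[:, None]
    client = GaussianMixture(n_components=ubm.n_components)
    client.weights_ = ubm.weights_
    client.covariances_ = ubm.covariances_
    client.precisions_cholesky_ = ubm.precisions_cholesky_
    client.means_ = alpha * ex + (1.0 - alpha) * ubm.means_
    return client

def llr(client, ubm, test):
    # Average per-frame log-likelihood ratio = verification score
    return client.score(test) - ubm.score(test)

enroll = rng.normal(size=(200, 13)) + 0.5             # client enrolment data
model = map_adapt_means(world, enroll)
```

The score is thresholded to accept or reject the claimed identity.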
Page 32
Audiovisual identity verification
■  Synchrony Modality:
–  Principle: synchrony between lips and
speech carries identity information
–  Process:
•  Computation of a synchrony model (CoIA
analysis) for each person based on DCT
(visual signal) and MFCC (speech signal)
•  Comparison of the test sample with the
synchrony model
Page 33
Audiovisual identity verification
■  Experiments:
–  BANCA database:
•  52 persons divided into two groups (G1 and G2)
•  3 recording conditions
•  8 recordings per person (4 client accesses, 4 impostor accesses)
•  Evaluation based on the P protocol: 234 client accesses and 312 impostor accesses
–  Scores:
•  4 scores per access (PCA face, SIFT face, speech, synchrony)
•  Score fusion based on an RBF-SVM: hyperplane learned on G1 and tested on G2, and conversely
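The RBF-SVM score fusion can be sketched on synthetic per-modality scores; the class separation and the simple half/half split below are illustrative stand-ins for the actual BANCA G1/G2 protocol.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def access_scores(n, shift):
    # 4 scores per access: PCA face, SIFT face, speech, synchrony
    return rng.normal(loc=shift, size=(n, 4))

X = np.vstack([access_scores(150, 1.0),     # client accesses
               access_scores(150, -1.0)])   # impostor accesses
y = np.array([1] * 150 + [0] * 150)

# Learn the fusion rule on one group, evaluate on the other
svm = SVC(kernel="rbf").fit(X[::2], y[::2])
acc = svm.score(X[1::2], y[1::2])
```

The non-linear decision boundary lets a weak modality (e.g. synchrony) still contribute when the strong ones disagree.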
Page 34
Audiovisual identity verification
■  Experiments:
Page 35
SecurePhone
■  A technical solution that improves security
■  Biometric recognition
–  Makes use of VOICE, FACE and SIGNATURE
■  Electronic signature used to secure information exchange
Page 36
Biometrics in SecurePhone
■  Operation: face, voice and written signature are each pre-processed and modelled; the three scores are fused, and access is granted or denied
Page 37
The BioSecure Multimodal Evaluation Campaign
■  Launched in April 2007
■  Many modalities including ‘Video sequences’ and
‘Talking Faces’
■  Development data and reference systems available
■  Evaluations on the sequestered BioSecure database
(1000 clients)
■  Debriefing workshop
■  More info on :
http://www.int-evry.fr/biometrics/BMEC2007/index.php
Page 38
Audio-visual forgery scenarios
■  Low-effort
–  “Paparazzi” scenario
•  The impostor owns a picture of the face and a recording of the voice of the target
–  “Big Brother” scenario
•  The impostor owns a video of the face and a recording of the voice of the target
■  High-effort
–  “Imitator” scenario
•  The impostor owns a recording of the voice of the target and transforms his own voice to sound like the target
–  “Playback” scenario
•  The impostor owns a picture of the face of the target and animates it according to his own face motion
–  “Ventriloquist” scenario
•  Combines the two previous ones
Page 39
Detection of imposture
Face modality:
ACCEPTED!
Voice modality:
ACCEPTED!
Synchronisation:
DENIED!
Page 40
Audio replay + “random” face
Talking-Face forgeries @ BMEC
Audio replay attack
■  Assumptions
–  Forger has recorded speech data from the genuine user in outdoor (test) conditions
–  Forger replays the audio and uses his own face in front of the sensor
[Stolen wave → audio replay + forger’s face]
Page 41
CRAZY TALK Face animation + TTS
Talking-Face forgeries @ BMEC
Replay attack
■  Assumptions
–  Forger has stolen a picture
–  Forger uses face animation software and TTS (male or female voice)
–  Forger plays back the animation to the sensor
[Stolen picture → contour detection → generated .avi]
Page 42
Picture presentation + TTS forgeries
Talking-Face forgeries @ BMEC
Replay attack
■  Assumptions
–  Forger has stolen a picture
–  Forger has printed the picture
–  Forger presents the picture to the sensor and uses TTS (same wave as for the face-animation forgery)
[Stolen picture → presented picture]
Page 43
Systems with fusion of
(face, speech)
[Diagram: the video sequence yields frames for face verification (face score) and the speech signal for speaker verification (speech score); both scores are combined into a fusion score]
Page 44
Voice Conversion methods
■  GMM conversion
–  Training of a joint Gaussian model
•  Parallel corpus of aligned sentences of both source and target voices
•  MFCC on HNM (Harmonic plus Noise Model) parameterization
–  Speech synthesis from the Gaussian model
•  Inversion of the MFCC
•  Pitch correction
■  ALISP conversion
–  Very low bit-rate speech compression (500 bps) method
•  Originally developed by TELECOM-ParisTech
–  Indexed segment dictionary system (of the target voice)
–  HNM parameterization
Page 45
Voice conversion techniques
Definition: the process of making one person’s voice (the “source”) sound like another person’s voice (the “target”)
[Diagram: source utterance “My name is John” → voice conversion → target-sounding “My name is John”]
Page 46
Principle of ALISP
■  Coder: spectral and prosodic analysis of the input speech; selection of segmental units from a dictionary of representative segments, yielding segment indices and prosodic parameters
■  Decoder: concatenative HNM synthesis from the same dictionary of representative segments, yielding the output speech
Page 47
Details of Encoding
■  Spectral and prosodic analysis of the input speech
■  HMM recognition against a dictionary of HMM models of ALISP classes → index of the ALISP class
■  Selection by DTW among the representative units of the class (e.g. synthesis units A1 … A8 for HMM A) → index of the synthesis unit
■  Prosodic encoding → pitch, energy, duration
Page 48
Details of decoding
■  Inputs: ALISP class index, synthesis-unit index within the class, prosodic parameters
■  The synthesis unit is loaded and concatenative synthesis produces the output speech
Page 49
Principle of Alisp conversion
Learning step (one hour of target voice):
- Parametric analysis: MFCC
- Segmentation based on temporal decomposition and vector quantization
- Stochastic modelling based on HMMs
- Creation of representative units
Conversion step:
- Parametric analysis: MFCC
- HMM recognition
- Selection of the representative segment via DTW
Synthesis step:
- Concatenation of the representative segments
- HNM synthesis
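The DTW-based selection of a representative segment can be sketched as follows (plain O(n·m) dynamic programming with a Euclidean local distance; the candidate units are illustrative).

```python
import numpy as np

def dtw_cost(a, b):
    """Dynamic-time-warping distance between two feature sequences
    (each row is one frame's feature vector)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def select_unit(query, candidates):
    # Pick the representative segment closest to the query under DTW
    return int(np.argmin([dtw_cost(query, c) for c in candidates]))
```

The selected unit's frames are then concatenated and resynthesised with HNM.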
Page 50
Voice conversion using ALISP:
results
[Audio examples: NIST database (female → female) and BREF database (female → male), with source, target and converted result]
Page 51
Demonstration of Voice Conversion
[Audio examples: impostor voice; conversion with GMM, with ALISP, with ALISP+GMM; target voice]
Page 52
3D reconstruction
•  3D face modeling from a front and a profile shot:
•  Animated face
•  https://picoforge.int-evry.fr/cgi-bin/twiki/view/
Myblog3d/Web/Demos
Page 53
Face Transformation
■  Control point selection (Figure 1)
■  Image segmentation (Figure 2: division of an image)
■  Linear transformation between the source and target images
■  Blending step
Page 54
Face Transformation
■  Localisation of control points on the source and target faces
■  Warping: X’ = f(X)
■  Blending: p = αp + (1 − α)p’
Page 55
Face transformation (IBM)
Page 56
Ouisper1 - Silent Speech Interface
■  Sensor-based system allowing speech communication via
standard articulators, but without glottal activity
■  Two distinct types of application
–  alternative to tracheo-oesophageal speech (TES) for persons
having undergone a tracheotomy
–  a "silent telephone" for use in situations where quiet must be
maintained, or for communication in very noisy environments
■  Speech Synthesis from ultrasound
and optical imagery of the tongue and lips
1) Oral Ultrasound synthetIc SPEech souRce
Page 57
Ouisper - System Overview
■  Training: ultrasound video of the vocal tract and optical video of the speaker’s lips are recorded together with audio; visual feature extraction and speech alignment with the text build an audio-visual speech corpus
■  Test: visual data → visual speech recognizer → N-best phonetic or ALISP targets → visual unit selection → audio unit concatenation
Page 58
Ouisper - Training Data
Page 59
Ouisper - Video Stream Coding
T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, M. Stone, “EigenTongue Feature Extraction for an Ultrasound-based Silent Speech Interface,” IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, 2007.
■  Build a subset of typical frames
■  Perform PCA → eigenvectors (“EigenTongues”)
■  Code new frames with their projections onto the set of eigenvectors
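The EigenTongue coding (PCA on a set of typical frames, then projection of new frames) can be sketched with scikit-learn; the frame size and number of components below are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
frames = rng.random((300, 64 * 64))       # flattened ultrasound frames
pca = PCA(n_components=30).fit(frames)    # eigenvectors = "EigenTongues"
code = pca.transform(frames[:1])          # 4096 pixels -> 30 coefficients
recon = pca.inverse_transform(code)       # approximate reconstruction
```

Each ultrasound (or lip) frame is thus summarised by a short coefficient vector fed to the visual speech recognizer.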
Page 60
Ouisper - Audio Stream Coding
■  ALISP segmentation
–  Detection of quasi-stationary parts in the parametric representation of speech
–  Assignment of segments to classes using unsupervised classification techniques
■  Phonetic segmentation
–  Forced alignment of speech with the text
–  Requires a relevant and correct phonetic transcription of the uttered signal
■  Corpus-based synthesis
–  Requires a preliminary segmental description of the signal
Page 61
Audiovisual dictionary building
■  Visual and acoustic data are synchronously recorded
■  Audio segmentation is used to bootstrap the visual speech recognizer
■  An HMM model is trained for each phonetic class (e.g. /e-r/, /a-j/, /u-th/) to build the audiovisual dictionary
Page 62
Visuo-acoustic decoding
■  Visual speech recognition
–  Train HMM model for each visual class
•  Use multistream-based learning techniques
–  Perform a « visuo-phonetic » decoding step
•  Use N-Best list
•  Introduce linguistic constraints
–  Language model
–  Dictionary
–  Multigrams
■  Corpus-based speech synthesis
–  Combine probabilistic and data-driven approaches in the
audiovisual unit selection step
Page 63
Speech recognition from
video-only data
Ref: ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh (“Open your book to the first page”)
Rec: ax w ih y uh r b uh k sh uw dh ax v er s p ey jh (“A wear your book shoe the verse page”)
Corpus-based synthesis driven by the predicted phonetic lattice is currently under study.
Page 64
Ouisper - Conclusion
■  More information on
–  http://www.neurones.espci.fr/ouisper/
■  Contacts
–  gerard.chollet@enst.fr
–  denby@ieee.org
–  hueber@ieee.org
Page 65
Audio-Visual Speech Processing
Conclusions and Perspectives
■  A talking face is more intelligible, expressive,
recognisable and attractive than acoustic speech
alone.
■  The combined use of facial and speech
information improves identity verification and
robustness to forgeries.
■  Multi-stream models of the synchrony of visual
and acoustic information have applications in
the analysis, coding, recognition and synthesis
of talking faces.
Page 66

 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Atsip avsp17

•  Projections: PRO and PRO2 (PRO+LDA)
■  Recognition experiments
Page 8: Fusion techniques
■  Parameter fusion:
•  Concatenation of the audio and video feature streams
•  Dimensionality reduction: Linear Discriminant Analysis (LDA)
•  Modelling: classical single-stream HMM
■  Score fusion: multi-stream HMM
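As a sketch of the parameter-fusion path, the hedged example below concatenates synthetic audio and video feature streams and reduces the result with LDA; the dimensions (39 acoustic, 30 visual) and the random data are illustrative assumptions, not the original setup:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical per-frame features for a 10-word vocabulary:
# 39-dim acoustic vectors (e.g. MFCC + deltas) and 30-dim visual vectors.
n_frames, n_classes = 500, 10
labels = rng.integers(0, n_classes, n_frames)
audio = rng.normal(size=(n_frames, 39)) + labels[:, None] * 0.1
video = rng.normal(size=(n_frames, 30)) + labels[:, None] * 0.1

# Parameter fusion: simple concatenation of the two streams.
fused = np.hstack([audio, video])            # shape (500, 69)

# Dimensionality reduction: LDA projects onto at most (n_classes - 1) axes.
lda = LinearDiscriminantAnalysis(n_components=n_classes - 1)
reduced = lda.fit_transform(fused, labels)   # shape (500, 9)
print(fused.shape, reduced.shape)            # (500, 69) (500, 9)
```

The reduced vectors would then feed a classical single-stream HMM.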
Page 9: Experimental results: parameter fusion
[Plot: word accuracy (%) against S/N (-15 to +10 dB) for speech only, video only (PRO2, DCT2) and audio-visual fusion (PRO2, DCT2)]
Page 10: Experimental results: score fusion at -5 dB
[Bar chart: accuracy (%) for speech only and for audio-visual fusion with PRO, PRO2, DCT and DCT2 visual features]
Page 11: Audiovisual identity verification
■  Fusion of face and speech for identity verification
■  Detection of possible forgeries
■  Needed for:
–  homeland/company security: restricted access,…
–  secured computer login
–  secured on-line signature of contracts
Page 12: Talking-face and 2D face sequence database
■  Data: video sequences (.avi) in which a short phrase in English is pronounced; duration ≈ 10 s (actual speech duration ≈ 2 s)
■  Audio-video data used for talking-face evaluations
■  The same sequences are used for 2D-face-from-video evaluations
■  430 subjects each pronounced 4 phrases:
–  drawn from a set of 430 English phrases
–  2 indoor video files acquired during the first session
–  2 outdoor video files acquired during the second session
–  realistic forgeries created a posteriori
Page 13: Audio-visual speech features
■  Visual: raw pixel values, DCT transform, shape-related features, many others
■  Audio: raw amplitude, « classical » MFCC coefficients, many others
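For illustration, a minimal sketch of a DCT-based visual feature, assuming a 32x32 grey-level mouth region of interest and keeping only a low-frequency block of coefficients (the ROI size and the 6x6 selection are illustrative choices):

```python
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(1)

# Hypothetical 32x32 grey-level mouth region of interest.
roi = rng.random((32, 32))

# 2-D DCT of the ROI; visual energy concentrates in low frequencies.
coeffs = dctn(roi, norm="ortho")

# Keep the top-left 6x6 block of low-frequency coefficients as the
# visual feature vector (36 values instead of 1024 pixels).
feature = coeffs[:6, :6].ravel()
print(feature.shape)   # (36,)
```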
Page 14: Audio-visual subspaces
■  Reduced audio-visual subspace: Principal Component and Linear Discriminant Analysis
■  Correlated audio and visual subspaces: Co-inertia and Canonical Correlation Analysis
Page 15: Correspondence measures
■  In the audio-visual subspace: Gaussian Mixture Models, Neural Networks, coupled HMMs
■  Between the correlated subspaces: correlation, mutual information
Page 16: Application to indexing
■  High-level requests:
–  “Find videos where John Doe is speaking”
–  “Find dialogues between Mr X and Mrs Y”
–  “Locate the singer in this music video”
■  Basic cues: raw audio energy, raw pixel values, correlation
Page 17: Who is speaking?
■  Face tracking
■  Correlation between the pixels of each face and the raw audio energy
■  The face with maximum synchrony is the current speaker (shown in green)
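The synchrony test above can be sketched as follows; the per-face "pixel activity" signals and the random data are toy stand-ins for real tracked faces:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200  # number of video frames

# Frame-level audio energy, and a pixel-variation signal for each of
# two tracked faces (e.g. mean absolute inter-frame pixel difference).
energy = np.abs(rng.normal(size=T))
face_a = 0.8 * energy + 0.2 * rng.normal(size=T)   # moves with the audio
face_b = np.abs(rng.normal(size=T))                # independent motion

def sync_score(pixels, audio):
    """Pearson correlation between pixel activity and audio energy."""
    return np.corrcoef(pixels, audio)[0, 1]

scores = {"face_a": sync_score(face_a, energy),
          "face_b": sync_score(face_b, energy)}
speaker = max(scores, key=scores.get)
print(speaker)   # face_a: the face with maximum audio-visual synchrony
```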
Page 18: How to perform “talking-face” authentication?
■  Face recognition + speaker verification + score fusion
■  What if a deliberate imposture passes both tests?
Page 19: Biometrics
■  Identity verification with talking faces:
–  speaker verification
–  face recognition
■  What if the face and the voice are both accepted while the access should be denied?
Page 20: Identity verification
■  Enrolment of client λ yields a model for client λ
■  A person ε claiming the identity of client λ is accepted if the score exceeds a threshold, rejected otherwise
■  Co-Inertia Analysis; Equal Error Rate: 30 %
Page 21: Replay attack detection
■  Training: a synchrony model is learned (Co-IA, CCA)
■  Test: accepted if the synchrony score exceeds a threshold, rejected otherwise
Page 22: Replay attack detection
■  Genuine synchronized video vs. audio replay attack
■  In a replay attack, the lips do not match the audio perfectly
■  Equal Error Rate: 14 %
Page 23: Example of replay attacks
Page 24: Alignment by maximum correlation
■  The audio and video streams are shifted against each other (e.g. lags from -5 to +5 frames); the lag maximising the correlation aligns the delayed stream
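A hedged sketch of this lag search, assuming frame-rate-matched streams and a toy visual signal delayed by three frames:

```python
import numpy as np

rng = np.random.default_rng(3)

# Audio energy and a visual activity signal delayed by 3 frames.
audio = rng.normal(size=300)
video = np.roll(audio, 3) + 0.1 * rng.normal(size=300)

def best_lag(a, v, max_lag=5):
    """Return the lag in [-max_lag, +max_lag] maximising the correlation."""
    def corr_at(lag):
        shifted = np.roll(v, -lag)  # undo a delay of `lag` frames
        return np.corrcoef(a, shifted)[0, 1]
    return max(range(-max_lag, max_lag + 1), key=corr_at)

print(best_lag(audio, video))   # 3: the video lags the audio by 3 frames
```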
Page 25: Audiovisual identity verification
■  Available features:
–  face features (lips, eyes) → face modality
–  speech → speech modality
–  speech synchrony → synchrony modality
Page 26: Audiovisual identity verification
■  Face modality
–  Detection:
•  generative models (MPT toolbox)
•  temporal median filtering
•  eye detection within faces
–  Normalization: geometry + illumination
Page 27: Audiovisual identity verification
■  Face modality: two verification strategies and one single comparison framework
•  Global = eigenfaces:
–  calculation of a set of directions (eigenfaces) defining a projection space
–  two faces are compared through their projections onto the eigenface space
–  learning data: BIOMET (130 pers.) + BANCA (30 pers.)
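A minimal eigenface sketch with random stand-in images (the 24x24 size and the 20 retained directions are illustrative choices, not the values used in the system):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical training set: 100 face images of 24x24 pixels, flattened.
faces = rng.random((100, 24 * 24))

# Eigenfaces: PCA on mean-centred faces via SVD.
mean_face = faces.mean(axis=0)
centred = faces - mean_face
_, _, vt = np.linalg.svd(centred, full_matrices=False)
eigenfaces = vt[:20]                      # keep the 20 leading directions

def project(face):
    """Coordinates of a face in the eigenface space."""
    return eigenfaces @ (face - mean_face)

# Two faces are compared by the distance between their projections.
d = np.linalg.norm(project(faces[0]) - project(faces[1]))
print(project(faces[0]).shape)   # (20,)
```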
Page 28: Audiovisual identity verification
■  Face modality
•  SIFT descriptors:
–  keypoint extraction
–  keypoint representation: a 128-dimensional vector (gradient orientation histograms,…) + a 4-dimensional position vector (position (x, y) + scale + orientation)
Page 29: Audiovisual identity verification
■  Face modality
•  SVD-based matching method:
–  compares two videos V1 and V2
–  exclusive principle: one-to-one correspondences between faces (global) and descriptors (local)
–  principle: computation of a proximity matrix between faces or descriptors, then extraction of good pairings (made easy by an SVD computation)
–  scores: one matching score between global representations, one matching score between local representations
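The pairing-by-SVD idea can be sketched in the spirit of the Scott and Longuet-Higgins method: replace the singular values of the proximity matrix by ones, then keep the entries that dominate both their row and their column. The toy 2x2 matrix below is purely illustrative:

```python
import numpy as np

def svd_pairings(proximity):
    """One-to-one pairings from a proximity matrix G: set the singular
    values of G to one, then keep entries of P = U @ Vt that are maximal
    in both their row and their column."""
    u, _, vt = np.linalg.svd(proximity, full_matrices=False)
    p = u @ vt
    return [(i, j)
            for i in range(p.shape[0]) for j in range(p.shape[1])
            if p[i, j] == p[i, :].max() and p[i, j] == p[:, j].max()]

# Toy proximity matrix between descriptors of two videos: descriptor 0
# of V1 resembles descriptor 1 of V2, and vice versa.
g = np.array([[0.1, 0.9],
              [0.8, 0.2]])
print(svd_pairings(g))   # [(0, 1), (1, 0)]
```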
Page 30: Variability!
Page 31: Audiovisual identity verification
■  Speech modality: GMM-based approach
•  one world model
•  each speaker model is derived from the world model by MAP adaptation
•  the speech verification score is derived from the likelihood ratio
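A hedged sketch of the likelihood-ratio scoring; for simplicity the client model is trained directly on enrolment data rather than MAP-adapted from the world model, and all data are synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)

# Hypothetical MFCC-like frames: a background ("world") population and
# one client whose features are shifted.
world_data = rng.normal(0.0, 1.0, size=(1000, 12))
client_data = rng.normal(0.5, 1.0, size=(300, 12))

world = GaussianMixture(n_components=4, random_state=0).fit(world_data)
# In the real system the client model is MAP-adapted from the world
# model; here it is simply trained on the client's enrolment data.
client = GaussianMixture(n_components=4, random_state=0).fit(client_data)

def llr(frames):
    """Average log-likelihood ratio: client model vs. world model."""
    return client.score(frames) - world.score(frames)

genuine = llr(rng.normal(0.5, 1.0, size=(100, 12)))
impostor = llr(rng.normal(0.0, 1.0, size=(100, 12)))
print(genuine > impostor)   # True: the genuine access scores higher
```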
Page 32: Audiovisual identity verification
■  Synchrony modality
–  Principle: the synchrony between lips and speech carries identity information
–  Process:
•  computation of a synchrony model (Co-Inertia Analysis) for each person, based on DCT (visual signal) and MFCC (speech signal) features
•  comparison of the test sample with the synchrony model
Page 33: Audiovisual identity verification
■  Experiments
–  BANCA database:
•  52 persons divided into two groups (G1 and G2)
•  3 recording conditions
•  8 recordings per person (4 client accesses, 4 impostor accesses)
•  evaluation based on the P protocol: 234 client accesses and 312 impostor accesses
–  Scores:
•  4 scores per access (PCA face, SIFT face, speech, synchrony)
•  score fusion based on an RBF-SVM: the hyperplane is learned on G1 and tested on G2, and conversely
Page 34: Audiovisual identity verification
■  Experiments
Page 35: SecurePhone
■  A technical solution that improves security
■  Biometric recognition making use of VOICE, FACE and SIGNATURE
■  An electronic signature used to secure information exchange
Page 36: Biometrics in SecurePhone
■  Operation: face, voice and written signature are each pre-processed and modelled; the scores are fused and access is granted or denied
Page 37: The BioSecure Multimodal Evaluation Campaign
■  Launched in April 2007
■  Many modalities, including ‘Video sequences’ and ‘Talking Faces’
■  Development data and reference systems available
■  Evaluations on the sequestered BioSecure database (1000 clients)
■  Debriefing workshop
■  More info on: http://www.int-evry.fr/biometrics/BMEC2007/index.php
Page 38: Audio-visual forgery scenarios
■  Low-effort
–  “Paparazzi” scenario: the impostor owns a picture of the face and a recording of the voice of the target
–  “Big Brother” scenario: the impostor owns a video of the face and a recording of the voice of the target
■  High-effort
–  “Imitator” scenario: the impostor owns a recording of the voice of the target and transforms his own voice to sound like the target
–  “Playback” scenario: the impostor owns a picture of the face of the target and animates it according to his own face motion
–  “Ventriloquist” scenario: combines the two previous ones
Page 39: Detection of imposture
■  Face modality: ACCEPTED; voice modality: ACCEPTED; synchronisation: DENIED!
Page 40: Talking-face forgeries @ BMEC: audio replay attack
■  Assumptions:
–  the forger has recorded speech data from the genuine user in outdoor (test) conditions
–  the forger replays the audio and uses his own face in front of the sensor
■  Variants: stolen wave; audio replay + forger face; audio replay + “random” face
Page 41: Talking-face forgeries @ BMEC: replay attack (face animation + TTS)
■  Assumptions:
–  the forger has stolen a picture
–  the forger uses face animation software (e.g. CrazyTalk) and TTS (male or female)
–  the forger plays back the animation to the sensor
■  Pipeline: stolen picture → contour detection → generated avi
Page 42: Talking-face forgeries @ BMEC: replay attack (picture presentation + TTS)
■  Assumptions:
–  the forger has stolen a picture and printed it
–  the forger presents the picture to the sensor and uses TTS (same wave as for the face-animation forgery)
Page 43: Systems with fusion of (face, speech)
■  The frames of the video sequence feed face verification (face score); the speech signal feeds speaker verification (speech score); the two are combined into a fusion score
Page 44: Voice conversion methods
■  GMM conversion
–  Training of a joint Gaussian model:
•  a parallel corpus of aligned sentences of both the source and target voices
•  MFCC on HNM (Harmonic plus Noise Model) parameterization
–  Speech synthesis from the Gaussian model:
•  inversion of the MFCC
•  pitch correction
■  ALISP conversion
–  A very low bit-rate speech compression (500 bps) method, originally developed by TELECOM-ParisTech
–  A dictionary of indexed segments (of the target voice)
–  HNM parameterization
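The GMM conversion function can be sketched as the classic posterior-weighted conditional-mean regression on a joint source-target model; the parallel corpus below is synthetic, and the affine source-to-target relation is an assumption made for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)

# Hypothetical parallel corpus: aligned source/target spectral vectors
# (stand-ins for MFCC frames); here the "target voice" is an affine map
# of the source plus a little noise.
src = rng.normal(size=(2000, 4))
tgt = (src @ np.diag([1.2, 0.8, 1.1, 0.9]) + 0.5
       + 0.05 * rng.normal(size=(2000, 4)))

# Joint Gaussian mixture trained on stacked [source, target] vectors.
joint = GaussianMixture(n_components=2, covariance_type="full",
                        random_state=0).fit(np.hstack([src, tgt]))

def posteriors(x):
    """Component posteriors p(k | x) under the marginal source model."""
    d = x.shape[0]
    p = np.array([joint.weights_[k] * multivariate_normal.pdf(
                      x, joint.means_[k][:d], joint.covariances_[k][:d, :d])
                  for k in range(joint.n_components)])
    return p / p.sum()

def convert(x):
    """Posterior-weighted sum of per-component conditional means E[y|x]."""
    d = x.shape[0]
    out = np.zeros(d)
    for k, resp in enumerate(posteriors(x)):
        mu, cov = joint.means_[k], joint.covariances_[k]
        out += resp * (mu[d:] + cov[d:, :d]
                       @ np.linalg.solve(cov[:d, :d], x - mu[:d]))
    return out

err = np.linalg.norm(convert(src[0]) - tgt[0])
print(err < 0.5)   # the converted frame lands close to the true target
```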
Page 45: Voice conversion techniques
■  Definition: the process of making one person's voice (the source) sound like another person's voice (the target)
■  Example: the source and the target both say “My name is John”; the converted source utterance sounds like the target
Page 46: Principle of ALISP
■  Coder: spectral and prosodic analysis of the input speech; selection of segmental units from a dictionary of representative segments; the segment index and the prosodic parameters are transmitted
■  Decoder: concatenative synthesis (HNM) from the same dictionary produces the output speech
Page 47: Details of encoding
■  Spectral and prosodic analysis of the speech
■  HMM recognition against a dictionary of HMM models of ALISP classes
■  Selection by DTW of a representative unit of the class (e.g. synthesis units A1…A8 for HMM A)
■  Prosodic encoding
■  Transmitted: index of the ALISP class, index of the synthesis unit, pitch, energy, duration
Page 48: Details of decoding
■  From the ALISP index, the synthesis-unit index within the class and the prosodic parameters, the synthesis unit is loaded and concatenative synthesis produces the output speech
Page 49: Principle of ALISP conversion
■  Learning step (one hour of target voice):
–  parametric analysis: MFCC
–  segmentation based on temporal decomposition and vector quantization
–  stochastic modelling based on HMMs
–  creation of representative units
■  Conversion step:
–  parametric analysis: MFCC
–  HMM recognition
–  selection of the representative segment by DTW
■  Synthesis step:
–  concatenation of the representative units
–  HNM synthesis
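The DTW selection step can be sketched as follows, with toy one-dimensional trajectories standing in for spectral sequences:

```python
import numpy as np

def dtw_cost(a, b):
    """Dynamic time warping cost between two feature sequences
    (rows = frames), with the usual insert/delete/match recursion."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

# Toy 1-D "spectral" trajectories: the segment to encode and two
# candidate representative units of the recognised ALISP class.
segment = np.array([[0.0], [1.0], [2.0], [1.0]])
unit_a = np.array([[0.1], [0.9], [2.1], [1.1]])      # similar shape
unit_b = np.array([[2.0], [2.0], [0.0], [0.0]])      # different shape

best = min([unit_a, unit_b], key=lambda u: dtw_cost(segment, u))
print(best is unit_a)   # True: DTW selects the closer representative
```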
Page 50: Voice conversion using ALISP: results
■  Examples (source, target, result) on the NIST and BREF databases, with female and male voices
Page 51: Demonstration of voice conversion
■  Impostor voice; voice converted with GMM; with ALISP; with ALISP+GMM; target voice
Page 52: 3D reconstruction
■  3D face modelling from a front and a profile shot
■  Animated face
■  https://picoforge.int-evry.fr/cgi-bin/twiki/view/Myblog3d/Web/Demos
Page 53: Face transformation
■  Control point selection (Figure 1)
■  Image segmentation (Figure 2: division of an image)
■  Linear transformation between the source and target images
■  Blending step
Page 54: Face transformation
■  Localisation of control points
■  Warping: X' = f(X)
■  Blending: p = αp + (1 - α)p'
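A minimal sketch of the warping and blending steps; the integer-translation warp is a deliberately crude stand-in for the control-point mapping X' = f(X):

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy source and target "face" images (grey levels in [0, 1]).
source = rng.random((16, 16))
target = rng.random((16, 16))

def warp_translate(img, dx, dy):
    """Minimal stand-in for X' = f(X): an integer translation moving
    each source pixel towards the corresponding target control point."""
    out = np.zeros_like(img)
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ys2, xs2 = np.clip(ys + dy, 0, h - 1), np.clip(xs + dx, 0, w - 1)
    out[ys2, xs2] = img[ys, xs]
    return out

# Warping step, then blending step: p = alpha * p + (1 - alpha) * p'.
alpha = 0.7
warped = warp_translate(source, dx=1, dy=0)
blended = alpha * warped + (1 - alpha) * target

print(blended.shape)   # (16, 16)
```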
Page 55: Face transformation (IBM)
Page 56: Ouisper (1): Silent Speech Interface
■  A sensor-based system allowing speech communication via the standard articulators, but without glottal activity
■  Two distinct types of application:
–  an alternative to tracheo-oesophageal speech (TES) for persons having undergone a tracheotomy
–  a “silent telephone” for use in situations where quiet must be maintained, or for communication in very noisy environments
■  Speech synthesis from ultrasound and optical imagery of the tongue and lips
(1) Oral Ultrasound synthetIc SPEech souRce
Page 57: Ouisper: system overview
■  Training: ultrasound video of the vocal tract, optical video of the speaker's lips and the recorded audio are aligned with the text; visual feature extraction builds an audio-visual speech corpus
■  Test: visual data → visual speech recognizer → N-best phonetic or ALISP targets → visual unit selection → audio unit concatenation
Page 58: Ouisper: training data
Page 59: Ouisper: video stream coding
■  Build a subset of typical frames; perform PCA; code new frames with their projections onto the set of eigenvectors
■  T. Hueber, G. Aversano, G. Chollet, B. Denby, G. Dreyfus, Y. Oussar, P. Roussel, M. Stone, “EigenTongue Feature Extraction For An Ultrasound-based Silent Speech Interface,” IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, USA, 2007.
Page 60: Ouisper: audio stream coding
■  ALISP segmentation:
–  detection of quasi-stationary parts in the parametric representation of speech
–  assignment of segments to classes using unsupervised classification techniques
■  Phonetic segmentation:
–  forced alignment of the speech with the text
–  requires a relevant and correct phonetic transcription of the uttered signal
■  Corpus-based synthesis requires a preliminary segmental description of the signal
Page 61: Audiovisual dictionary building
■  Visual and acoustic data are synchronously recorded
■  The audio segmentation is used to bootstrap the visual speech recognizer
■  An HMM model is trained for each phonetic class (e.g. /e-r/, /a-j/, /u-th/), yielding the audiovisual dictionary
Page 62: Visuo-acoustic decoding
■  Visual speech recognition:
–  train an HMM model for each visual class, using multistream-based learning techniques
–  perform a “visuo-phonetic” decoding step: use N-best lists and introduce linguistic constraints (language model, dictionary, multigrams)
■  Corpus-based speech synthesis: combine probabilistic and data-driven approaches in the audiovisual unit selection step
Page 63: Speech recognition from video-only data
■  Ref: ow p ax n y uh r b uh k t uw dh ax f er s t p ey jh (“Open your book to the first page”)
■  Rec: ax w ih y uh r b uh k sh uw dh ax v er s p ey jh (“A wear your book shoe the verse page”)
■  Corpus-based synthesis driven by the predicted phonetic lattice is currently under study
Page 64: Ouisper: conclusion
■  More information on http://www.neurones.espci.fr/ouisper/
■  Contacts: gerard.chollet@enst.fr, denby@ieee.org, hueber@ieee.org
Page 65: Audio-Visual Speech Processing: conclusions and perspectives
■  A talking face is more intelligible, expressive, recognisable and attractive than acoustic speech alone
■  The combined use of facial and speech information improves identity verification and robustness to forgeries
■  Multi-stream models of the synchrony of visual and acoustic information have applications in the analysis, coding, recognition and synthesis of talking faces
Page 66