Interactive Voice Con: Optimizing Voice Processing for Smart Speakers & Devices

Interactive Voice Con
Successful smart speakers &
voice-enabled products
Platinum Sponsor

Introducing the Speakers
Paul Beckmann
Founder and CTO of DSP Concepts
• PhD, MS, and BS from MIT, all in EE.
• Technical specialties: signal processing, audio product development, and tools.
Mike Klasco
Founder and CEO, Menlo Scientific
• Combined MS/PhD ABT NYU
• Audio product development, acoustics, transducers, materials and sourcing
2020 Interactive Voice Con

Outline
• Kickoff
• Voice Processing Theory [45 minutes]
• Algorithms
• Measuring performance
• Processor requirements
• Product design guidelines
• Break [15 minutes]
• Demos? [15 minutes]
• What Happens in Practice [30 minutes]
• Microphone integration issues
• The enclosure – the space between the mics and speakers
• Loudspeakers, acoustic
• Q&A [15 minutes]

Speech Recognition

Types of Voice Recognition Algorithms
• Voice trigger
• Identifies a single word or phrase like “Alexa” or “Hey Siri”
• Small vocabulary voice recognition
• Fixed vocabulary set for embedded applications. Tens of words.
• “Turn on the lights”, “Next track”, etc.
• Full voice recognition
• Large vocabulary set. Thousands of words.
• “Play Beatles”
• Natural language understanding (NLU)
• Combines application specific information for more flexible
user interface
• “Play Music by the Beatles”, “Give me Beatles Music”, “I want
to listen to music by the Beatles”
• Can be combined with small vocabulary set

Audio Front End = Microphone Cleaner
[Diagram: Mic Array (N channels) → Audio Front End → (1 channel) → Voice Recognition. The mic array picks up desired speech, interfering noise, and device playback.]
The Audio Front End (AFE) cleans up signals to improve the performance of the voice recognition. It is like glasses for a camera.

Audio Front End Details
[Diagram: Mic Array (N channels) → Audio Front End blocks → (1 channel) → Trigger Word & Voice Recognition]
• Echo Canceler: eliminates loudspeaker sound during device playback.
• Direction of Arrival: determines location of sound source. Used to steer beamformer.
• Beamformer: combines multiple microphone signals to improve signal quality.
• Noise Reduction: removes various types of noise.

Comparing Amazon and Google
Google: Google AFE and Trigger Word → ASR
• 2 microphones only
• 65 to 71 mm spacing
• Mono or stereo
• High-end application processor required
• No variation in products
• No variation in performance
• Performance lags behind AVS
Amazon: 3rd Party AFE → Amazon Trigger Word → ASR
• Any number of microphones
• Any spacing
• Any number of playback channels
• Application processor or MCU solutions
• Wide variety of designs
• 2 to 7 microphones
• Different form factors
• Better performance
• Low cost designs possible

AVS Integration for AWS IoT
• Cost effective way to add Alexa voice
features
• Connects to the cloud
• Uses an RTOS and lightweight MQTT
network stack
• Suitable for low cost microcontrollers
• Will expand voice to a much larger
number of products
https://docs.aws.amazon.com/iot/latest/developerguide/avs-integration-aws-iot.html
(a.k.a. “Alexa for Microcontrollers”)

Trigger Word
• Voice recognition algorithm trained for a single word or phrase
• “Alexa”, “OK Google”, “Bixby”, “Siri”, “Cortana”, etc.
• Available from multiple suppliers
• Amazon, Google, Baidu, etc.
• Sensory “Truly Handsfree”
• PicoVoice / SoundHound / Cyberon / etc.
• They all use machine learning
• Often optimized for low power consumption
• Sound → Voice Activity Detector → Key word detector
• Large models perform better
• Sensory: 17 kbyte → 1 Mbyte

Characterizing Trigger Performance
• Probability of False Alarm
• How many times does the algorithm
accidentally trigger over a 24-hour period?
• Probability of Miss
• What % of trigger words are not detected by the
algorithm
• Trigger word algorithms have an adjustable “sensitivity” setting that allows you to trade off false alarms against misses.
• Amazon requires <3 false alarms per 24 hours
of continuous speech
[Figure: ROC curve of Probability of Detection (up to 100%) versus False Alarm Rate, with the ideal operating point marked]
Tune sensitivity based on allowable false alarm rate
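Tuning sensitivity against a false alarm budget can be sketched as below. This is a minimal illustration, not any vendor's tuning procedure: `pick_threshold` is a hypothetical helper, the detector scores are synthetic, and the only figure taken from the slide is the budget of 3 false alarms per 24 hours.

```python
import numpy as np

def pick_threshold(neg_scores, pos_scores, hours, max_fa_per_24h=3.0):
    """Return the lowest threshold whose false alarms stay within budget."""
    neg = np.asarray(neg_scores, dtype=float)
    pos = np.asarray(pos_scores, dtype=float)
    allowed = int(max_fa_per_24h * hours / 24.0)  # false alarms we may accept
    ranked = np.sort(neg)[::-1]                   # highest non-trigger scores first
    # The detector fires when score > threshold, so sitting on the
    # (allowed+1)-th highest negative score lets at most `allowed` through.
    threshold = ranked[allowed] if allowed < ranked.size else -np.inf
    false_alarms = int(np.sum(neg > threshold))
    miss_rate = float(np.mean(pos <= threshold))
    return threshold, false_alarms, miss_rate

rng = np.random.default_rng(0)
neg = rng.normal(0.2, 0.1, size=5000)   # synthetic scores on non-trigger audio
pos = rng.normal(0.8, 0.1, size=200)    # synthetic scores on true trigger words
thr, fa, miss = pick_threshold(neg, pos, hours=24)
print(f"threshold={thr:.3f} false_alarms={fa} miss_rate={miss:.1%}")
```

Raising the allowed false alarm count lowers the threshold and reduces misses, which is the trade-off the sensitivity setting exposes.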

Wake Word Performance in Noise
SNR at microphone is main driver
of wake word performance
• Independent of distance
• Independent of room
reflections / reverb (for normal
household environments)
Improve your SNR to improve your wake
word performance.

Beamforming

Beamforming Principles
• Beamformers are spatial filters. They
pass signals from certain directions and
reduce signals from other directions.
• Performance depends heavily upon the
geometry of the microphone array
• Fixed beamformers utilize FIR filters
• Time domain or frequency domain
• There are many ways to compute the filter
coefficients (MVDR, DAS, etc.)
[Diagram: each microphone signal passes through its own FIR filter, h1[n] to h4[n], and the filter outputs are summed]
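A minimal filter-and-sum sketch using delay-and-sum (DAS) coefficients, the simplest of the design methods named above. The arrival delays, noise level, and helper names are assumptions for illustration; the per-microphone noise is independent here, which is more favorable than real diffuse-field noise.

```python
import numpy as np

def das_filters(delays_samples, length):
    """One FIR filter per mic: a pure delay scaled by 1/M (delay-and-sum)."""
    m = len(delays_samples)
    h = np.zeros((m, length))
    for i, d in enumerate(delays_samples):
        h[i, d] = 1.0 / m
    return h

def filter_and_sum(mics, h):
    """Filter each mic channel with its FIR filter and sum the outputs."""
    return sum(np.convolve(x, hi)[: mics.shape[1]] for x, hi in zip(mics, h))

def snr_db(sig, ref):
    """SNR of sig against the clean reference ref, in dB."""
    return 10 * np.log10(np.sum(ref**2) / np.sum((sig - ref) ** 2))

rng = np.random.default_rng(1)
n, m = 4000, 4
speech = rng.standard_normal(n)                 # stand-in for the target signal
arrival = [0, 2, 4, 6]                          # hypothetical arrival delays in samples
mics = np.stack([np.roll(speech, d) for d in arrival])
mics += 1.0 * rng.standard_normal((m, n))       # independent noise per mic

# Compensating delays align the target across channels before summing.
h = das_filters([max(arrival) - d for d in arrival], length=8)
out = filter_and_sum(mics, h)
target = np.roll(speech, max(arrival))

print(f"single mic SNR: {snr_db(mics[-1], target):.1f} dB")
print(f"beamformed SNR: {snr_db(out, target):.1f} dB")
```

Methods such as MVDR replace the pure delays with filters computed from a noise model, but the filter-and-sum structure stays the same.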

DSPC Design Method: Maximize SNR
• Inputs to design
• Microphone geometry
• Look angle and beam width
• Diffuse field noise level
• Microphone SNR
• Signal is person’s voice in specified beam
• Noise = diffuse field noise + microphone self
noise
• Iterative design procedure maximizes SNR

SNR vs. Frequency

Optimal Array Geometries
Far Field Products
• 180 or 360 Degree: smart speakers, middle of the room
• 180 Degree: set-top box, side of the room
• Flat Line Array: TVs, appliances, on a wall
Tiers shown: High-End, Standard, Low-Cost
Circular arrays: 40 to 70 mm diameter works. 70 mm works the best.
Line array: 25 mm spacing between mics, 75 mm total length.
[Figure: array geometries annotated with SNR gains of +7, +6.5, +5, +2, +3, +2, and +4 dB]

SNR vs. Mic Geometry
Assumptions:
• 71 mm diameter
• Microphone array is in
diffuse field noise with SNR
= 50 dB
• Speech is at 60 dB in the
direction of the beam
• Beam width is 45 degrees
• Microphone SNR = 65 dB
• Look angle = 0 degrees

Linear Arrays
• Linear arrays work well when in an end-fire
configuration.
• Requires person to be in a specified location.
• Provides 4 to 5 dB SNR improvement
• Broadside arrays work poorly and should be
avoided.
• Very little SNR improvement at low frequencies, where the bulk of speech energy is
• Use broadside arrays only as a last resort when the
industrial design dictates no other options
• Television
• Wall panel
[Figure: end-fire and broadside array orientations]
Intuition: beamformers use time differences to steer the beam. In broadside, voice arrives at the same time at both mics.
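The intuition can be checked with the far-field arrival-time difference between two mics, spacing × cos(angle) / speed of sound. The sketch below assumes a 71 mm spacing (within the range quoted in this deck) and a 16 kHz sample rate, which the slides do not state.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value at room temperature

def tdoa_seconds(spacing_m, angle_deg):
    """Arrival-time difference between two mics for a far-field source.

    angle_deg is measured from the line joining the mics:
    0 degrees = end-fire, 90 degrees = broadside.
    """
    return spacing_m * math.cos(math.radians(angle_deg)) / SPEED_OF_SOUND

spacing = 0.071          # assumed 71 mm spacing
fs = 16000               # assumed sample rate in Hz
for name, angle in [("end-fire", 0.0), ("broadside", 90.0)]:
    tau = tdoa_seconds(spacing, angle)
    print(f"{name}: {tau * 1e6:.0f} us = {tau * fs:.2f} samples at {fs} Hz")
```

End-fire gives the largest possible time difference for a given spacing, while broadside gives essentially none, so the beamformer has nothing to work with for a talker directly in front of a broadside pair.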

Noise Reduction

Stationary Noise Reduction
[Figure: before/after comparison. The example demonstrates improvement in automotive environments.]
• Effective against:
• Fan noise
• Automotive road noise
• Microphone self noise
• Creates a model of the background noise and then removes it in real time
• Improves ASR performance by 2 to 3
dB
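One textbook way to "model the background noise and remove it" is magnitude spectral subtraction. The sketch below is that generic technique, not the algorithm used in the product described here; the frame size, spectral floor, and synthetic tone-plus-noise test signal are all assumptions.

```python
import numpy as np

def spectral_subtract(x, noise_only, frame=256, floor=0.05):
    """Subtract an average noise magnitude spectrum, frame by frame."""
    hop = frame // 2
    win = np.hanning(frame + 1)[:frame]          # periodic Hann: 50% overlap sums to 1
    # Noise model: mean magnitude spectrum over frames that contain only noise.
    nf = [noise_only[i:i + frame] * win
          for i in range(0, len(noise_only) - frame + 1, hop)]
    noise_mag = np.mean(np.abs(np.fft.rfft(nf, axis=1)), axis=0)
    out = np.zeros(len(x))
    for i in range(0, len(x) - frame + 1, hop):
        spec = np.fft.rfft(x[i:i + frame] * win)
        # Keep a small fraction of the original magnitude as a spectral floor.
        mag = np.maximum(np.abs(spec) - noise_mag, floor * np.abs(spec))
        out[i:i + frame] += np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
    return out

rng = np.random.default_rng(2)
fs, n = 16000, 16000
t = np.arange(n) / fs
tone = 0.5 * np.sin(2 * np.pi * 440 * t)         # stand-in for speech
noise = 0.2 * rng.standard_normal(n)             # stand-in for fan or road noise
noise_ref = 0.2 * rng.standard_normal(n)         # separate noise-only recording
cleaned = spectral_subtract(tone + noise, noise_ref)

def err_db(y):
    core = slice(256, n - 256)                   # skip partially overlapped edges
    return 10 * np.log10(np.mean((y[core] - tone[core]) ** 2))

print(f"error before: {err_db(tone + noise):.1f} dB, after: {err_db(cleaned):.1f} dB")
```

A real-time implementation would update the noise model continuously during pauses in speech instead of using a separate noise-only recording.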

Interference Canceler
• Effective against noise from:
• TVs
• Appliance self noise
• Air conditioners
• Requires a minimum of 2 microphones
• Combines beamforming, adaptive filtering,
and other statistical signal processing
techniques
• Effective for music and speech interferers
• Improves ASR performance by up to 30 dB!
2 Microphone Example

Adaptive Interference Canceler Performance
• Measured in a typical living
room environment
• Interfering music noise
played
• Speech at constant level (62
dBC) at DUT
• Varied music level
• Speech and noise 2 meters
from DUT
[Chart: performance relative to Amazon Echo Plus (7-mic) and Echo 2 (7-mic) for DSPC 2-mic, DSPC 4-mic, and DSPC 6-mic designs, with annotations "8 dB better" and "11 dB better"]

AEC

Acoustic Echo Cancellers (AEC)
• Eliminates loudspeaker sound at the microphone
• Enables Voice UI to function while music or text-to-
speech is active
• Music is usually ducked after the wake word is detected
• Best algorithms operate in the frequency domain
• Better cancellation
• Faster convergence
• Lower computation
• ERLE = Echo Return Loss Enhancement quantifies performance: how many dB of loudspeaker signal is canceled by the AEC
Demo Setup
Single microphone with
loudspeaker close to the mic.
Mono playback in home
environment.
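A minimal echo canceller sketch for a single microphone and mono playback, with the cancellation measured in dB. It uses time-domain NLMS for brevity, even though the slide notes that the best algorithms work in the frequency domain; the echo path, noise level, and step size are assumptions.

```python
import numpy as np

def nlms_aec(mic, ref, taps=64, mu=0.5, eps=1e-6):
    """Time-domain NLMS echo canceller; returns the echo-cancelled signal."""
    w = np.zeros(taps)                 # adaptive estimate of the echo path
    buf = np.zeros(taps)               # most recent reference samples, newest first
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = ref[n]
        e = mic[n] - w @ buf           # residual after subtracting the echo estimate
        w += mu * e * buf / (buf @ buf + eps)
        out[n] = e
    return out

def erle_db(mic, residual):
    """Echo power removed by the canceller, in dB."""
    return 10 * np.log10(np.mean(mic**2) / np.mean(residual**2))

rng = np.random.default_rng(3)
ref = rng.standard_normal(20000)                        # loudspeaker (reference) signal
path = rng.standard_normal(32) * np.exp(-np.arange(32) / 8.0)  # hypothetical echo path
mic = np.convolve(ref, path)[: len(ref)]                # echo picked up by the microphone
mic += 1e-3 * rng.standard_normal(len(ref))             # small microphone self noise
res = nlms_aec(mic, ref)
tail = slice(10000, None)                               # measure after convergence
print(f"cancellation: {erle_db(mic[tail], res[tail]):.1f} dB")
```

This idealized setup is perfectly linear, so the cancellation is limited only by the added noise; real products are limited by loudspeaker distortion and room reverberation.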

Factors Affecting AEC Performance
• What type of algorithm are you using?
• Time domain vs frequency domain
• LMS vs Kalman vs Other?
• Echo tail length
• How many msec of audio can you cancel?
• Longer is better but requires more processing
and memory
• Far-field smart speakers require 150 to 200
msec of echo tail
• Reverberation time of the room (lower
is better)
• Linearity of your loudspeakers

Speaker Distortion Affects AEC
• This is usually the limiting factor for AEC performance
• Loudspeakers distort when playing loud or low frequencies
• Speakers need to be tuned to minimize distortion
• Rule of thumb: 1% THD → AEC up to 40 dB
• Product developers must trade off low frequency sound quality against voice performance

Rule of Thumb for Speaker Distortion
1. Play a low frequency sine wave through
your loudspeaker and plot the
spectrum
2. You’ll see harmonics at multiples of the
fundamental frequency
3. The largest harmonic determines the
absolute limit of the echo canceler
4. ERLE performance is limited by the level difference between the fundamental and the largest harmonic
5. Repeat at different output levels and frequencies
OK. 30 dB down = 30 dB max ERLE.
Bad. 15 dB down = 15 dB max ERLE.
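The measurement steps above can be sketched as follows. The "measured" signal is synthetic (a 100 Hz sine with a third harmonic deliberately placed 30 dB down), and `max_erle_from_harmonics` is a hypothetical helper name; a real test would analyze a microphone recording of the loudspeaker.

```python
import numpy as np

def max_erle_from_harmonics(signal, fs, f0, n_harmonics=5):
    """Level difference in dB between the fundamental and the largest harmonic."""
    spec = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)

    def peak(f):
        k = int(np.argmin(np.abs(freqs - f)))
        return spec[max(k - 2, 0): k + 3].max()   # tolerate slight bin misalignment

    fundamental = peak(f0)
    harmonics = [peak(f0 * h) for h in range(2, n_harmonics + 2)]
    return 20 * np.log10(fundamental / max(harmonics))

fs, f0 = 48000, 100.0
t = np.arange(fs) / fs                             # one second of signal
clean = np.sin(2 * np.pi * f0 * t)
# Stand-in for a measured loudspeaker output: add a third harmonic 30 dB down.
measured = clean + 10 ** (-30 / 20) * np.sin(2 * np.pi * 3 * f0 * t)
limit = max_erle_from_harmonics(measured, fs, f0)
print(f"largest harmonic is {limit:.1f} dB down")
```

Repeating this over several drive levels and frequencies, as step 5 suggests, shows where distortion starts to cap echo cancellation.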

AECs and Speaker Processing
Reference signal must be taken after nonlinear processing.
DRC = Dynamic range compression. This includes nonlinear processing like compressors and limiters.
[Diagram 1: EQ → DRC → DAC → AMP, with the AEC reference (Ref) taken after the DRC]
[Diagram 2: the same chain with a crossover added after the DRC]
Crossovers after the DRC are allowed. Higher-order crossovers perform better.

Multichannel Echo Cancelers
• Some applications
require multichannel
echo cancelers (e.g.,
soundbars)
• For optimal performance,
you need to cancel all the
channels. Downmixing
reduces performance.
• The example to the right
shows what happens
when you have a 3
channel product and
apply a 2 channel AEC
Full performance when using a
3 channel AEC to cancel L, R,
and C speakers.
Reduced performance when
downmixing to 2 channels and
using a stereo echo canceler.
L’ = L + 0.5 * C
R’ = R + 0.5 * C
Performance reduced
by 5 to 10 dB
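A toy model shows why downmixing costs performance. Three independent playback channels reach the microphone with different gains, and the best fixed linear canceller is fitted by least squares, first with all three references and then with the L’ and R’ downmix from the slide. The single-gain echo paths and the gain values are assumptions, so the resulting numbers are not meant to reproduce the 5 to 10 dB figure.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50000
L, R, C = rng.standard_normal((3, n))            # independent playback channels
g = np.array([1.0, 0.2, 0.8])                    # hypothetical echo gains for L, R, C
mic = g @ np.stack([L, R, C]) + 0.01 * rng.standard_normal(n)

def erle_db(refs):
    """Cancellation of the best fixed linear canceller given these references."""
    A = np.stack(refs, axis=1)
    coef, *_ = np.linalg.lstsq(A, mic, rcond=None)
    return 10 * np.log10(np.mean(mic**2) / np.mean((mic - A @ coef) ** 2))

erle_3ch = erle_db([L, R, C])
erle_2ch = erle_db([L + 0.5 * C, R + 0.5 * C])   # L' and R' from the slide
print(f"3-channel references: {erle_3ch:.1f} dB, 2-channel downmix: {erle_2ch:.1f} dB")
```

With only two references, the canceller cannot model the center channel's echo path separately from the left and right paths, so some echo always remains.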

Woofer Reference Mic
• Work done in conjunction with Vesper
• Uses a new high AOP microphone
placed directly in front of the woofer
• Advanced processing improves ERL by
up to 15 dB
• Trigger word performance at max
playback level:
• Standard processing: 63%
• Advanced processing: 91%
• Similar feature used in the HomePod

Amazon Test Setups
Used for
most tests
Used for AEC test only

Understanding Amazon Results
• False Alarm Tests
• Number of false alarms using Amazon’s 24-hour continuous talking test track
• The lower the better
• Trigger Detection
• % of time that the device wakes up when “Alexa” is spoken
• Tested in silence, kitchen noise, music noise, and during music playback
• The higher the better
• Response Accuracy Rate (RAR)
• % of time that the cloud accurately understood the question (e.g., “Alexa, what is the capital of China”)
• Tested in silence, kitchen noise, and music noise
• The higher the better

Testing Scenarios
Silence
No interfering sound, uttering “Alexa” at 62 dBC
Kitchen Noise (0, -3 dB, -6 dB)
Alexa utterance at 62 dBC / Noise at 62, 65, and 68 dBC
Music Noise (0, -3 dB, -6 dB)
Alexa utterance at 62 dBC / Music at 62, 65, and 68 dBC
Acoustic Echo Canceler
Music playback at 90 dBC while trigger words are played at 62 dBC.
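The SNR labels in these scenarios are the difference between the speech level and the interferer level. A short check of the arithmetic, using only the levels listed above (`snr_db` is an illustrative helper name):

```python
SPEECH_DBC = 62  # level of the "Alexa" utterance in every scenario

def snr_db(speech_dbc, noise_dbc):
    """SNR in dB is the difference between the two sound levels."""
    return speech_dbc - noise_dbc

for noise_dbc in (62, 65, 68):
    print(f"noise at {noise_dbc} dBC -> SNR {snr_db(SPEECH_DBC, noise_dbc):+d} dB")
print(f"AEC test, playback at 90 dBC -> {snr_db(SPEECH_DBC, 90):+d} dB")
```

The AEC scenario is by far the hardest in these terms, since the device's own playback is 28 dB louder than the trigger word.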

Living Room Results – Trigger Detection

Living Room Results - RAR

Many Performance Levels
Low Power / Near-field
1 or 2 mics
ARM Cortex-M4
20 to 30 MHz
Basic Far-Field
2-mics. Mono
ARM Cortex-M7 or Cortex-A53
200 MHz
High-Performance Far-Field
4+ mics. Stereo
ARM Cortex-A53
350 to 600 MHz
High-Performance Far-Field
4+ mics. Multichannel
ARM Cortex-A53
900 to 1200 MHz

Processor Comparisons
Processor efficiency per MHz. The larger the better.
• ARM Cortex-M4: 0.26 (ST, NXP, Renesas, Ambiq, Quicklogic)
• ARM Cortex-M7: 0.45 (ST, NXP)
• ARM Cortex-A35: 0.37 (Mediatek)
• ARM Cortex-A53: 0.48 (NXP, Amlogic, Qualcomm)
• ARM Cortex-A72: 0.98 (Coming soon!)
• Tensilica HiFi 4: 1.00 (NXP, Mediatek, Amlogic)
ARM Cortex-A53 is the sweet spot for smart speakers.
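The efficiency figures can be used for a first-order estimate of how a clock budget translates between cores. The sketch below assumes the load scales purely with efficiency per MHz, which ignores memory, cache, and other system effects; `equivalent_mhz` is a hypothetical helper.

```python
EFFICIENCY_PER_MHZ = {
    "ARM Cortex-M4": 0.26,
    "ARM Cortex-M7": 0.45,
    "ARM Cortex-A35": 0.37,
    "ARM Cortex-A53": 0.48,
    "ARM Cortex-A72": 0.98,
    "Tensilica HiFi 4": 1.00,
}

def equivalent_mhz(mhz, source, target):
    """Scale a clock budget from one core to another using efficiency per MHz."""
    return mhz * EFFICIENCY_PER_MHZ[source] / EFFICIENCY_PER_MHZ[target]

# Example: a 200 MHz budget on a Cortex-A53 expressed on the other cores.
for core in EFFICIENCY_PER_MHZ:
    print(f"{core}: {equivalent_mhz(200, 'ARM Cortex-A53', core):.0f} MHz")
```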

Smart Speaker Designs
• 360-degree operation
• Microphones on top of product
• 40 to 75 mm diameter
• Physically separate microphones and
loudspeakers for best performance
• Mono or stereo playback
High-End
Standard

Sound Bar Designs
• Microphones on top of product near center of device
• 60 to 75 mm design
• Physically separate microphones and loudspeakers
for best performance
• Stereo or multichannel playback (up to 7 reference
channels)
• Compatible with Dolby Atmos
High-
End
Standard

TV Designs
Placement options
• Top is better than bottom
• Further away from speakers
• Bottom usually wins out because of
lower cost
• Mics do not have to be centered
• 2 mics sufficient
Good
Better

Set-Top Box Designs
• Top of Device
• Tethered “puck”
• Support for optional internal
speaker for voice playback
• Audio playback through HDMI
High-
End
Standard

Appliance / Tablet Designs
• 2 or 4 microphone linear array
• 25 to 75 mm design
• Physically separate microphones
and loudspeakers for best
performance
• Mono or stereo playback
Good
Better

Design Guidelines – Microphones
Far Field Products
• Microphones should be placed on the top of the product, if possible.
• Microphones should be on a flat horizontal surface
• Microphones should be visible to the user (not occluded)
• Flat line arrays are not recommended. Use them only as a last resort. (Microphone arrays work best if the microphones are displaced in the horizontal plane)
• Microphones need to be properly ported (see design guidelines from
microphone vendor)
• 4 microphones are sufficient for most products

Design Guidelines – Microphones
Far Field Products
• SNR of 65 dB. Higher SNRs provide no benefit for voice recognition but do benefit voice communication
• Gain matching:
• ±1 dB from 200 Hz to 6 kHz (recommended)
• ±1 dB from 200 Hz to 4 kHz and ±3 dB from 4 kHz to 7 kHz (required)
• Microphone AOP must be high enough so that the system doesn’t clip when loudspeakers are played at full volume. Recommendations:
• 120 dB SPL for smart speakers
• 130 dB SPL for sound bars
• 40 to 70 mm microphone spacing is recommended. As small as 20 mm is possible with some degradation in performance.

Microphone Acoustical Porting
(No Common Cavity)
[Two cross-section diagrams of MEMS mics on a PCB behind vent holes in the case: one with an individual gasket per microphone, one with a common cavity shared by all microphones]
You need individual gaskets to make a direct connection between each mic and its vent hole.
If you block a microphone hole with putty, you should see the level drop by at least 30 dB.
This design with a common cavity shared by all microphones won’t work.

Design Guidelines – Microphones (A)
In Ear Products
• 2 microphones are sufficient for most products
• Use 2 microphones in an end-fire configuration pointing towards the mouth
• Space microphones as far apart as possible. 10
mm is the minimum spacing. 20 mm is
preferred
• Microphone on end of “boom” improves
performance

Overview
What Happens in Practice
• Microphone selection
• The Physical world in front of the
mic
• No Man’s Land between the mic
and speaker (leakage)
• Loudspeakers – good, bad and ugly
• Software integration issues

MEMS Microphone selection cheat sheet
• Analog or digital?
• Analog single-ended or balanced?
• Top or bottom port?
• Standard size or compact?
• AOP – Acoustic Overload Point?
• S/N – Signal to Noise?
• Sensitivity (ASIC gain)?
• Robustness (IPXX)?

MEMS Microphones – what is inside?
• MEMS mic element + ASIC in a package
• Wiring between MEMS mic die and ASIC
• Typical package envelope of 3.50 mm x 2.65 mm x 0.98 mm
• Smaller footprint on some models but reduced back volume = reduced S/N
• Faraday shield on some models

Microphones - Analog vs digital?
What are the mic inputs on the codec or SoC (System on Chip)?
• Analog single-ended
• Analog pseudo balanced
• Digital – PDM

Microphones – Top or Bottom port?
• The MEMS SMT package can have the sound aperture either on the top or bottom
• If it is on the bottom, the circuit board the mic is soldered to (the flex PCB) must have a hole that aligns with the MEMS mic port
• Bottom port warning
• Sealing: back port SMT seal eyelet

Microphones – signal to noise
• S/N was once a deal killer for most serious applications, but MEMS mics have caught up with ECMs, with commodity analog and digital MEMS reaching beyond 60 dB S/N.
• Active noise canceling headphones, hearing aids, and voice command call for 65 dB S/N or better
• 70+ dB from a few vendors by the start of 2021 (but this keeps slipping!)
• Better S/N = fewer mics? There is some discussion that higher S/N allows a reduction in the number of mics required

Microphones
• Analog MEMS mics: single-ended or balanced differential outputs?
• Balanced analog output is good defensive engineering if your product will have longer wire runs, digital noise, or EMI/RF floating around
• How differential is the MEMS mic topology? True differential capacitive MEMS mics use dual grids for better noise immunity than single-ended designs

Microphones - Digital
• Digital MEMS mics offer greater immunity to interference than analog MEMS mics
• If time to market matters and you want to avoid tweaking and reworking your board layout when noise problems appear, digital is the way to go
• If mic performance is critical for your type and class of product, analog may be better with an external premium codec (for both AOP and noise floor)

Microphones – Acoustic Overload Point (AOP)
• Is AOP due to mic element saturation or ASIC overload clipping?
• Analog MEMS mics typically have a better acoustic overload point (AOP), which is where serious distortion sets in (the codec overloads before the MEMS mic element)
• Analog MEMS mics overload a bit more gracefully than digital: when an A/D codec overloads, it is a line in the sand, and nasty.
• Digital MEMS AOP can be as low as 116 dB and is more typically 120 dB. Analog AOP tends to be over 120 dB and can be 130+ dB on some MEMS mics.
• Vesper’s piezo MEMS mics have versions with very high AOP.

Microphones - Directivity
• MEMS mics are omni-directional
• To achieve directional characteristics they are used in arrays
• One requirement for mic arrays is that the mics are closely matched in sensitivity and response and are able to maintain that uniformity over time

The physical world in front of
the mic

Microphones – the world around the mic
Key topics
• MEMS mics are mounted to flex PCB using SMT reflow along with the rest of the SMT components
• Port Helmholtz resonance –
moving it out of band
• The port and wind noise
• Laminar entry
• Acoustic mesh

Microphones - What are membranes for?
Woven and non-woven membranes are used for:
• Wind noise and water blocking
• Acoustic resistance determines the crossover to DSP wind noise filtering
• Dust problems: an internal membrane (within the package) blocks SMT reflow gases
• Field use issue: response shifts over time, e.g., gunk in the membrane over mics facing a stove top

Wind noise blocking/acoustic mesh
• Mic element overloaded/
saturated by wind
• Wind pressure must be blocked
acoustically (acoustic resistance
membrane)
• Mic overload cannot be fixed by
DSP (but some turbulence can
be filtered out)
• Acoustic mesh can also block liquids (hydrophobic & oleophobic)

Port and wind noise
• Laminar entry (flared aperture)
• (turbulence in port to be
avoided)
• Port Helmholtz resonance peak
– moving it out of band
• Acoustic mesh damps peak Q

The physical world between the
mic and speaker

Leakage between the mic & speaker
Audio output leakage is both airborne and through the enclosure structure
• Minimizing Airborne leakage
• keep the mic(s) and speakers as far apart as possible
• avoid overlapping the mic(s) pickup pattern and speaker radiation pattern
• Structural transconduction (microphonics)
• Enclosure housing – ribs, joints, wall thickness
• Plastics are not all equal
• speaker sub-enclosure isolation mounts (grommets or gaskets)
• mic isolation

Construction and Materials
• Plastics have different
acoustical characteristics
• Stiffness and damping are
key factors
• Compatibility considerations
• Shrink
• Tool temperature
• Flow
• Impact strength
• Sink marks/wall thickness

The Incumbent plastics
• ABS
• PC
• ABS+PC
• PP

Acoustically engineered plastics
• TreBlend (Ineos) PA/SAN
• Cellulose Plastics
• Treva (Eastman)
• Symbio (Sappi)
• Thicker walls/ribs without sink marks
Genelec M040 – NCE enclosure

The physical world of speakers
and the AEC Achilles heel -
distortion

Enclosure Mechanical Engineering
• Open the window more and more bugs come in
• More power and more bass = no gain without pain
• Increase acoustic output before feedback and AEC breakdown by
reducing the cabinet resonance peak
• Extending low-end response of product will shake things up more

Speaker Nonlinearities AEC issues
• Loudspeaker nonlinearities (distortion) are the enemy of AEC
• Low-end distortion affects AEC even when it is not audible to listeners
• Fine tuning of suspension and motor nonlinearities is critical
• Or source off-the-shelf application-specific speakers optimized for AEC and ANC

50 mm AEC / ANC optimized speakers
• Application-specific ANC and AEC high linearity / lower distortion speakers to meet TIA 930
• Typically around 50 mm diameter
• SEAS
• Tymphany
• Stetron

subVo servo feedback correction
Next generation solution for increased AEC headroom
• subVo bend-sensor provides distortion reduction at
the lower octaves enabling increased AEC
headroom
• Precision position sensor provides error correction
feedback
• 10 dB of feedback = 10 dB of piston range distortion
reduction

Software Integration Issues

Software Integration Challenges
• Real-time CPU load
• Wrong interrupt levels
• Dropping samples / blocks
• Non constant latency between mics and reference signals
• Misconfigured PDM filters
• Different clocks for mics and reference signals

Example #1: Noisy PDM Microphones
[Diagram: PDM Bitstream → PDM to PCM Converter → PCM Samples]
Problem Statement
• ASR accuracy only 72% in quiet speech conditions
• High quality microphone:
• -41 dB sensitivity / 66 dB SNR
• Noise floor expected at 28 dBA
• Noise floor measured at 39 dBA
• Root cause
• PDM to PCM converter was implemented with
16-bit math
• Generated noise floor was at -96 dBFS → 39
dBA
• Solution
• Implement PDM to PCM conversion in software
• ASR accuracy improved to 94%
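The numbers in this example can be reproduced with a few lines of arithmetic. The sketch assumes the -41 dB sensitivity is in dBFS at the standard 94 dB SPL reference, and it treats the 16-bit quantization floor as the converter's noise floor without A-weighting; both are assumptions, though they match the figures on the slide.

```python
import math

REF_SPL_DB = 94.0          # standard reference level for microphone sensitivity (1 Pa)
sensitivity_dbfs = -41.0   # from the slide; assumed to be dBFS at 94 dB SPL
mic_snr_db = 66.0          # from the slide

full_scale_spl = REF_SPL_DB - sensitivity_dbfs            # SPL that maps to 0 dBFS
expected_floor = REF_SPL_DB - mic_snr_db                  # noise floor of the mic itself
quantization_dbfs = 20 * math.log10(2.0 ** -16)           # about -96 dBFS for 16-bit math
measured_floor = full_scale_spl + quantization_dbfs       # floor set by the converter

print(f"0 dBFS corresponds to {full_scale_spl:.0f} dB SPL")
print(f"expected noise floor: {expected_floor:.0f} dBA")
print(f"16-bit floor: {quantization_dbfs:.1f} dBFS -> {measured_floor:.0f} dBA")
```

The converter's floor sits about 11 dB above the microphone's own noise, so the 16-bit math, not the microphone, was setting the system noise floor.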

Example #2: Incorrect thread priorities
CPU Load Problems
Audio processing was taking 18% on average but there were large spikes. Bluetooth thread priority was incorrectly set higher than real-time audio processing.
Corrected Thread Priorities
Steady and consistent CPU load
[Chart: CPU Load over Time, Peak and Average, vertical axis 0 to 140]
[Chart: CPU Load over Time, Peak and Average, vertical axis 0 to 100]

Q&A
