St Slides

Sound Texture: Wavelet Tree
Learning and Tiling and Stitching
Antonio De Sena and Pietro Polotti
desena@sci.univr.it, polotti@sci.univr.it

Università degli Studi di Verona

Goals
Illustrate the different definitions of sound texture
proposed by different authors.
Present the basic ideas of two different approaches
for analyzing and synthesizing sound textures.
Solicit from the audience a proposed
definition/classification of audio/sound texture.


Definitions (1)

Definition by Dubnov et al., Hebrew University,
Jerusalem, Israel (2002) [1]:
“We can describe sound textures as a set of
repeating structural elements (sound grains) subject
to some randomness in their time appearance and
relative ordering but preserving certain essential
temporal coherence and across-scale localization.”
Ex: “. . . natural and artificial sounds such as rain, a
waterfall, traffic noises, people babble, machine
noises, and so on.”
Fundamental assumption: “. . . the sound signals are
approximately stationary at some scale.”
Comment: this is according to a precise analytical
tool.

Definitions (2)

Definition by Dubnov and Tishby, Hebrew
University, Jerusalem, Israel (1997) [2]:
“Sound texture can be considered as stationary
acoustical phenomena that obtain their acoustical
effects from internal variations in the sound
structure.”

Variations like:
“. . . micro-fluctuations in the harmonics of a pitched
sound or statistical properties of random excitation
source in an acoustic system.”


Definitions (3)

Definition by Parker and Behm, University of
Calgary, Canada (2004) [3]:
“A sound texture can be described as having a
somewhat random character (?), but a recognizable
quality. Any small (?) sample of a sound texture
should sound very much like, but not identical to, any
other small sample.”
Comment: This is very qualitative.
Definition by Norris and Denham, University of
Plymouth, (2003) [4]:
“A sound texture may be loosely defined as a sound
which may have some local structure, but has no
perceptually obvious long-term structure.”
Comment: This is rather vague.

Definitions (4)

Definition by Athineos and Ellis, Columbia
University, U.S.A. (2003) [5]:
“. . . we look at a third class of sounds we call sound
textures that are distinct from speech and music.”

“. . . textures should have an undetermined extent
(duration) with consistent properties (at some level),
and be readily identifiable from a small (?) sample.”

Comment: “. . . consistent properties” is a bit vague.
Comment: “. . . identifiable from a small sample”
seems to be a perceptual criterion (?).
They consider the existence of a global structure in
time.

Two different approaches
Creating Sound Textures by Example: Tiling and
Stitching.
Starting from image processing methods (tiling and
stitching), Parker and Behm [3] developed a new
method for creating sound textures.
Creating Sound Textures through Wavelet Tree
Learning.
Starting from an image processing method
developed in [6], Dubnov et al. [1] extend this method to
audio signals for the creation of sound
textures.


Creating Audio Texture by Example: Tiling and Stitching (1)

Definition by Parker and Behm, University of
Calgary, Canada (2004) [3]:
“A sound texture can be described as having a
somewhat random character, but a recognizable
quality. Any small sample of a sound texture should
sound very much like, but not identical to, any other
small sample.”
Comment: This is very qualitative.
Examples: waterfall, rain, traffic noises . . .
For every chunk, the frequency distribution should
not change, nor should any rhythmical pattern or
timbre characterization.



Tiling and Stitching based methods.
Image Quilting (image processing).
Square sample blocks with fixed size.
Overlap between adjacent blocks.
Select blocks that have some significant measure of
agreement between them (a).
Smoothing edges for reducing the “mosaic” effect (a).

(a) No more information provided.



Tiling and Stitching based methods.
Chaos Mosaic (image processing).
Start with only one block.
The image is created by copying the block (tiling) to fill the
requested size.
A chaos transformation needs to be applied,
e.g. Arnold’s Cat Map:

$$x_{l+1} = (x_l + y_l) \bmod m$$
$$y_{l+1} = (x_l + 2 y_l) \bmod m$$

This transformation maps the output image onto itself.
Here the image size is m × m, and l is the iteration number.
Applied to blocks of pixels (not to single pixels, to
preserve local features).
Smoothing edges (or fade) for reducing “mosaic” effect.
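
A minimal sketch of the block permutation described above, assuming a square grid in which each cell stands for a whole block (of pixels or of audio samples). The function names `cat_map` and `scramble_blocks` are hypothetical and are not taken from the cited papers.

```python
def cat_map(x, y, m):
    """One iteration of Arnold's Cat Map on an m x m grid."""
    return (x + y) % m, (x + 2 * y) % m


def scramble_blocks(grid, iterations=1):
    """Permute the cells of a square grid; each cell stands for a whole block."""
    m = len(grid)
    for _ in range(iterations):
        out = [[None] * m for _ in range(m)]
        for x in range(m):
            for y in range(m):
                nx, ny = cat_map(x, y, m)
                # The map is a bijection on the grid, so every cell is filled once.
                out[nx][ny] = grid[x][y]
        grid = out
    return grid


# Toy 4 x 4 grid of block labels (illustrative data only).
blocks = [[f"b{r}{c}" for c in range(4)] for r in range(4)]
scrambled = scramble_blocks(blocks, iterations=1)
```

Because the map only moves whole cells around, the content inside each block is left untouched, which is what preserves local features.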



Stitching based methods: generation from a sound
texture sample.
The sound texture needs to be separated into blocks of
equal duration.
Using these blocks, a longer sample can be created.
A least-squares measure is used to find blocks whose head
(first 15%) is similar to the tail (last 15%) of the previous one.
Blocks are chosen using an LRU (Least Recently Used)
algorithm (in combination with the least-squares measure) to
force the procedure to pick up all the chunks.
Chunks are cross-faded (15%).
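
The stitching steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the way the least-squares mismatch is combined with a recency penalty (`lru_weight`) and the linear cross-fade are guesses, since the slides do not give these details.

```python
import math

OVERLAP = 0.15  # head/tail fraction taken from the slide


def split_blocks(signal, block_len):
    """Cut the signal into equal-length blocks, dropping any remainder."""
    n = len(signal) // block_len
    return [signal[i * block_len:(i + 1) * block_len] for i in range(n)]


def mismatch(prev, cand, ov):
    """Sum of squared differences between the tail of prev and the head of cand."""
    return sum((a - b) ** 2 for a, b in zip(prev[-ov:], cand[:ov]))


def crossfade(out, block, ov):
    """Append block to out, linearly cross-fading over the last ov samples."""
    for i in range(ov):
        w = (i + 1) / (ov + 1)
        out[-ov + i] = (1 - w) * out[-ov + i] + w * block[i]
    out.extend(block[ov:])


def stitch(signal, block_len, n_blocks, lru_weight=1.0):
    """Build a longer signal from at least two source blocks."""
    blocks = split_blocks(signal, block_len)
    ov = max(1, round(OVERLAP * block_len))
    last_used = [-1] * len(blocks)  # step at which each block was last used
    current = 0
    last_used[current] = 0
    out = list(blocks[current])
    for step in range(1, n_blocks):
        def cost(j):
            # Assumed combination: squared mismatch plus a penalty that is
            # larger for blocks used more recently (an LRU-like preference).
            recency = 0.0 if last_used[j] < 0 else 1.0 / (step - last_used[j])
            return mismatch(blocks[current], blocks[j], ov) + lru_weight * recency
        candidates = [j for j in range(len(blocks)) if j != current]
        current = min(candidates, key=cost)
        last_used[current] = step
        crossfade(out, blocks[current], ov)
    return out


# Toy source signal (illustrative data only).
source = [math.sin(0.1 * i) for i in range(400)]
texture = stitch(source, block_len=40, n_blocks=6)
```

Each appended block overlaps the previous one by the cross-fade length, so the output grows by `block_len - ov` samples per step.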



Chunk size can be determined using amplitude
peaks. The entire source sample is analyzed for
RMS amplitude, and amplitude peaks more
than 1.5 standard deviations from the baseline are
recorded. The mean and standard deviation of the
observed distances between these peaks are used to
generate the size of each chunk.
Hopefully, then, each chunk will contain one “feature”
that a listener can recognize.
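
A rough sketch of this chunk-size estimate. Several details are assumptions, because the slides do not specify them: the RMS is computed on non-overlapping frames, the "baseline" is taken to be the mean of the RMS envelope, and the chunk length is drawn from a normal distribution with the measured mean and standard deviation. All function names are hypothetical.

```python
import math
import random
import statistics


def rms_envelope(signal, frame):
    """RMS amplitude of consecutive non-overlapping frames."""
    return [
        math.sqrt(sum(s * s for s in signal[i:i + frame]) / frame)
        for i in range(0, len(signal) - frame + 1, frame)
    ]


def peak_frames(env, k=1.5):
    """Indices of local maxima more than k standard deviations above the mean."""
    thresh = statistics.mean(env) + k * statistics.pstdev(env)
    return [
        i for i in range(1, len(env) - 1)
        if env[i] > thresh and env[i] >= env[i - 1] and env[i] > env[i + 1]
    ]


def chunk_length(signal, frame=64, k=1.5, rng=random):
    """Draw a chunk length (in samples) from the distribution of peak spacings."""
    peaks = peak_frames(rms_envelope(signal, frame), k)
    gaps = [(b - a) * frame for a, b in zip(peaks, peaks[1:])]
    if len(gaps) < 2:
        raise ValueError("not enough peaks to estimate a chunk size")
    mu, sigma = statistics.mean(gaps), statistics.stdev(gaps)
    return max(frame, int(rng.gauss(mu, sigma)))


# Toy signal: a quiet tone with a loud burst every 1024 samples (illustrative only).
signal = [
    (1.0 if i % 1024 < 64 else 0.05) * math.sin(0.3 * i)
    for i in range(8192)
]
length = chunk_length(signal, rng=random.Random(0))
```

With regularly spaced bursts the spacing has zero spread, so every drawn chunk spans exactly one burst period; on real recordings the lengths would vary around the mean spacing.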



Tiling based methods: (chaos mosaic) generation
from a sound texture sample.
Make a matrix whose rows are exactly long enough to hold
one period at the “dominant” (?) frequency, or an
integer number of periods.
Fill the matrix with the sample (row by row).
Partition it into rectangular regions. The width of these
regions is computed using the dominant frequency
(e.g. width = n · Fd, with n 150).
The corners of the regions are randomly moved using
a normal distribution (with d = 15% of the box size).



Arnold’s Cat Map is applied with half-size blocks
to create a background.
The background is necessary because the next step can
leave “holes” in the wave.
Arnold’s Cat Map is then applied with normal-size blocks
(overlaid on the background, without adding to it).



Comments.
Idea: handicraft work.
Two ideas readapted from image processing.
Textures: not bad, but there are some problems such as
rhythmical patterns, time-envelope problems, and
repetitions.



Appendix: Synthesis with Gaussian Pyramid.
Again, an idea taken from image processing.
A wavelet-like pyramid (MRA tree) built with a
Gaussian filter (lowpass) and the difference between the
original and the filtered signal (details, bandpass).
No full description is available, only a brief single-page
explanation at
http://pages.cpsc.ucalgary.ca/~parker/gamesresearch/tsketch-texture.pdf .
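
Since no full description is available, the following is only a generic 1-D pyramid in the spirit of the slide, not the authors' method. It assumes a 5-tap binomial kernel as the Gaussian approximation, decimation by two at each level, and linear interpolation when going back up; the detail layers store the difference between each signal and its smoothed, re-expanded version.

```python
import math

KERNEL = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)  # binomial approximation of a Gaussian


def lowpass(x):
    """Smooth x with KERNEL, clamping indices at the signal borders."""
    n, h = len(x), len(KERNEL) // 2
    return [
        sum(w * x[min(max(i + j - h, 0), n - 1)] for j, w in enumerate(KERNEL))
        for i in range(n)
    ]


def expand(x, n):
    """Upsample x to length n by linear interpolation between neighbours."""
    last = len(x) - 1
    return [(x[min(i // 2, last)] + x[min(i // 2 + i % 2, last)]) / 2 for i in range(n)]


def build_pyramid(x, levels):
    """Return (details, residual): band-pass layers plus the final low-pass signal."""
    details = []
    for _ in range(levels):
        coarse = lowpass(x)[::2]
        up = expand(coarse, len(x))
        details.append([a - b for a, b in zip(x, up)])
        x = coarse
    return details, x


def reconstruct(details, residual):
    """Invert build_pyramid by adding each detail layer back in."""
    x = residual
    for d in reversed(details):
        x = [a + b for a, b in zip(expand(x, len(d)), d)]
    return x


# Toy signal (illustrative data only).
x = [math.sin(0.2 * i) + 0.3 * math.sin(1.7 * i) for i in range(64)]
details, residual = build_pyramid(x, levels=3)
```

Storing the difference against the re-expanded signal makes the decomposition invertible up to rounding, whatever low-pass kernel is chosen.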


Sound examples and comments

Creating Audio Texture by Example: Tiling and
Stitching.
Crowd (audience): macro-evident repetitiveness (the examples
sound like juxtaposed reiterated patterns).
Time envelope problems: “volume discontinuity”.
Fire: less macro-evident repetitiveness (sound examples of
juxtaposed repeated patterns).
No time envelope problems.
Water: the “chaos” example is the best.
Other examples: time-envelope problems, a feeling of “volume
discontinuities”.
In general it is the least repetition-like.
Surf and gulls: Block copy, obtained with small windows, thus
less repetition of temporal (almost rhythmical) patterns.


Synthesizing Sound Textures through Wavelet Tree Learning (1)

Definition by Dubnov et al., Hebrew University,
Jerusalem, Israel [1]:
“We can describe sound textures as a set of
repeating structural elements (sound grains) subject
to some randomness in their time appearance and
relative ordering but preserving certain essential
temporal coherence and across-scale localization.”
Ex: “. . . natural and artificial sounds such as rain, a
waterfall, traffic noises, people babble, machine
noises, and so on.”
Fundamental assumption: “. . . the sound signals are
approximately stationary at some scale” (?).
Comment: this is according to a precise analytical
tool.


Gabor theory: sound is perceived as a series of
short discrete bursts of energy.
A further assumption: in a sound texture, a
statistical characterization of the joint
time-frequency and/or time-scale relations is
possible.



Original idea developed for image (2D) and video
(3D) textures [6].
Examples on next slides extracted from:
http://www.cs.huji.ac.il/labs/cglab/papers/texsyn/2dtexsyn/

The audio (1D) version is an adaptation of the original
studies.
More work is needed to avoid silent gaps, overly
similar portions, . . .



Original texture, same size synthesized texture, 4 times larger synthesized texture.



Statistical Learning.
Estimating the stochastic source with a training
example (a “sample” of the source).
El-Yaniv algorithm: generate new random
sequences that could have been generated from the
source of the sample.
The new sequences are generated by synthetic wavelet
coefficients.
The wavelet coefficients are obtained by following some
statistically constrained paths in the analysis wavelet tree.



Wavelet MRA Tree.
Using a Daubechies wavelet, an analysis tree is built.
The Daubechies wavelet was chosen because
“this wavelet has several superior properties compared to
other orthonormal wavelets, especially with respect to
translation and rotation invariance, aliasing, and robustness
due to its nonorthogonality and redundancy” (?).
Each MRA tree node stores the coefficients of the
Daubechies wavelet at a specific scale.
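
To make the tree layout concrete, here is a small sketch that uses the Haar wavelet (the simplest member of the Daubechies family) rather than the longer Daubechies filter used in [1]; that substitution, the power-of-two length requirement, and the function names are assumptions made to keep the example short.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_step(x):
    """Split x into half-length approximation and detail coefficients."""
    approx = [(x[i] + x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / SQRT2 for i in range(0, len(x), 2)]
    return approx, detail


def build_tree(x):
    """Return the tree as a list of levels, coarsest first.

    levels[0] holds the single approximation coefficient (the root);
    levels[k] for k >= 1 holds 2**(k-1) detail coefficients, and node j on
    level k has children 2j and 2j+1 on level k+1.
    """
    if len(x) < 2 or len(x) & (len(x) - 1):
        raise ValueError("length must be a power of two (>= 2)")
    levels = []
    while len(x) > 1:
        x, d = haar_step(x)
        levels.append(d)
    levels.append(x)
    levels.reverse()
    return levels


def reconstruct(levels):
    """Invert build_tree, going from the root down to the finest scale."""
    x = levels[0]
    for d in levels[1:]:
        x = [v for a, b in zip(x, d) for v in ((a + b) / SQRT2, (a - b) / SQRT2)]
    return x


# Toy signal of length 8 (illustrative data only).
signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
levels = build_tree(signal)
```

A path from the root to a leaf then collects one coefficient per scale, all covering the same time position, which is the structure the learning step works on.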



Learning.
Each coefﬁcient depends on its scale ancestor
(upper level) and temporal predecessor (those to its
left).
Using an algorithm by El-Yaniv [1], the conditional
probability along the tree path (scale) can be learnt.
A second learning step uses the neighboring nodes
(time) to preserve the time structure.



Synthesizing.
Thus, the signal can be viewed as a collection of
paths from the root of the tree toward the leaves.
The goal is to generate a new tree whose paths are
typical sequences generated by the same source, by
creating new (candidate) nodes (children) for a node
v_i.
First, the algorithm copies the root and the level-1 nodes
into the new tree.
Now assume that the first i levels of the new tree have
already been generated. To generate the next level we must
add two child nodes to each node v_i in level i.
The algorithm searches among all nodes at the i-th level of the
analysis tree for nodes w_i with maximal-length ε-similar (El-Yaniv; ε is
a user threshold) path suffixes w_{i−1}, w_{i−2}, . . ., w_j.



Synthesizing.
Among these candidates, the algorithm looks for those nodes
whose k predecessors (k is a user parameter; the nodes
to the left on the same level) resemble those of the children of v_i.
The algorithm then randomly chooses one candidate and
copies the values of its children to the children of v_i.
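
The level-by-level procedure can be sketched on a toy tree. This is a simplification under several assumptions, not the algorithm of [1]: level l is taken to hold exactly 2**l coefficients, ε-similarity is approximated by a per-node absolute difference, and the temporal check compares the left neighbours of the nodes themselves rather than those of their children. All names are hypothetical.

```python
import random


def similar_depth(new, orig, level, j, q, eps):
    """Number of consecutive ancestors (starting at the node itself) within eps."""
    depth = 0
    for t in range(level + 1):
        if abs(new[level - t][j >> t] - orig[level - t][q >> t]) > eps:
            break
        depth += 1
    return depth


def predecessors_match(new, orig, level, j, q, k, eps):
    """True if the (up to) k left neighbours of node j and node q are within eps."""
    for t in range(1, k + 1):
        if j - t < 0 or q - t < 0:
            break
        if abs(new[level][j - t] - orig[level][q - t]) > eps:
            return False
    return True


def synthesize(orig, eps=0.1, k=2, rng=random):
    """Grow a new tree from orig, where orig[l] holds 2**l coefficients."""
    new = [list(orig[0]), list(orig[1])]  # copy the root and level 1
    for level in range(1, len(orig) - 1):
        nxt = [0.0] * len(orig[level + 1])
        for j in range(len(orig[level])):
            depths = [similar_depth(new, orig, level, j, q, eps)
                      for q in range(len(orig[level]))]
            best = max(depths)
            cands = [q for q, d in enumerate(depths) if d == best]
            timed = [q for q in cands
                     if predecessors_match(new, orig, level, j, q, k, eps)]
            q = rng.choice(timed or cands)  # fall back if the time check rejects all
            nxt[2 * j] = orig[level + 1][2 * q]
            nxt[2 * j + 1] = orig[level + 1][2 * q + 1]
        new.append(nxt)
    return new


# Toy analysis tree with random coefficients (illustrative data only).
_rng = random.Random(1)
orig = [[_rng.uniform(-1, 1) for _ in range(2 ** l)] for l in range(5)]
new_tree = synthesize(orig, eps=0.5, rng=random.Random(0))
```

With a threshold of zero only the original node matches its own path, so the sketch reproduces the input tree; larger thresholds admit more candidates and therefore more variation.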



Comments.
Theoretical approach: very interesting mathematical
background.
Experimental results: silent gaps and pattern
repetitions.
Results with images are better, but probably because
image perception is different from audio perception.


Sound examples

Synthesizing Sound Textures through Wavelet Tree
Learning.
Baby crying.
Shores.
Traffic jam.
Their textures have a strong rhythmical or temporal articulation.
All the examples show the same problem: the macro-tiles
seem to be generated by the same set of “randomly” chosen
coefficients, resulting in unnatural-sounding repetitions.


Sound Texture Modelling with CTFLP (1)

Definition by Athineos and Ellis, Columbia
University, U.S.A. [5]:
“. . . we look at a third class of sounds we call sound
textures that are distinct from speech and music.”

“. . . textures should have an undetermined extent
(duration) with consistent properties (at some level),
and be readily identifiable from a small sample.”



Idea: to model texture as rapidly-modulated noise
by using two linear predictors in cascade (a).
The first, operating in the time domain, is a normal
LPC analysis and captures the spectral envelope.
The second, in the frequency domain (operating on
the residual of the previous LPC analysis), captures
the time envelope, i.e. the time structure.
Textures can be synthesized using filtered
Gaussian noise: the noise feeds the cascade of filters
whose coefficients were obtained by the analysis
of the original texture sample.

(a) A nearly identical idea can be found in [7].
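
As a partial illustration, the sketch below covers only the first, time-domain stage: an LPC fit by the autocorrelation method with the Levinson-Durbin recursion, the inverse filter that yields the residual, and resynthesis by driving the all-pole filter with Gaussian noise. The second, frequency-domain predictor is not shown, and the function names and the test signal are assumptions.

```python
import random


def autocorr(x, order):
    """Autocorrelation of x for lags 0..order."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x))) for lag in range(order + 1)]


def levinson(r, order):
    """Solve the LPC normal equations; returns (a, err) with a[0] == 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err


def residual(x, a):
    """Inverse (FIR) filter: e[n] = sum_j a[j] * x[n - j]."""
    return [sum(a[j] * x[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(x))]


def synthesize(a, gain, n, rng):
    """All-pole filter driven by Gaussian noise: y[t] = gain*w[t] - sum_j a[j]*y[t-j]."""
    y = []
    for t in range(n):
        acc = gain * rng.gauss(0.0, 1.0)
        for j in range(1, len(a)):
            if t - j >= 0:
                acc -= a[j] * y[t - j]
        y.append(acc)
    return y


# Toy "texture": a second-order autoregressive process (illustrative data only).
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(5000):
    x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))

a, err = levinson(autocorr(x, 2), 2)
gain = (err / len(x)) ** 0.5
e = residual(x, a)
y = synthesize(a, gain, 1000, rng)
```

On this toy signal the fitted coefficients should land close to the ones used to generate it, and the residual `e` is the signal a second predictor would then model.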



CTFLP analysis (top) and synthesis (bottom) block diagrams.


References (1)

[1] Dubnov, S.; Bar-Joseph, Z.; El-Yaniv, R.; Lischinski, D.; Werman,
M.: Synthesizing sound textures through wavelet tree learning.
IEEE Computer Graphics and Applications, vol. 22, no. 4,
pp. 38-48 (July-Aug. 2002).
[2] Dubnov, S.; Tishby, N.: Analysis of sound textures in musical and
machine sounds by means of higher order statistical features.
IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP-97), vol. 5, pp. 3845-3848
(21-24 April 1997).
[3] Parker, J.R.; Behm, B.: Creating audio textures by example: tiling
and stitching. IEEE International Conference on Acoustics, Speech,
and Signal Processing (ICASSP ’04), vol. 4,
pp. iv-317 - iv-320 (17-21 May 2004).


References (2)

[4] Norris, M.; Denham, S.: Sound texture detection using Self
Organizing Maps. Centre for Theoretical and Computational
Neuroscience, University of Plymouth, UK (Nov. 2003).
[5] Athineos, M.; Ellis, D.P.W.: Sound texture modelling with linear
prediction in both time and frequency domains. IEEE International
Conference on Acoustics, Speech, and Signal Processing
(ICASSP ’03), vol. 5, pp. 648-51 (6-10
April 2003).
[6] Bar-Joseph, Z. et al.: Texture Mixing and Texture Movie Synthesis
Using Statistical Learning. IEEE Trans. Visualization and Computer
Graphics, vol. 7, no. 2, pp. 120-135 (Apr.-Jun. 2001).
[7] Zhu, X.L.; Wyse, L.: Sound texture modeling and time-frequency
LPC. Proceedings of the Conf. on Digital Audio Effects (DAFX-04),
Naples, Italy (5-8 October 2004).

