[Tutorial] Computational Approaches to Melodic Analysis of Indian Art Music
1. Computational Approaches to Melodic Analysis of Indian Art Music
Indian Institute of Science, Bengaluru, India, 2016
Sankalp Gulati
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain
4. Tonic Identification
[Figure: spectrogram of an excerpt (time in s vs. frequency in Hz, 0–5000 Hz) and the corresponding pitch-salience plot (frequency in bins, 1 bin = 10 cents, reference 55 Hz, vs. normalized salience), with peaks labelled f2–f6 and the tonic peak marked]
Two families of approaches: signal processing and learning-based
• Tanpura / drone background sound
• Extent of gamakas on the Sa and Pa svaras
• Vādi and saṃvādi svaras of the rāga
Gulati, S., Bellur, A., Salamon, J., Ranjani, H. G., Ishwar, V., Murthy, H. A., & Serra, X. (2014). Automatic tonic identification in Indian art music: Approaches and evaluation. Journal of New Music Research, 43(1), 55–73.
Salamon, J., Gulati, S., & Serra, X. (2012). A multipitch approach to tonic identification in Indian classical music. In Proc. of the Int. Conf. on Music Information Retrieval (ISMIR) (pp. 499–504), Porto, Portugal.
Bellur, A., Ishwar, V., Serra, X., & Murthy, H. (2012). A knowledge based signal processing approach to tonic identification in Indian classical music. In Proc. of the 2nd CompMusic Workshop (pp. 113–118), Istanbul, Turkey.
Ranjani, H. G., Arthi, S., & Sreenivas, T. V. (2011). Carnatic music analysis: Shadja, swara identification and raga verification in Alapana using stochastic models. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (pp. 29–32), New Paltz, NY.
Accuracy: ~90%
5. Tonic Identification: Multipitch Approach
• Audio example
• Utilizing the drone sound
• Multi-pitch analysis
[Audio examples: vocals, drone]
Salamon, J., Gómez, E., & Bonada, J. (2011, September). Sinusoid extraction and salience function design for predominant melody estimation. In Proc. of the 14th Int. Conf. on Digital Audio Effects (DAFx-11) (pp. 73–80), Paris, France.
10. Tonic Identification: Signal Processing
• Harmonic summation (see the sketch below)
  - Spectrum considered: 55–7200 Hz
  - Frequency range: 55–1760 Hz
  - Base frequency: 55 Hz
  - Bin resolution: 10 cents per bin (120 bins per octave)
  - Number of octaves: 5
  - Maximum number of harmonics: 20
  - Squared-cosine weighting window across 50 cents
[Block diagram: bin-salience mapping → harmonic summation]
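To make these parameters concrete, here is a rough NumPy sketch of a harmonic-summation salience function (not the exact Melodia implementation): spectral peaks are mapped to 10-cent bins, and each peak adds a weighted contribution at the bins of its sub-harmonics. The harmonic decay factor (0.8) and the exact shape of the 50-cent squared-cosine window are assumptions made for illustration.

```python
import numpy as np

F_REF = 55.0                       # base frequency (Hz)
CENTS_PER_BIN = 10                 # bin resolution
N_BINS = 120 * 5                   # 120 bins/octave, 5 octaves: 55-1760 Hz
N_HARMONICS = 20                   # maximum number of harmonics
ALPHA = 0.8                        # assumed harmonic weighting decay

def hz_to_bin(f):
    """Continuous bin index of frequency f relative to F_REF (10 cents per bin)."""
    return 1200.0 * np.log2(f / F_REF) / CENTS_PER_BIN

def salience_frame(peak_freqs, peak_mags):
    """Salience of one frame from its spectral peaks (frequencies in Hz, linear magnitudes)."""
    salience = np.zeros(N_BINS)
    bins = np.arange(N_BINS)
    for f, m in zip(peak_freqs, peak_mags):
        if not (55.0 <= f <= 7200.0):            # spectrum considered: 55-7200 Hz
            continue
        for h in range(1, N_HARMONICS + 1):      # peak contributes at its sub-harmonics f/h
            b = hz_to_bin(f / h)
            if b < 0 or b >= N_BINS:
                continue
            dist = np.abs(bins - b) * CENTS_PER_BIN                    # distance in cents
            w = np.where(dist <= 50.0, np.cos(np.pi * dist / 100.0) ** 2, 0.0)
            salience += w * (ALPHA ** (h - 1)) * m                     # 50-cent squared-cosine window
    return salience
```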
11. Tonic Identification: Signal Processing
• Tonic candidate generation (see the sketch below)
  - Number of salience peaks per frame: 5
  - Frequency range: 110–550 Hz
[Block diagram: per-frame salience peaks → multi-pitch histogram]
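A minimal sketch of this step, assuming per-frame salience vectors from the previous stage (it reuses N_BINS and hz_to_bin from the sketch above); the peak picker and whether peaks are accumulated by salience value or by count are assumptions here.

```python
import numpy as np
from scipy.signal import find_peaks

def multipitch_histogram(salience_frames):
    """Accumulate a multi-pitch histogram from a sequence of per-frame salience vectors."""
    hist = np.zeros(N_BINS)
    lo, hi = hz_to_bin(110.0), hz_to_bin(550.0)      # tonic candidate range: 110-550 Hz
    for sal in salience_frames:
        peaks, _ = find_peaks(sal)
        if peaks.size == 0:
            continue
        top = peaks[np.argsort(sal[peaks])[-5:]]     # 5 highest salience peaks per frame
        for b in top:
            if lo <= b <= hi:
                hist[b] += sal[b]                    # accumulate by salience value (assumption)
    return hist / (hist.max() + 1e-12)               # normalized salience
```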
12. Tonic Identification: Feature Extraction
• Identifying the tonic in the correct octave using the multi-pitch histogram
• Classification-based template learning (see the sketch after the histogram figure below)
• The class of an instance is the rank of the tonic peak
[Figure: multipitch histogram (frequency bins, 1 bin = 10 cents, reference 55 Hz, vs. normalized salience) with the top peaks labelled f2–f5]
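A hypothetical sketch of the template-learning step: the features are the intervals (in bins) between the top histogram peaks and the highest peak, and the label is the rank of the peak that corresponds to the true tonic. This feature design and the decision-tree classifier are illustrative assumptions, not the exact configuration of the published systems.

```python
import numpy as np
from scipy.signal import find_peaks
from sklearn.tree import DecisionTreeClassifier

N_TOP = 5   # number of top histogram peaks treated as tonic candidates

def top_peaks(hist):
    """Bins of the N_TOP highest histogram peaks, in descending salience order."""
    peaks, _ = find_peaks(hist)
    return peaks[np.argsort(hist[peaks])[-N_TOP:]][::-1]

def peak_features(hist):
    """Intervals (in bins) of the candidate peaks relative to the highest peak."""
    top = top_peaks(hist)
    return top[1:] - top[0]

# Training (hypothetical): each recording contributes one feature vector; the label
# is the rank (0..N_TOP-1) of the annotated tonic among the candidate peaks.
clf = DecisionTreeClassifier(max_depth=4)
# clf.fit(np.vstack([peak_features(h) for h in train_hists]), train_ranks)

def predict_tonic_bin(hist, clf):
    """Histogram bin of the estimated tonic (octave resolved via the learned rank)."""
    rank = clf.predict(peak_features(hist).reshape(1, -1))[0]
    return top_peaks(hist)[rank]
```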
14. Tonic Identification: Results
Gulati, S., Bellur, A., Salamon, J., Ranjani, H. G., Ishwar, V., Murthy, H. A., & Serra, X. (2014). Automatic tonic identification in Indian art music: Approaches and evaluation. Journal of New Music Research, 43(1), 55–73.
16. Pitch Estimation Algorithms
• Time-domain approaches
  - ACF-based (Rabiner, 1977)
  - AMDF-based: YIN (de Cheveigné & Kawahara, 2002)
• Frequency-domain approaches
  - Two-way mismatch (Maher & Beauchamp, 1994)
  - Subharmonic summation (Hermes, 1988)
• Multi-pitch approaches
  - Source separation-based (Klapuri, 2003)
  - Harmonic summation: Melodia (Salamon & Gómez, 2012)

Rabiner, L. (1977). On the use of autocorrelation analysis for pitch detection. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25(1), 24–33.
de Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930.
Medan, Y., & Yair, E. (1991). Super resolution pitch determination of speech signals. IEEE Transactions on Signal Processing, 39(1), 40–48.
Maher, R. C., & Beauchamp, J. W. (1994). Fundamental frequency estimation of musical signals using a two-way mismatch procedure. The Journal of the Acoustical Society of America, 95(4), 2254–2263.
Hermes, D. J. (1988). Measurement of pitch by subharmonic summation. The Journal of the Acoustical Society of America, 83(1), 257–264.
Klapuri, A. (2003). Multiple fundamental frequency estimation based on harmonicity and spectral smoothness. IEEE Transactions on Speech and Audio Processing, 11(6), 804–816.
Salamon, J., & Gómez, E. (2012). Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1759–1770.
18. Predominant Pitch Estimation: YIN
[Block diagram: signal → auto-correlation → difference function → cumulative mean normalized difference function]
$$r_t(\tau) = \sum_{j=t+1}^{t+W} x_j\, x_{j+\tau}, \qquad (1)$$

where $r_t(\tau)$ is the autocorrelation function of lag $\tau$ calculated at time index $t$, and $W$ is the integration window size. This function is illustrated in Fig. 1(b) for the signal plotted in Fig. 1(a). It is common in signal processing to use a slightly different definition:

$$r'_t(\tau) = \sum_{j=t+1}^{t+W-\tau} x_j\, x_{j+\tau}. \qquad (2)$$

Here the integration window size shrinks with increasing values of $\tau$, with the result that the envelope of the function decreases as a function of lag, as illustrated in Fig. 1(c).

[FIG. 1: (a) example of a speech waveform; (b) autocorrelation function (ACF) calculated from the waveform in (a) according to Eq. (1); (c) the same, calculated according to Eq. (2), whose envelope is tapered to zero because of the smaller number of terms in the summation at larger $\tau$. Horizontal axis: lag (samples); the horizontal arrows symbolize the search range for the period.]
[FIG. 2: F0 estimation error rates as a function of the slope of the ACF envelope, quantified by its intercept with the abscissa. The dotted line shows errors for which the F0 estimate was too high, the dashed line those for which it was too low, and the full line their sum. Triangles at the right show error rates for the ACF calculated as in Eq. (1), i.e. $\tau_{\max} = \infty$. These rates were measured over a subset of the database used in Sec. III of the paper.]
The present article introduces a method for F0 estimation that produces fewer errors than other well-known methods. The name YIN (from the "yin" and "yang" of oriental philosophy) alludes to the interplay between autocorrelation and cancellation that it involves.
The parameter $\tau_{\max}$ allows the algorithm to be biased to favor one form of error at the expense of the other, with a minimum of total error for intermediate values. Using Eq. (2) rather than Eq. (1) introduces a natural bias that can be tuned by adjusting $W$. However, changing the window size has other effects, and one can argue that a bias of this sort, if useful, should be applied explicitly rather than implicitly. This is one reason to prefer the definition of Eq. (1).

The autocorrelation method compares the signal to its shifted self. In that sense it is related to the AMDF method (average magnitude difference function; Ross et al., 1974; Ney, 1982), which performs its comparison using differences rather than products, and more generally to time-domain methods that measure intervals between events in time (Hess, 1983). The ACF is the Fourier transform of the power spectrum, and can be seen as measuring the regular spacing of harmonics within that spectrum. The cepstrum method (Noll, 1967) replaces the power spectrum by the log magnitude spectrum and thus puts less weight on high-amplitude parts of the spectrum (particularly near the first formant that often dominates the ACF). Similar "spectral whitening" effects can be obtained by linear predictive inverse filtering or center-clipping (Rabiner and Schafer, 1978), or by splitting the signal over a bank of filters, calculating ACFs within each channel, and adding the results after amplitude normalization (de Cheveigné, 1991).
For a periodic signal with period $T$, $x_j - x_{j+T} = 0$ for all $j$; the same is true after taking the square and averaging over a window:

$$\sum_{j=t+1}^{t+W} (x_j - x_{j+T})^2 = 0. \qquad (5)$$

Conversely, an unknown period may be found by forming the difference function

$$d_t(\tau) = \sum_{j=1}^{W} (x_j - x_{j+\tau})^2, \qquad (6)$$

and searching for the values of $\tau$ for which the function is zero. There is an infinite set of such values, all multiples of the period. The difference function calculated from the signal in Fig. 1(a) is illustrated in Fig. 3(a).

[FIG. 3: (a) difference function calculated for the speech signal of Fig. 1(a); (b) cumulative mean normalized difference function. Note that the function starts at 1 rather than 0 and remains high until the dip at the period. Horizontal axis: lag (samples).]
TABLE I. Gross error rates for the simple unbiased autocorrelation method (step 1) and for the cumulated steps described in the text. These rates were measured over a subset of the database used in Sec. III of the paper. Integration window size was 25 ms, window shift was one sample, search range was 40 to 800 Hz, and the threshold (step 4) was 0.1.

Version    Gross error (%)
Step 1     10.0
Step 2     1.95
Step 3     1.69
Step 4     0.78
Step 5     0.77
Step 6     0.50
de Cheveigné, A., & Kawahara, H. (2002). YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4), 1917–1930.
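The core of YIN can be prototyped directly from the equations above: compute the difference function $d_t(\tau)$ of Eq. (6), normalize it by its cumulative mean over lags $1..\tau$ (the function shown in Fig. 3(b)), and take the first lag where the normalized difference dips below an absolute threshold. The sketch below is a minimal, unoptimized illustration of those steps only; it omits the parabolic interpolation and best-local-estimate refinements of the full algorithm, and the 0.1 threshold follows Table I.

```python
import numpy as np

def yin_f0(x, sr, fmin=40.0, fmax=800.0, threshold=0.1):
    """Estimate F0 (Hz) of a short frame x using the YIN difference function."""
    tau_min = int(sr / fmax)
    tau_max = int(sr / fmin)
    W = len(x) - tau_max                              # integration window size
    # difference function d(tau), Eq. (6)
    d = np.array([np.sum((x[:W] - x[tau:tau + W]) ** 2)
                  for tau in range(tau_max + 1)])
    # cumulative mean normalized difference, with d'(0) = 1
    cmnd = np.ones_like(d)
    running_sum = np.cumsum(d[1:])
    cmnd[1:] = d[1:] * np.arange(1, len(d)) / (running_sum + 1e-12)
    # first lag in the search range below the absolute threshold;
    # fall back to the global minimum if no dip crosses it
    below = np.where(cmnd[tau_min:tau_max] < threshold)[0]
    tau = tau_min + (below[0] if below.size else np.argmin(cmnd[tau_min:tau_max]))
    return sr / tau
```

For framewise analysis this would be applied over sliding windows, e.g. 25 ms frames shifted one sample as in Table I, or more practically a few milliseconds apart.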
20. Predominant Pitch Estimation: Melodia
21. Predominant Pitch Estimation: Melodia
22. Predominant Pitch Estimation: Melodia
Pipeline stage: audio → spectrogram → spectral peaks
23. Predominant Pitch Estimation: Melodia
Pipeline stage: spectral peaks → time-frequency salience
24. Predominant Pitch Estimation: Melodia
Pipeline stage: time-frequency salience → salience peaks → contours
25. Predominant Pitch Estimation: Melodia
Pipeline stage: contours → predominant melody contours

Salamon, J., & Gómez, E. (2012). Melody extraction from polyphonic music signals using pitch contour characteristics. IEEE Transactions on Audio, Speech, and Language Processing, 20(6), 1759–1770.
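For hands-on experimentation with this pipeline, Melodia is available as the PredominantPitchMelodia algorithm in the open-source Essentia library; the snippet below is a minimal usage sketch, where the file name and the parameter values are illustrative rather than prescriptive.

```python
import numpy as np
import essentia.standard as es

# load a mono excerpt at 44.1 kHz (replace the path with your own recording)
audio = es.MonoLoader(filename='excerpt.wav', sampleRate=44100)()

# extract the predominant melody; a hop of 128 samples gives ~2.9 ms time resolution
melodia = es.PredominantPitchMelodia(frameSize=2048, hopSize=128)
pitch_hz, pitch_confidence = melodia(audio)

# timestamps for each pitch value; 0 Hz marks frames estimated as unvoiced
timestamps = np.arange(len(pitch_hz)) * 128 / 44100.0
```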