Mulitplication in Frequency Domain as a Similarity Measure of Audio Clips

Question

I have an audio clip and I copy a 256-point part satarting at 10th second. Then I create a 256-point frame on the original clip. At every itreation, I calculate the DFT of the frame, multiply it with the DFT of the interval I copied, take the mean of the result and append it to a list and shift the frame 1 point right. When I graph the means, I know I should get the maximum value of convolution when the copied interval's DFT is multiplied with itself. And this is supposed to show that convolution can be used as a similarity metric. However, when I apply the explained procedure, I can't get a peak that is distinguishable from any other value. What I have done is as follows:

    import numpy as np
    import scipy.io.wavfile
    import matplotlib.pyplot as plt
rate1, data1 = scipy.io.wavfile.read('Africa.wav')

data1 = np.array(data1, dtype=np.float64)

interval_1 = data1[rate1 * 10: rate1 * 10 + 256]

dft_1 = np.fft.fft(interval_1)

cv = []

for i in range(data1.size - 256):
    dft = np.fft.fft(data1[i: i + 256])
    Y = np.multiply(dft, dft_1)
    Y = np.abs(Y)
    Y = np.mean(Y)
    cv.append(Y)

cv = np.array(cv)

plt.figure()
plt.title(&quot;Convolution with 256 point sample at 10th second&quot;)
plt.xlabel(&quot;Samples&quot;)
plt.ylabel(&quot;Amplitude&quot;)
plt.plot(cv)
plt.show()

I get the following graph:

Dan Boschen · Answer 1 · 2020-06-26T11:58:24.453

1

The fft product alone is not the convolution, but the frequency domain of the convolution. To complete the operation the OP must also take the inverse FFT to get a circular convolution result.

$$CONV = \text{ifft}(\text{fft}(a) \text{fft}(b))$$

However, similarity would be determined using correlation not convolution. To do this, simply complex conjugate one of the FFT results as follows:

$$XCORR = \text{ifft}(\text{fft}(a) \text{fft}^*(b))$$

The above is the cross-correlation function (using circular convolution). The result is the correlation of $a$ and $b$ at repeated circular shifts in time.

edited Jun 26 '20 at 11:58

answered Jun 25 '20 at 14:20

Dan Boschen

50,942
2
57
135

when I use dft_1's conjugate, use ifft and remove the np.abs() so that I'm calculating xcorr, I am getting a very similar graph. Is taking the mean wrong? – Rookie Jun 25 '20 at 15:58
And can you look at this question https://dsp.stackexchange.com/questions/54970/comparing-multiple-signals-for-similarity . It is stated that sum of multiplication of dfts can be used for similarity. – Rookie Jun 25 '20 at 16:04
1

@Rookie Test your approach using the same sequence (auto correlation) to see if you are doing things correctly. – Dan Boschen Jun 25 '20 at 16:34
@Rookie re other post—- regardless of domain (freq or time) “correlation” between any two waveforms is the process of multiply and sum (but to be very clear it must be a complex conjugate multiplication if the waveforms are complex). To the degree the waveforms are similar on a sample by sample basis, you will get a strong correlation —- if the waveforms are offset from each other in whatever domain they are in, this will change the correlation since it is sample by sample only. – Dan Boschen Jun 25 '20 at 16:40
Well, taking the mean doesn't seem to give a reliable way, since sometimes mean of auto correlation is less than correlation with another interval. – Rookie Jun 25 '20 at 18:32
@Rookie where did the suggestion to take the mean or the autocorrelation come from? The complex conjugate product and sum would be the correlation of the two frequency domain waveforms —- autocorrelation is doing the repeatably for every possible offset in samples between the two waveforms, but why would you then take the mean of that result? – Dan Boschen Jun 25 '20 at 18:50
"The complex conjugate product and sum would be the correlation of the two frequency domain waveforms" The average suggestion came from my professor, and it gives a plot with the same shape as the sum. And I tried both of them. All I do is np.abs(sum) after the summation. It results in a large value when it is multiplied with itself, but apparently at some other points in the music file, it yields other large, even larger values. – Rookie Jun 25 '20 at 19:22
@Rookie It sounds like you have convinced yourself that taking the mean like that would not do what you are trying; and isn’t anything I would have suggested. If you want to measure the similarity between two waveforms- that is correlation which I already detailed in my answer. There are applications where you may be interested in the similarity versus a delay in time - that is the cross correlation function which is a plot of the correlation versus delay between the waveforms. – Dan Boschen Jun 25 '20 at 20:34
What exactly you are trying to do in the end? Is it that you have one sound file and then a small sound bite that you are trying to find within that file? – Dan Boschen Jun 25 '20 at 20:37
Sorry if I was unable to explain. What I am trying to achieve is as you said is finding a small sound bite within a larger file. After making the corrections you suggested (taking complex conjugate and multiplying in the frequency domain, then summing up) I expect to get a peak in the graph where the bite of sound coincides with itself, and low values elsewhere. But I still get a graph as in the question. I couldn't really understand why the result is so. And also thanks for sparing time to answer. – Rookie Jun 26 '20 at 09:33
1

@Rookie whether in the time or frequency domain, you are going to have to apply this in a "sliding window" kind of way, is this understood? We are not talking about doing this "multiplication" just once. Is that clear? – A_A Jun 26 '20 at 10:16
@A_A but would have totally different implications--- in the time domain we would do a sliding window if we are looking for time alignment, but in the frequency domain we would do a sliding window if we were looking for frequency alignment (such as a Doppler shift if we suspect that may have happened). The sliding window in frequency will not provide any time shift correlation and in a typical receiver we do both (not necessarily this way but we do search for signals in both frequency and time). – Dan Boschen Jun 26 '20 at 11:55
@A_A Also thanks to your comment I made it further clear that this sliding window is already built into the FFT approach given in my answer (as a circular shift) just to not confuse that the operations I gave in my answer would need to be repeated in any sort of sliding window. – Dan Boschen Jun 26 '20 at 11:59
1

D.B. I may have not put my comment at the right place. I was refering to the OP. It is not an objection to your response. Purely going by the description provided (looking for small segment in a bigger stream), whether performed in the time or frequency domain, the correlation would have to be applied in a sliding window way. That was the message. @Rookie,I agree that you seem to be occupied with some "average".Please, don't.There is no averaging here. Just a sum that will shoot to a high value when the segments allign.Try to do the same by applying plain simple correlation first. – A_A Jun 26 '20 at 12:10
@A_A oh no problem, I wasn't being defensive but my comments are applicable in an attempt to provide helpful additions to your good comments. I think it is good to further understand the implications of "sliding" in either domain – Dan Boschen Jun 26 '20 at 12:15
D.B No worries, I understand – A_A Jun 26 '20 at 12:16

Mulitplication in Frequency Domain as a Similarity Measure of Audio Clips

1 Answers1