
For my project I have to detect whether two audio files are similar and whether the first audio file is contained in the second. I tried using librosa and numpy.correlate, but I don't know whether I'm doing it the right way. How can I detect whether one audio file is contained in another?

import librosa
import numpy
long_audio_series, long_audio_rate = librosa.load("C:\\Users\\Jerry\\Desktop\\long_file.mp3")
short_audio_series, short_audio_rate = librosa.load("C:\\Users\\Jerry\\Desktop\\short_file.mka")

for long_stream_id, long_stream in enumerate(long_audio_series):
    for short_stream_id, short_stream in enumerate(short_audio_series):
        print(numpy.correlate(long_stream, short_stream))
Jerry Palmiotto

1 Answer


Simply comparing the raw signals long_audio_series and short_audio_series probably won't work. What I'd recommend instead is audio fingerprinting: essentially a poor man's version of what Shazam does. There are of course the patent and the paper, but you might want to start with this very readable description. Here's the central image from that article, the constellation map (CM):

Constellation Map image from https://willdrevo.com/fingerprinting-and-audio-recognition-with-python/

If you don't need to scale to very many songs, you can skip the whole hashing part and concentrate on peak finding.

So what you need to do is:

  1. Create a power spectrogram (easy with librosa.core.stft).
  2. Find local peaks in all your files (can be done with scipy.ndimage.maximum_filter) to create CMs, i.e., 2D images containing only the peaks. The resulting CM is typically binary: 1 where there is a peak and 0 everywhere else. (A sketch for these two steps follows this list.)
  3. Slide your query CM (based on short_audio_series) over each of your database CMs (based on long_audio_series). For each time step, count how many "stars" (i.e., 1s) align and store the count along with the slide offset (essentially the position of the short audio in the long audio).
  4. Pick the max count and return the corresponding short audio and its position in the long audio. You will have to convert frame numbers back to seconds.
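
For steps 1 and 2, a minimal, untested sketch could look like this; note that the helper name constellation_map, the neighborhood size, and the median threshold are my own choices and will need tuning:

import librosa
import numpy as np
import scipy.ndimage

def constellation_map(path, neighborhood=(15, 15)):
    y, sr = librosa.load(path)
    # step 1: power spectrogram, shape (freq_bins, frames)
    spectrogram = np.abs(librosa.stft(y)) ** 2
    # step 2: a bin is a peak if it equals the maximum of its local
    # neighborhood; the median threshold suppresses "peaks" in silence
    local_max = scipy.ndimage.maximum_filter(spectrogram, size=neighborhood)
    cm = (spectrogram == local_max) & (spectrogram > np.median(spectrogram))
    # transpose so that dim 0 is time and dim 1 is frequency,
    # matching the slide example below
    return cm.T.astype(np.uint8), sr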

Example for the "slide" (untested sample code):

import numpy as np

scores = {}
cm_short = ...  # 2D constellation map for the short audio
cm_long = ...   # 2D constellation map for the long audio
# we assume that dim 0 is the time frame
# and dim 1 is the frequency bin
# both CMs contain only 0 or 1
frames_short = cm_short.shape[0]
frames_long = cm_long.shape[0]
for offset in range(frames_long - frames_short + 1):
    cm_long_excerpt = cm_long[offset:offset + frames_short]
    # count how many "stars" align at this offset
    score = np.sum(np.multiply(cm_long_excerpt, cm_short))
    scores[offset] = score

# pick the offset with the most aligned peaks ...
best_offset = max(scores, key=scores.get)
# ... and convert it back to seconds (this assumes the spectrogram
# was computed with librosa's default hop length of 512 samples)
best_time = best_offset * 512 / long_audio_rate

Now, if your database is large, this sliding approach will lead to far too many comparisons, and you will also have to implement the hashing scheme, which is likewise described in the article I linked to above.
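
If you do go down the hashing route, the core idea from the paper is to hash pairs of nearby peaks so that lookups become cheap. A rough, untested sketch; the name hash_peaks and the fan_out/max_dt parameters are my own choices:

import numpy as np
from collections import defaultdict

def hash_peaks(cm, fan_out=5, max_dt=50):
    # cm: binary constellation map, dim 0 = time frame, dim 1 = freq bin
    peaks = sorted(zip(*np.nonzero(cm)))  # (frame, bin) pairs in time order
    index = defaultdict(list)
    for i, (t1, f1) in enumerate(peaks):
        # pair each anchor peak with a few subsequent peaks in its
        # "target zone"; fan_out and max_dt are tuning parameters
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            if t2 - t1 > max_dt:
                break
            index[(f1, f2, t2 - t1)].append(t1)
    return index

To match, you would then look up every hash of the query in the index of the long file and histogram the differences between the stored anchor times and the query anchor times; a pronounced spike in that histogram marks a match and its offset.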

Note that the described procedure only matches identical recordings, although it allows for noise and slight distortion. If that is not what you want, please define similarity a little better, because it could mean all kinds of things (drum patterns, chord sequences, instrumentation, ...).

A classic, DSP-based way to find similarity in such features is the following: extract the appropriate feature for short frames (e.g., 256 samples) and then compute the similarity between frames. E.g., if harmonic content is of interest to you, you could extract chroma vectors and then calculate a distance between chroma vectors, e.g., the cosine distance. When you compute the similarity of each frame in your database signal with every frame in your query signal, you end up with something similar to a self-similarity matrix (SSM) or recurrence matrix (RM). Diagonal lines in the SSM/RM usually indicate similar sections.
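
To make the chroma variant concrete, here is a rough, untested sketch reusing the variables from the question; note that chroma_stft's default frame size differs from the 256 samples mentioned above, and the normalize helper is my own:

import librosa
import numpy as np

# one 12-dimensional chroma vector per frame for both signals
chroma_long = librosa.feature.chroma_stft(y=long_audio_series, sr=long_audio_rate)
chroma_short = librosa.feature.chroma_stft(y=short_audio_series, sr=short_audio_rate)

def normalize(chroma):
    # scale each frame (column) to unit length so that a plain dot
    # product between two frames equals their cosine similarity
    return chroma / (np.linalg.norm(chroma, axis=0, keepdims=True) + 1e-8)

# cross-similarity matrix: entry [i, j] is the cosine similarity between
# frame i of the long signal and frame j of the short signal; diagonal
# stripes of high values indicate similar sections
similarity = normalize(chroma_long).T @ normalize(chroma_short)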

Hendrik
  • What do you mean by "CM"? I don't have any DB – Jerry Palmiotto Aug 02 '19 at 13:34
  • CM = constellation map – Hendrik Aug 02 '19 at 13:35
  • Usually the problem is formulated as "querying a database of audio documents with a sample". If you only have one *long* file, then that's your database. Your short file is your query. – Hendrik Aug 02 '19 at 13:36
  • How can I slide my CM to match it against the query? Sorry, I'm a beginner in audio processing. – Jerry Palmiotto Aug 02 '19 at 13:40
  • Create a CM for your long document and for your short document. Using numpy slicing, create an excerpt from the long document that is as long as your short document. Then simply `np.multiply` the two images and `np.sum` the result. That's your count. Now, to *slide*, choose a different excerpt from the long CM, shifted by one frame, and so on. – Hendrik Aug 02 '19 at 13:45
  • Last question: how can I create a CM of peaks with only two audio files? – Jerry Palmiotto Aug 02 '19 at 20:03
  • Each audio file must be converted to a "constellation map" (CM)—you know, it's just a metaphor. It's really just peaks in the spectrogram. – Hendrik Aug 02 '19 at 20:06