
I've seen a few publications that feed an RGB image of a spectrogram to a neural net, and someone claiming a network does better with RGB than with grayscale or the raw spectrogram.

A spectrogram is fundamentally a 2D representation with each point being a non-negative real value. Converting it to RGB adds no information. Worse, it introduces a dependence on choice of colormap, which is just noise${}^{1}$. It's worse than making grayscale images RGB, as it breaks a spectrogram's spatial dependencies by splitting into channels.

Why would a spectrogram saved as RGB ever outperform a raw spectrogram?


Clarification: originally I didn't realize this, but "RGB image" implies "image", meaning it involves a conversion step that compresses and reshapes the raw spectrogram. Additionally, it's not just any RGB in the sense of $\text{R} \approx \text{G} \approx \text{B}$, but a color mapping for intensity heatmaps, like turbo: plt.imshow(np.arange(9)[None], cmap='turbo').

It's possible the image isn't reshaped, in which case there's no compression; but even if a colormap isn't specified, that doesn't mean there's no color mapping: what matters is how the array values of the raw spectrogram compare to those decoded from the image and fed to the NN.
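For concreteness, here's a minimal sketch of the conversion pipeline I have in mind (toy array, assuming matplotlib; the exact resizing/encoding steps vary by publication):

```python
import numpy as np
import matplotlib.pyplot as plt

# toy "spectrogram": a 2D array of non-negative reals
spec = np.abs(np.random.randn(128, 256))**2

# normalize to [0, 1]; the absolute scale is already discarded here
spec01 = (spec - spec.min()) / (spec.max() - spec.min())

# apply an intensity colormap; output is (H, W, 4) RGBA floats
rgb = plt.get_cmap('turbo')(spec01)[..., :3]    # drop alpha -> (H, W, 3)

# "save as image": quantize to 8 bits (and possibly resize / JPEG-encode),
# which is where compression and reshaping losses enter
rgb_u8 = (rgb * 255).astype(np.uint8)
print(spec.shape, '->', rgb_u8.shape)           # (128, 256) -> (128, 256, 3)
```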

1: that was my impression at the time; it's the case with $R \approx G \approx B$, but otherwise definitely not. Depending on the colormap, it can still be noise (or worse), though.


Example pub with good results, but there's reason to suspect incompetence, per e.g. "[1356x1071] images were lossless scaled to 32x32", which is impossible. There's no comparison with a grayscale approach, so we can't tell whether RGB outperformed it.


There are some "trivial" explanations I'll list to avoid answers containing them:

  1. Transfer learning: using nets pretrained on RGB
  2. Architectures tailored specifically to maximize RGB utility

While these are valid explanations, they're no evidence that RGB is inherently better.

  • Can you [edit] your post to share citations to these publications? – Sycorax Jan 02 '22 at 21:30
  • @Sycorax Only one source I could track down, unfortunately. – OverLordGoldDragon Jan 02 '22 at 21:43
  • I suspect the reason may be that they were using deep neural networks with pre-trained feature extraction layers that had been pre-trained on RGB images? Caveat: I am certainly not an expert on deep neural networks, better on shallow ones! – Dikran Marsupial Jan 02 '22 at 22:09
  • @DikranMarsupial Yes, I suspect this as a "trivial" explanation - I'll make a short list. The linked pub doesn't mention pretraining though. – OverLordGoldDragon Jan 02 '22 at 22:17
  • It is my understanding that often when someone says they used a particular architecture they mean that they used that architecture pre-trained on some database (e.g. imagenet), as training it from scratch would be computationally extremely expensive, whereas it can be fine-tuned (transfer learning?) for some particular task fairly cheaply. However, as I said this isn't really my area, just read a few books. – Dikran Marsupial Jan 02 '22 at 22:40
  • Looks like they used their own architecture, but it could well be that they were following a recipe, which is (sadly for me) a very common approach for DNNs. – Dikran Marsupial Jan 02 '22 at 22:46
  • Figure 3 suggests that the DNN isn't a very good model, looks heavily over-fit to me. – Dikran Marsupial Jan 02 '22 at 22:53
  • Such a network would mean more parameters, would it not? – Dave Jan 03 '22 at 10:10
  • "saved as RGB" this might lead to confusion. I believe that your idea with this is to apply some sort of additional colour mapping when the raw data is saved as RGB. But in general, 'saving as RGB' just means to split the gray-scale layer into three red, green and blue layers. – Sextus Empiricus May 13 '23 at 14:42
  • @SextusEmpiricus Thanks, let me know if there's other such details or if it's still confusion-prone. – OverLordGoldDragon May 13 '23 at 14:48
  • "a conversion step that compresses and reshapes the raw spectrogram." Can you elaborate on what you mean by this. A raw spectrogram is already an image and can be safed as a tiff file or something else. Why would using a format in RGB relate to compression? An RGB format can actually increase the precision as it is a a data format with a larger capacity of information (at least the file size is bigger). – Sextus Empiricus May 13 '23 at 14:52
  • "It's worse than making grayscale images RGB, as it breaks a spectrogram's spatial dependencies by splitting into channels." By recombining the channels, the original information can be retrieved. If there is a loss of information due to the colormapping, then it is only some slight rounding of errors. The distortion is not big. "An analogous question could be, 'why does this algorithm perform better when I apply a log transform, doesn't the transform distort the information?'" using the transform is effectively like using a different type of triggerfunctions in the neural network. – Sextus Empiricus May 13 '23 at 14:58
  • @SextusEmpiricus I edited in points of confusion that don't relate to understanding of spectrograms, as that's separate. I'll reply about the rest in comments: "a raw spectrogram is already an image" only in the sense that any 2D array is "already an image". "Why would using a format in RGB relate to compression" due to reshaping. "larger capacity of information" this of course doesn't imply more information in synthesis/reconstruction sense, only generative methods or additional inputs can do that. It's purely a conversion like y = sqrt(x). – OverLordGoldDragon May 13 '23 at 15:02
  • @SextusEmpiricus "By recombining the channels, the original info can be retrieved" by inverting the spectrogram, the original signal can be retrieved. Doesn't say much; "Analysis vs synthesis" here is relevant. And the kind of color mapping that's used certainly makes a big difference - as noted in my answer, it dedicates a separate channel to peaks, which correspond to amplitude and frequency modulation maxima. – OverLordGoldDragon May 13 '23 at 15:07
  • @OverLordGoldDragon my last comments relate to statements like "it breaks a spectrogram's spatial dependencies by splitting into channels." (The network can reconstruct this) and "dependence on choice of colormap, which is just noise" (the noise is not big, there is no important loss of information asside from roundoff-errors, and it is just a different representation) – Sextus Empiricus May 13 '23 at 15:09
  • @SextusEmpiricus You are mistaken. "Can recover" != no effect. As elaborated in my previous comment's reference, the whole point of feature engineering is to make the job easier for the NN by achieving desired properties of representation, otherwise just feed the raw input, in fact straight to a linear classifier. Breaking up into non-overlapping channels is a significant nonlinearity. Also I was very wrong with "just noise", I was thinking what you were thinking with R~G~B. Edited that also. – OverLordGoldDragon May 13 '23 at 15:14

3 Answers


A less trivial explanation can be that converting gray-scale to RGB is effectively adding a layer of ReLU neurons with fixed parameters.

For example, converting an image to RGB using the viridis colour map applies something similar to three piecewise linear functions, each of which can be composed out of ReLU functions.

[figure: example of applying a conversion similar to the viridis colour map]

This addition has the effect of increasing the depth (extra layer) and width (potential extra neurons in subsequent layers) of the neural network. Both effects can potentially improve the performance of the model (if its current depth and/or width was not sufficient).
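A minimal sketch of the colour-map-as-fixed-ReLU-layer idea (assuming matplotlib; the knot and slope values below are made up purely for illustration): each colour-map channel is a fixed 1D curve of the gray value, and a piecewise linear curve can be written as a sum of shifted ReLUs.

```python
import numpy as np
import matplotlib.pyplot as plt

# each colour-map channel is a fixed 1D function of the gray value
x = np.linspace(0, 1, 256)
R, G, B = plt.get_cmap('viridis')(x)[:, :3].T    # three fixed curves of x

def relu(z):
    return np.maximum(z, 0.0)

def piecewise_linear(x, knots, slopes, bias=0.0):
    # f(x) = bias + sum_k slopes[k] * relu(x - knots[k])
    return bias + sum(s * relu(x - k) for k, s in zip(knots, slopes))

# crude, made-up 3-knot approximation of the green channel, i.e. a tiny
# fixed "ReLU layer" acting on the gray value
G_approx = piecewise_linear(x, knots=[0.0, 0.4, 0.8], slopes=[0.6, 0.5, 0.4])
print(np.abs(G_approx - G).max())    # rough fit; illustration only
```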


Width

A simple example is converting a single grayscale channel to three rgb channels by simply copying the image three times.

This can effectively act like a form of ensemble learning.

Your neural network or decision tree may converge to different patterns on the different channels, which can later be merged, e.g. averaged, by a final layer or classification boundary.

Alternatively, you could see it as effectively making several of the hidden layers three times wider (but not fully connecting them, adding only three times as many connections). This can create some potential for different training and convergence, which is potentially better.
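As a rough sketch of this "more connections, same information" point (toy sizes, made up here):

```python
import numpy as np

gray = np.random.rand(64, 64)                  # toy grayscale image

# simplest gray -> rgb conversion: copy the single channel three times
rgb = np.repeat(gray[..., None], 3, axis=-1)   # (64, 64, 3)

# a first conv layer with k x k kernels now has 3x the input weights per
# filter (plus bias), i.e. the effective width at the input is tripled
k, n_filters = 3, 16
params_gray = n_filters * (k * k * 1 + 1)      # 160
params_rgb  = n_filters * (k * k * 3 + 1)      # 448
print(params_gray, params_rgb)
```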

Depth

The additional colour-mapping layer may allow creating patterns that are not possible with fewer connections. The flexibility is increased.

The simplest example is an image of a single pixel that passes through a layer with a single neuron with a step function (so this is an example where even the number of neurons remains the same and the width of the subsequent network is not changed).

  • For BW, this is a two-parameter function (weight $w_1$ and bias $b$) that effectively makes a classification based on whether the input is above or below some level.
  • For RGB, we get two additional parameters, $w_2$ and $w_3$, for the extra channels, and this makes it possible to create more patterns. For example, we can make a classification when the grayscale pixel has either a high or a low value.

Obviously one can achieve the same without converting to rgb, by instead adding more neurons or an additional layer.

  • But possibly the cases where the rgb performed better did not test this out.

  • Also, the conversion to rgb, using some useful scale, makes a hardcoded separation into shadows, middle tones and highlights, which an NN would otherwise need training and extra neurons for.

    (So in a way it is adding an extra layer which is regularised. It also adds pre-trained information through the human decision to choose one particular colour map over another; i.e. the human chooses the trigger points of the ReLU layer, so the conversion to rgb carries additional information.)

Anyway, this simple example is a case where it is possible to prove that rgb can perform better (if we compare with a limited model, like only a fixed number of neurons and layers).

[figure: example of the 1-pixel, 1-neuron model]
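A tiny numerical sketch of the one-pixel, one-neuron example (turbo, the question's colour map, used as the fixed mapping; the weights below are made up): on the raw gray value, a single step-function neuron can only split "below vs above a threshold", while on the colour-mapped channels it can express "low or high vs middle".

```python
import numpy as np
import matplotlib.pyplot as plt

def step(z):
    return (z > 0).astype(float)

x = np.linspace(0, 1, 101)           # the single gray pixel's value

# grayscale neuron: step(w1*x + b) is monotone in x -> only a threshold split
out_gray = step(1.0 * x - 0.5)

# colour-mapped neuron: the fixed turbo curves make the pre-activation
# non-monotone in x, so "low or high vs middle" becomes expressible
r, g, b = plt.get_cmap('turbo')(x)[:, :3].T
out_rgb = step(1.0 * r - 1.5 * g + 1.0 * b + 0.2)    # weights made up

print(out_gray[[5, 50, 95]])    # [0. 0. 1.] -- a single threshold
print(out_rgb[[5, 50, 95]])     # [1. 0. 1.] -- low and high vs middle
```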

  • I don't see how this addresses RGB. All of this can be said of breaking up a 0-1 normed spectrogram into 0-to-0.33, 0.33-to-0.66, 0.66-to-1, which can be said of any 0-1 normed input, which can be said of any unnormed input replacing 0 and 1 with min and max. – OverLordGoldDragon May 07 '23 at 12:45
  • @OverLordGoldDragon if you are converting gray scale to rgb, isn't this just the same as feeding multiple copies of the same image to the algorithm? An algorithm which will now need three times more nodes in the first layers. – Sextus Empiricus May 07 '23 at 15:51
  • @OverLordGoldDragon it can be more general, and what I said may also apply to other manipulations that increase the size of the input data without adding information. The fact that what I said here also applies more generally doesn't mean that it doesn't address reasons why conversion of gray-scale to RGB can change (and possibly improve) performance of machine learning algorithms. – Sextus Empiricus May 07 '23 at 16:03
  • "same as feeding multiple copies" that's far from the case. A red ball won't appear in G and B. Conversion is also strongly colormap-dependent, which is why there's fuss over jet. My point is that as written, your explanation is generic, while both the "RGB" and "save as image" are very specific. – OverLordGoldDragon May 07 '23 at 16:07
  • As for increasing model size, that can be done just by increasing model size, or (as you said) duplicating the input, meaning RGB/saving is nothing special. So your description may be technically correct and a fair side-point, but missing the root of the phenomenon. But that's the best case, as it assumes a non-detrimental colormap and limited image conversion losses, and both are big assumptions, especially latter for timeseries, where the time axis may be compressed unsafely by a thousandfold. – OverLordGoldDragon May 07 '23 at 16:08
  • @OverLordGoldDragon "A red ball won't appear in G and B" Maybe I misinterpret by what you mean with converting gray-scale to RGB. If I take a gray-scale image of a red ball, and convert that to RGB, then the red ball will appear the same in all color channels (yeah, depending on the color gamut mapping there might be slight discrepancies between the channels. But describing the effects of the simple case where every color channel has just the same value as the gray-scale channel already shows how what the major causes can be). – Sextus Empiricus May 07 '23 at 17:12
  • Sorry, bad example. In a colormap where red is maximum, spectrogram peaks will be absent from G and B. So for grayscale where high vs low values matter, conversion is discriminative, not duplicative. – OverLordGoldDragon May 07 '23 at 17:12
  • I don't follow your last comment. You are not speaking about a colormapping where the values are mapped like $$\text{red}(x,y) \approx \text{bw}(x,y), \quad \text{green}(x,y) \approx \text{bw}(x,y), \quad \text{blue}(x,y) \approx \text{bw}(x,y),$$ but instead something greatly different? (Even if this is not the case and the values are not approximate, isn't there still a case of increasing the size of the input data and, with it, the size of the model?) – Sextus Empiricus May 07 '23 at 17:17
  • "meaning RGB/saving is nothing special. So your description may be technically correct and a fair side-point, but missing the root of the phenomenon" Unless the root of the problem is that the observed effect of converting B&W to RGB is effectively the increase of the model size. – Sextus Empiricus May 07 '23 at 17:21
  • I'm not familiar with common cmapping conventions so it seems I'm miscommunicating. The cmaps I have in mind will map low values to one color and high values to another, so your equations certainly don't hold; try plt.imshow(np.arange(10)[None], cmap='turbo'). If your equations hold, then your points are much more pertinent. But without equality, the explanation remains incomplete, though I certainly don't know how much it's missing. STFT values have strict interpretations - if they're tampered, it colossally changes the equivalent input it represents. Anyway, don't need RGB for repeat(x). – OverLordGoldDragon May 07 '23 at 17:33
  • @OverLordGoldDragon I see now that you are describing a different more complex mapping than I had imagined, but you aren't still converting a single channel to multiple channels? Or is your black and white image that you speak about already a 3-channel rgb image?(where red green and blue channels have the same values) – Sextus Empiricus May 07 '23 at 18:14
  • If R, G, and B have the same values then my talk goes out the window. I thought all this was clear so it's good to be checked, I'll update my answer when I get the chance. What is known though is such colormapping is the standard when dealing with time-frequency representations, and it's what my referenced papers used - clearly the story's different for "actual" images, so good to convey to a wider audience. – OverLordGoldDragon May 07 '23 at 18:19
  • @OverLordGoldDragon "If R, G, and B have the same values then my talk goes out the window" the point in my answer was explained with the assumption that RGB have equal values (ie when the transformation from BW to RGB is the identity function), and then you might already see improvements for the two reasons that I mentioned. When the RGB channels are different due to transformations that are slightly more complex functions, then the principles from my answer still hold. You are effectively increasing the size of the model. – Sextus Empiricus May 07 '23 at 18:23
  • Yes, sure. I'm not saying your answer is invalid, but again, it's more a side-point: imagine scaling the image by x *= abs(randn). The discussion then is clearly elsewhere, but of course it's not that extreme here. So your answer looks good to me with a disclaimer on incompleteness per this discussion - "assuming R=G=B", "the main explanation is elsewhere but these also play a role", etc. – OverLordGoldDragon May 07 '23 at 18:32
  • @OverLordGoldDragon my answer doesn't need to assume "R=G=B" in order for the explanations to work. The important assumption is that you go from 1 channel to 3 channels. The explanation of the effects of this increase in channels are just more easily imagined when "R=G=B", but the effects do not disappear when we use other transformations for the 1 channel to 3 channel transformation. I wouldn't say with confidence that this is a sidepoint. You would need to compare the conversion from BW to color while going from 3 channels to 3 channels in order to say whether there is something more. – Sextus Empiricus May 07 '23 at 18:42
  • Right. I meant it more like "below is the main explanation for RGB when converting greyscale image (note, != original spectrogram) to RGB, if R=G=B". (the "note" is for compression) -- Ok, then we can disagree on the side point status. – OverLordGoldDragon May 07 '23 at 18:49
  • @OverLordGoldDragon in my comments I have placed too much stress on the idea of using a colour mapping where R=G=B. What I considered instead is more general and that the use of more data (without increasing information) is associated with a change of the neural network being larger. This is not dependent on the R=G=B colour mapping, and I have changed my answer to make this more clear. – Sextus Empiricus May 08 '23 at 08:12
  • Eh, sorry but this answer is simply a bunch of post-hoc ML talk. The most valuable part is on ReLU, but not sufficient. The claimed "justifications" can be reused upon real and imaginary parts of complex-valued STFT, or synchrosqueezed spectrogram, for example, yet I guarantee that'd degrade performance. The explanation must account for the transform. And still no disclaimer on the "save as image" part. Anyway I think I decided I won't be elaborating further on this network, it'd be a wasted effort. – OverLordGoldDragon May 12 '23 at 01:23
  • @OverLordGoldDragon If you would give an example in your question of better performance due to the converting a black and white image to rgb and what 'save as image' means in this conversion, then it might be easier to write a better answer. Also, the justifications in this answer are not claimed to always work, but only when the network is not sufficiently wide or deep. – Sextus Empiricus May 12 '23 at 05:03
  • Hmm... I see now that the question reads as only gray->RGB. When I was writing, by "grayscale" I meant not colormapping or saving as image at all, i.e. spectrogram is already grayscale due to being non-negative. That or I didn't even realize the distinction. I think it's still a valid question with my answer being valid, as it was a genuine point of confusion that I had and others are likely to have, and a good answer will point out what OP is missing from their analysis. Still, it's also fair to answer it only as gray->RGB. I've edited, but it doesn't invalidate your or other partial answers. – OverLordGoldDragon May 13 '23 at 14:14

I do not have very "hard" evidence, but I have a publication under review where we have trained ResNet50 to regress some values from noisy spectrograms.

  • Pretraining on ImageNet is better than starting from random initialization
  • For pretrained networks, using color spectrograms is better than grayscale spectrograms (normalized to 0-1)

All that I have is comparative experiments in a couple of datasets, so take it or leave it :)

  1. Colormapping is nonlinear filtering. A color map is simply a transform; the breakup into three dimensions further interprets it as filtering and decomposition. turbo is preferable to jet for inspection (1 -- 2 -- 3) - which is to say, the mapping isn't arbitrary, and the human visual system favors it. In turbo (or jet), as one use case, we can quickly skim an image for peaks, which will be red, and we may wish to focus only on those - that's identical to the "R" channel (see the sketch after this list).

  2. "Image" involves efficient (and nonlinear) compression. The standard approach to STFT compression is direct subsampling (i.e. hop_size), which aliases. An improvement is decimation, i.e. lowpass filtering + subsampling, which is a linear compression. If something so simple was effective, there'd be no need for all the sophistication of JPEG. In ML terms, we can view "save as JPEG" as a highly effective autoencoder, also effective dimensionality reduction.

There's more to say but I'll just share the main points for now.

Note that this is completely separate from using image-excelling NNs on STFT images. That can be detrimental.

Also, @Ghostpunk's answer is mistaken and misleading, as I commented. It may be owed to the popular "windowed Fourier transform" interpretation of STFT. Spectrogram losses can also be measured. Relevant posts:

Note

I realized the question, and my answer, are ill-suited for this network, and I may not be developing my answer further here. If I develop it elsewhere, I'll link it. In the meantime, refer to my discussion with @SextusEmpiricus.

Still self-accepting since, though elaboration is due, my answer can be understood with the right (mainly signal processing + feature engineering) background, and I believe it contains the most pertinent explanation.