
I am looking for a way to compare two sets of data in order to find out how similar they are to each other.

My application: I am trying to compare multiple measurement methods that all measure the sound absorption parameter of a material (foam, wall absorbers, etc.) over frequency (x-axis). The values lie between 0 and 1, so the absorption can also be described as a percentage.

One of these methods is my "reference", which produces the most trustworthy data. So when I compare another method to that reference, in order to call it "similar", the absorption coefficients in each frequency band should be as close to each other as possible (ideally zero difference).

The method should take into account not only the strength of the linear relationship (correlation) but also the absolute values of the data points. The problem with correlation in Excel is that it can report 100% even when the two curves merely run parallel to each other. I only want the result to be 100% if they are exactly the same, value by value.

For example, if my reference absorption values over frequency are A: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, then A compared to B: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6 should result in a high similarity, whereas A compared to C: 0.2, 0.3, 0.4, 0.5, 0.6, 0.7 should also indicate similarity, but not 100% (because these absorption values run parallel and are not exactly the same).

A compared to D: 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 should also result in a value that indicates similarity, but less than A compared to C, because A and D are farther apart from each other.

Does such a method exist, or can one be built in Excel or Python? So far, I found the RMSE to be close to what I need, but it does not take the shape of my data into account.
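
To illustrate the problem in Python (a minimal sketch; numpy assumed, with np.corrcoef standing in for Excel's CORREL):

```python
import numpy as np

a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])  # reference
c = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7])  # runs parallel to a

# Correlation is 1.0 even though the curves are only parallel, not equal
print(np.corrcoef(a, c)[0, 1])         # 1.0

# RMSE sees the constant 0.1 offset that correlation ignores
print(np.sqrt(np.mean((a - c) ** 2)))  # 0.1
```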

  • Welcome to cv :-). When you have removed the junk: it is not a problem of Excel - correlation is apparently simply not what you want. But your examples are too "easy" - what would you do with E: 6, 5, 4, 3, 2, 1 (just A, scrambled), or with F: 1, 2, 3, 4, 6, 8 - identical to A up to two entries? How would you define similarity then? There are zillions of ways to do that, so it would help potential answerers if you specified more precisely what you mean by similar and less similar. Also, if you always have data ordered from small to large, let us know. Best: explain the practical background. – Ute Aug 23 '23 at 20:18
  • Sorry for the junk! @Galen. – Anton Wolf Aug 25 '23 at 10:02
  • @Ute I think I found the answer to my problem using the root mean square error. With similarity I meant the absolute difference between the two compared data points (given both datasets have an equal length). Of course RMSE does not tell me if both datasets are related in any way to each other (that was nice about the correlation), but for my use case the RMSE is enough! Is there a way to combine RMSE and correlation in a useful way? – Anton Wolf Aug 25 '23 at 10:07
  • +1 to @Ute and I don't think your reply to him really answers it. To figure out which of many, many possible ways you should do this, you have to define "similarity" much more precisely. For instance, when you square errors (as RMSE does) you emphasize big errors. Is that what you want? Or are many small errors worse than one big one? Is the "shape" of the points relevant to you? Etc. – Peter Flom Aug 25 '23 at 10:34
  • Since you seem to be trying to find the best way to define similarity, and want something in between correlation and the root of squared differences, the best way to get a good recommendation out of this cv community is to describe what exactly you want to compare, and when you would consider data sets as very different or only slightly different. As @PeterFlom also explained, there are so many different ways to do it that emphasize different properties. Depending on the application they make more or less sense. – Ute Aug 25 '23 at 12:58
  • So my application: I am trying to compare multiple measurement methods that all measure the sound absorption parameter of a material (foam, wall absorbers, etc.) over frequency (x-axis). One of these methods is my "reference", which produces the most trustworthy data. So if I compare another method to that reference, in order to call it "similar", the absorption coefficients of each frequency band should be as close to each other as possible (ideally zero). – Anton Wolf Aug 25 '23 at 19:19
  • Ah, this makes sense, so 1, 2, 3, 4, 5 are true function values, and 2, 3, ... are values taken at the same frequencies? Your original example is confusing for statisticians, because the sequence of data points does not matter in most cases of statistical analysis. Could you perhaps edit your question and clarify that by explaining the background as you did in your comment? – Ute Aug 25 '23 at 19:31
  • True, if I use RMSE, I emphasize big errors. I am not really sure if I want or need that; I am more interested in the average deviation in general. Could I get around this by performing the root operation right away after each subtraction? So instead of root(sum((y1-y2)^2) / number_of_datapoints), could I just use sum(root((y1-y2)^2)) / number_of_datapoints (see the sketch after these comments)? Also, I am interested in the shape of both absorption curves. That is why I liked correlation (CORREL() in Excel) a lot in the first place, because it kind of takes into account the general course or direction of where the data goes. – Anton Wolf Aug 25 '23 at 19:35
  • @Ute Thank you for telling me, I updated the question. Yes, 1, 2, 3, 4, 5, 6 are the function values at different advancing frequencies (which exact frequencies are not so important [third-octave middle frequencies]). – Anton Wolf Aug 25 '23 at 20:17
  • Anton, good updates, I already +1'd your question :-). And you already got an answer (I have not checked yet whether it takes your changes into account, though). – Ute Aug 25 '23 at 20:19
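
The quantity proposed in the comment above, sum(root((y1-y2)^2)) / number_of_datapoints, simplifies to the mean absolute error (MAE), since root((y1-y2)^2) = |y1-y2|. A minimal sketch (numpy assumed):

```python
import numpy as np

y1 = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])  # reference A
y2 = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7])  # compared curve C

rmse = np.sqrt(np.mean((y1 - y2) ** 2))  # emphasizes big errors
mae = np.mean(np.abs(y1 - y2))           # the comment's sum(root((y1-y2)^2))/N

# Both are 0.1 here because every deviation has the same size;
# they diverge once the deviations vary in magnitude.
print(rmse, mae)
```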

1 Answer


Depending on the way you define the calculation, $R^2$ might make sense for you.

However, I would not recommend just squaring the Pearson correlation between your vectors. That retains most of the issues you see from Pearson correlation on its own. I would instead calculate according to the equation below.

$$ R^2=1-\left(\dfrac{ \overset{N}{\underset{i=1}{\sum}}\left( y_i-x_i \right)^2 }{ \overset{N}{\underset{i=1}{\sum}}\left( x_i-\bar x \right)^2 }\right) $$

The $x_i$ are your reference values. The $y_i$ are the corresponding values from other measurement systems. The $\bar x$ is the mean of the $x_i$. $N$ is the sample size.

Your comments mention that root mean squared error/deviation might make sense for your purposes. Notice how closely the numerator of that fraction is related to RMSE: square the RMSE and multiply by the sample size, and you recover the numerator.
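
Here is a minimal Python sketch of this calculation, using the A, C, and D vectors from the question (numpy and scikit-learn assumed):

```python
import numpy as np
from sklearn.metrics import r2_score

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])  # reference A
c = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7])  # parallel, offset 0.1
d = np.array([0.3, 0.4, 0.5, 0.6, 0.7, 0.8])  # parallel, offset 0.2

def r2(x, y):
    # 1 - sum((y_i - x_i)^2) / sum((x_i - xbar)^2), as in the equation above
    return 1 - np.sum((y - x) ** 2) / np.sum((x - x.mean()) ** 2)

print(r2(x, c), r2_score(x, c))  # both about 0.657
print(r2(x, d), r2_score(x, d))  # both about -0.371
```

Identical vectors (A vs. B) give exactly 1, while Pearson correlation would report 1 for all three comparisons.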

Advantages of this calculation are:

  1. It is available in Python through `sklearn.metrics.r2_score`.

  2. Because of how related this calculation is to the squared Pearson correlation of true and predicted values in OLS linear regression, this calculation arguably retains a flavor of Pearson correlation.

  3. This calculation flags terrible $y_i$ values with $R^2<0$, meaning that you would have a smaller RMSE if you just predicted $\bar x$ every time instead of doing whatever you do to get your $y_i$ values (which is, presumably, harder than running `AVERAGE(X:X)` in Excel).
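
To make point 3 concrete, a small check (numpy assumed) that a negative score really does mean a larger RMSE than the always-predict-$\bar x$ baseline:

```python
import numpy as np

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])  # reference A
d = np.array([0.3, 0.4, 0.5, 0.6, 0.7, 0.8])  # curve D, with R^2 < 0 against x

baseline = np.full_like(x, x.mean())  # "AVERAGE(X:X) every time"
rmse_baseline = np.sqrt(np.mean((x - baseline) ** 2))  # about 0.171
rmse_d = np.sqrt(np.mean((x - d) ** 2))                # 0.2
print(rmse_d > rmse_baseline)  # True, matching r2_score(x, d) < 0
```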

An extension of this might use absolute deviations instead of squared deviations. I discuss this in my answers here. Such a calculation is available in Python through `sklearn.metrics.d2_pinball_score`.

$$ D^2=1-\left(\dfrac{ \overset{N}{\underset{i=1}{\sum}}\left\vert y_i-x_i \right\vert }{ \overset{N}{\underset{i=1}{\sum}}\left\vert x_i-\bar x \right\vert }\right) $$

This has the property of giving less penalty for large misses than the $R^2$ equation given earlier. This need not be a desirable property, but it might be. For instance, one terrible measurement might wreck all downstream work, no matter how good the others are. On the other hand, you might be willing to sacrifice one point if the others can be made closer to their reference values.
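
A sketch of $D^2$ in the same spirit (numpy assumed; `d2_pinball_score` requires scikit-learn >= 1.1 and, per the edit below, uses the median rather than the mean as its baseline):

```python
import numpy as np
from sklearn.metrics import d2_pinball_score

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])  # reference A
c = np.array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7])  # curve C

def d2(x, y):
    # 1 - sum(|y_i - x_i|) / sum(|x_i - xbar|), as in the equation above
    return 1 - np.sum(np.abs(y - x)) / np.sum(np.abs(x - x.mean()))

# Both print 1/3 here; they agree because the mean and median of x coincide
print(d2(x, c))
print(d2_pinball_score(x, c))  # default alpha=0.5 compares against the median
```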

EDIT

The $D^2$ score I wrote above uses $\bar x$ in the denominator. For reasons related to what quantile regression estimates, it is reasonable to subtract the median of the $x$-values instead of the mean. I suspect this is how sklearn does the function, though what I wrote above does have a reasonable interpretation as relating to a comparison of the absolute loss incurred by your model and the absolute loss incurred by a naïve model that always predicts $\bar x$.
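
A sketch of that median-baseline variant (numpy assumed; `d2_median` is a hypothetical helper name, not a library function):

```python
import numpy as np

def d2_median(x, y):
    # Same as D^2 above, but the naive model predicts median(x) instead of
    # mean(x), matching what minimizing absolute loss actually estimates
    return 1 - np.sum(np.abs(y - x)) / np.sum(np.abs(x - np.median(x)))
```

For the question's reference values the mean and median coincide, so both versions agree; with skewed reference data they differ.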

Dave
  • Thank you so much for your answer, I think $D^2$ is really what I have been looking for! I don't see a reason to "punish" my compared measurement methods for producing values with high deviation from the reference. Therefore I will go with $D^2$ instead of $R^2$. Before I accept your answer, just let me know if there is a specific name for that $D^2$ calculation, or would you just call it a modified version of $R^2$? @Dave – Anton Wolf Aug 26 '23 at 09:50
  • @AntonWolf $D^2$ is just what sklearn calls that calculation (likely with a tweak I discuss in the edit), so that's where I got the name. It is not $D^2$ as in "Dave's $R^2$ analogue", and the documentation might give references that explain the name. // You might be interested in posting about whether your task warrants squaring the deviations. Absolute loss seems to have an easy interpretation but is less straightforward than one might hope. Stephan Kolassa has posts about this, if you search his profile for his answers. – Dave Aug 26 '23 at 13:23