I have data from a lab experiment comparing the theoretical value of variable A against its actual, experimentally derived value, over a range of A.

I've plotted the data as scatter plots and have an x=y line to demonstrate what would have been 100% accurate. I'd like to quantify this accuracy using R-squared.

minimal dataset:

library(ggplot2)

df <- data.frame(pred = c(-20, -10, 0, 10, 20), real = c(-15, -7, 2, 8, 22))

plot <- ggplot(df, aes(real, pred)) +
  geom_point(color = "red", size = 4, shape = 4) +
  geom_abline(color = "black", size = 0.6)

plot

plot showing real v. predicted

Usually when I'm assessing the accuracy of a model I look at the R-squared in the summary, but since both the real and predicted values were found experimentally, I can't do that here.

I feel like my best bet is to find the R-squared of these points against the line x = y, but I can't find anywhere how to do that without going through lm(), and surely any fitted coefficients would interfere with the accuracy measure I'm looking for?

I also tried defining R-squared manually as follows:

rsq <- function(x, y) {
  rsq_score <- cor(x, y)^2
  return(rsq_score)
}

score <- rsq(df$pred,df$real)

which returns:

[1] 0.9985488

Which looks okay, except that if I multiply the real values by 10 I get the same outcome, which is why I really want to tie it to x = y:

df <- data.frame(pred = c(-20, -10, 0, 10, 20),
       real = c(-150, -70, 20, 100, 200))

score <- rsq(df$pred,df$real)

which also returns:

[1] 0.9985488

  • Be careful with the interpretation when you aren’t finding the line through a least squares fit, however. You might be more interested in a measure like mean absolute deviation, which measures the average amount by which predictions differ from observed values (slightly different from root mean squared deviation/error). – Dave Oct 12 '21 at 11:00
  • Because the squared distance from a point $(x,y)$ to the diagonal line is simply $(x-y)^2/2,$ just compute these values and average them to obtain a mean squared difference. You do not want to divide by some "variance" to create an $R^2$ value because the result will be deceptively high and will depend somewhat arbitrarily on the range of experimental values, making it impossible to make valid comparisons between any two $R^2$ values. – whuber Oct 12 '21 at 14:54
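The distance-based measures suggested in the comments above can be sketched in R using the question's minimal dataset (the variable names are my own):

```r
# Accuracy measures relative to the line y = x, on the question's data
df <- data.frame(pred = c(-20, -10, 0, 10, 20),
                 real = c(-15, -7, 2, 8, 22))

mad  <- mean(abs(df$real - df$pred))   # mean absolute deviation: 2.8
msd  <- mean((df$real - df$pred)^2)    # mean squared deviation: 9.2
rmsd <- sqrt(msd)                      # root mean squared deviation: ~3.03

# whuber's per-point squared distance to the diagonal, (x - y)^2 / 2,
# averaged over the points: half the mean squared deviation
mean((df$real - df$pred)^2 / 2)        # 4.6
```

Unlike cor(x, y)^2, these measures change when the real values are rescaled, which is the behaviour the question is after.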

1 Answer


$R^2$, aka the coefficient of determination, has the following meaning:

$$R^2 = 1 - \frac{SS_{residual}}{SS_{total}},$$ where $SS$ denotes a sum of squares: the residual sum of squares divided by the total sum of squares. A more meaningful way is to put the number of observations $n$ in both the numerator and the denominator, giving the sample estimate of the residual variance divided by the total variance.

Since your model is fixed to $y = 0 + 1 \cdot x$, you have $SS_{residual} = \sum_{i=1}^n (y_i - x_i)^2$, while the total sum of squares is $SS_{total} = \sum_{i=1}^n (y_i - \bar{y})^2$.
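A minimal R sketch of this fixed-model $R^2$, using the question's data (the function name rsq_identity is my own):

```r
# R^2 with the model fixed at y = x: no coefficients are fitted,
# so, unlike cor(x, y)^2, the score is sensitive to rescaling
rsq_identity <- function(pred, real) {
  ss_res <- sum((real - pred)^2)        # residuals against the line y = x
  ss_tot <- sum((real - mean(real))^2)  # total sum of squares
  1 - ss_res / ss_tot
}

df <- data.frame(pred = c(-20, -10, 0, 10, 20),
                 real = c(-15, -7, 2, 8, 22))

rsq_identity(df$pred, df$real)       # 0.9429280
rsq_identity(df$pred, df$real * 10)  # 0.1836228 -- scaling now matters
```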

rapaio
  • This is not the $R^2$ described in the question. – whuber Oct 12 '21 at 14:51
  • I read the question multiple times and I still believe the OP asked about the coefficient of determination. But this could be clarified, I suppose. – rapaio Oct 12 '21 at 23:31