4

I have tested a large sample of participants on two different tests of visual perception – now, I'd like to see to what extent performance on both tests correlates.

To visualise the correlation, I plot a scatterplot in R using ggplot() and I fit a regression line (using stat_smooth()). However, since both my x and y variable are performance measures, I need to take both of them into account when fitting my regression line – thus, I cannot use a simple linear regression (using stat_smooth(method="lm")), but rather need to fit an orthogonal regression (or Total least squares). How would I go about doing this?

I know I can specify formula in stat_smooth(), but I wouldn't know what formula to use. From what I understand, none of the preset methods (lm, glm, gam, loess, rlm) are applicable.

talat
  • 66,143
  • 20
  • 123
  • 153
rvrvrv
  • 830
  • 3
  • 9
  • 27
  • 1
    You can pass the `slope` and `intercept` from your model to `geom_abline` or you can use the approach shown [here](http://stackoverflow.com/questions/19284897/smooth-pspline-wrapper-for-stat-smooth-in-ggplot2) to create your own method – user20650 Nov 18 '14 at 15:06

3 Answers3

6

It turns out that you can extract the slope and intercept from principal components analysis on (x,y), as shown here. This is just a little simpler, runs in base R, and gives the identical result to using Deming(...) in MethComp.

# same `x and `y` as @user20650's answer
df  <- data.frame(y, x)
pca <- prcomp(~x+y, df)
slp <- with(pca, rotation[2,1] / rotation[1,1])
int <- with(pca, center[2] - slp*center[1])

ggplot(df, aes(x,y)) + 
  geom_point() + 
  stat_smooth(method=lm, color="green", se=FALSE) +
  geom_abline(slope=slp, intercept=int, color="blue")

Brian D
  • 2,388
  • 24
  • 41
jlhoward
  • 56,091
  • 6
  • 91
  • 135
  • Great method while avoiding additional packages. One further question, more of aesthetics perhaps, is how to limit the length of `geom_abline()` to the data, like `stat_smooth()`? Currently `geom_abline()` extends all the way across the plot, regardless of whether the data points extend all the way across the plot. – rvrvrv Nov 21 '14 at 10:56
  • 1
    One way is to use `geom_segment`. You know the minimum and maximum of the x-range in the data, so use the slope and intercept to calculate the y values at these limits and then use `geom_segment` to draw the line. Or you could replace the `Deming` function with the nice `prcomp` method in the function `f` below. – user20650 Nov 21 '14 at 12:18
4

Caveat: not familiar with this method

I think you should be able to just pass the slope and intercept to geom_abline to produce the fitted line. Alternatively, you could define your own method to pass to stat_smooth (as shown at the link smooth.Pspline wrapper for stat_smooth (in ggplot2)). I used the Deming function from the MethComp package as suggested at link How to calculate Total least squares in R? (Orthogonal regression).

library(MethComp)
library(ggplot2)

# Sample data and model (from ?Deming example) 
set.seed(1)
M <- runif(100,0,5)
# Measurements:
x <-         M + rnorm(100)
y <- 2 + 3 * M + rnorm(100,sd=2)

# Deming regression
mod <- Deming(x,y)

# Define functions to pass to stat_smooth - see mnel's answer at link for details
# Defined the Deming model output as class Deming to define the predict method
# I only used the intercept and slope for predictions - is this correct?
f <- function(formula,data,SDR=2,...){
        M <- model.frame(formula, data)
        d <- Deming(x =M[,2],y =M[,1], sdr=SDR)[1:2]
        class(d) <- "Deming"
        d  
        }

# an s3 method for predictdf (called within stat_smooth)
predictdf.Deming <- function(model, xseq, se, level) {
                         pred <- model %*% t(cbind(1, xseq) )
                         data.frame(x = xseq, y = c(pred))
                         }

ggplot(data.frame(x,y), aes(x, y)) + geom_point() + 
          stat_smooth(method = f, se= FALSE, colour='red', formula=y~x, SDR=1) +  
          geom_abline(intercept=mod[1], slope=mod[2], colour='blue') +
          stat_smooth(method = "lm", se= FALSE, colour='green', formula = y~x)

enter image description here

So passing the intercept and slope to geom_abline produces the same fitted line (as expected). So if this is the correct approach then imo its easier to go with this.

Community
  • 1
  • 1
user20650
  • 23,419
  • 5
  • 52
  • 85
2

The MethComp package seems to be no longer maintained (was removed from CRAN). Russel88/COEF allows to use stat_/geom_summary with method="tls" to add an orthogonal regression line.

Based on this and wikipedia:Deming_regression I created the following functions, which allow to use noise ratios other than 1:


deming.fit <- function(x, y, noise_ratio = sd(y)/sd(x)) {
  if(missing(noise_ratio) || is.null(noise_ratio)) noise_ratio <- eval(formals(sys.function(0))$noise_ratio) # this is just a complicated way to write `sd(y)/sd(x)`
  delta <-  noise_ratio^2
  x_name <- deparse(substitute(x))

  s_yy <- var(y)
  s_xx <- var(x)
  s_xy <- cov(x, y)
  beta1 <- (s_yy - delta*s_xx + sqrt((s_yy - delta*s_xx)^2 + 4*delta*s_xy^2)) / (2*s_xy)
  beta0 <- mean(y) - beta1 * mean(x) 

  res <- c(beta0 = beta0, beta1 = beta1)
  names(res) <- c("(Intercept)", x_name)
  class(res) <- "Deming"
  res
}

deming <- function(formula, data, R = 100, noise_ratio = NULL, ...){
  ret <- boot::boot(
    data = model.frame(formula, data), 
    statistic = function(data, ind) {
      data <- data[ind, ]
      args <- rlang::parse_exprs(colnames(data))
      names(args) <- c("y", "x")
      rlang::eval_tidy(rlang::expr(deming.fit(!!!args, noise_ratio = noise_ratio)), data, env = rlang::current_env())
    },
    R=R
  )
  class(ret) <- c("Deming", class(ret))
  ret  
}

predictdf.Deming <- function(model, xseq, se, level) {
  pred <- as.vector(tcrossprod(model$t0, cbind(1, xseq)))
  if(se) {
    preds <- tcrossprod(model$t, cbind(1, xseq))
    data.frame(
      x = xseq,
      y = pred,
      ymin = apply(preds, 2, function(x) quantile(x, probs = (1-level)/2)),
      ymax = apply(preds, 2, function(x) quantile(x, probs = 1-((1-level)/2)))
    )
  } else {
    return(data.frame(x = xseq, y = pred))
  }
}

# unrelated hlper function to create a nicer plot:
fix_plot_limits <- function(p) p + coord_cartesian(xlim=ggplot_build(p)$layout$panel_params[[1]]$x.range, ylim=ggplot_build(p)$layout$panel_params[[1]]$y.range)

Demonstration:

library(ggplot2)

#devtools::install_github("Russel88/COEF")
library(COEF)

fix_plot_limits(
    ggplot(data.frame(x = (1:5) + rnorm(100), y = (1:5) + rnorm(100)*2), mapping = aes(x=x, y=y)) +
      geom_point()
    ) +
  geom_smooth(method=deming, aes(color="deming"), method.args = list(noise_ratio=2)) +
  geom_smooth(method=lm, aes(color="lm")) +
  geom_smooth(method = COEF::tls, aes(color="tls"))

Created on 2019-12-04 by the reprex package (v0.3.0)

jan-glx
  • 5,998
  • 1
  • 37
  • 54
  • May you know why the confidence interval calculated using your function at `noise_ratio = 1` is slightly different from that produced by the COEF::tls method? P.S. Beautiful answer as it is the only one that includes confidence intervals. Thank you! – Cris Sep 28 '20 at 15:17
  • 1
    They are estimated with the bootstrap thus are stochastic, you could try to increase the number of bootstrap samples (`R`) and check if that helps to make them more similar! Let me know if there is a systematic difference! – jan-glx Sep 28 '20 at 16:03
  • Oh, you're right! the interval slightly changes in between runs. Thank you. – Cris Sep 28 '20 at 18:53
  • 1
    Extending geom_smooth like this is discouraged by the authors (https://github.com/tidyverse/ggplot2/issues/3132) (but I am not sure what the alternatives are). To work around the non-export of predictdf y ggplot2 one could use `registerS3method("predictdf", "YourClass", yourpackage:::predictdf.YourClass, envir = environment(ggplot2:::predictdf))` in yourpackage's `.on.load` – jan-glx Dec 01 '20 at 16:35
  • [Another geom_smooth hack.](https://stackoverflow.com/a/58187435/1870254). – jan-glx Dec 01 '20 at 16:35