Remove points from QQ plot of genome-wide association study

Question

I have results from a genome-wide association study, which is basically a large set of univariate tests (akin to linear regression). The results include the statistical significance of each test, and I want to know whether the observed significance deviates from what is expected. A typical way to do this is to draw a QQ plot of expected vs observed P values. The issue is that I have 14 million points to pass to R's plot(), so plotting is very slow. Here is the code to compute the expected values and plot all points:

obs <- sort(d$P)
obs <- obs[!is.na(obs)]
obs <- obs[is.finite(-log10(obs))]

# Expected P values under the null: evenly spaced quantiles in (0, 1)
exp <- seq_along(obs) / (length(obs) + 1)

x <- exp
y <- obs

system.time(
  plot(-log10(x), -log10(y),
       pch=".", cex=4,
       xlab="x", ylab="y", main="QQ Plot")
)
abline(0,1,col="red",lwd=3, lty=3)

The plot takes about 10 min to render.

There was an answer on Cross Validated that seems to address this, so I decided to adapt the code in the top answer to see if it would work with my data:

https://stats.stackexchange.com/a/35264/53539

What I did was simply use the quant.subsample() function to compute a new set of x and y values for the plot() function:

quant.subsample <- function(y, m=100, e=1) {
  # m: size of a systematic sample
  # e: number of extreme values at either end to use
  x <- sort(y)
  n <- length(x)
  quants <- (1 + sin(1:m / (m+1) * pi - pi/2))/2
  sort(c(x[1:e], quantile(x, probs=quants), x[(n+1-e):n]))
  # Returns m + 2*e sorted values from the EDF of y
}
n.x <- n.y <- length(obs)
m <- .001 * max(n.x, n.y)
e <- floor(0.0005 * max(n.x, n.y))
x <- quant.subsample(exp, m, e)
y <- quant.subsample(obs, m, e)

Plotting is much faster, but as you can see the result is not identical in appearance to plotting all points:

I am not an experienced statistician, so I have clearly shot myself in the foot to some extent!

Questions:

Is what I did even a valid approach to solving the issue?
If the approach is valid, how do I obtain a plot with identical appearance? Is it simply a matter of modifying the m and e parameters until I get the desired result?

Of course it's not identical--it represents a sample. The shapes of the plots, though--if you were to connect the points from left to right--would be indistinguishable. The question then remains whether you should be connecting the points in that way. One of the things you look for in these plots is sudden jumps. You could locate any such jumps by scanning for large gaps in $x$ and $y$ and avoiding connecting them. That ought to give you identical plots whenever the dataset sizes are sufficiently large. In the referenced thread I posted another solution that will do a better job, btw. — whuber, Jul 19 '18 at 12:30
Thanks. I understand. I also noted that decreasing m to 0.0001 and increasing e to 0.005 generates a plot where the tail is nearly identical to that obtained by plotting all the points. In the answer to the linked question you state that "No information is lost whatsoever". Does this refer to the shape of the curve? I mistakenly assumed it meant that the non-overlapping points in the plot would be identical to those obtained by plotting all points, because the tail is sampled at a higher rate. — Vince, Jul 19 '18 at 13:00
Yes, it refers to the shape of the curve: that's what that other thread is concerned about. — whuber, Jul 19 '18 at 13:03
The adaptive version runs out of memory when using t.y=0.0005. My laptop has 16GB of RAM. I will test higher t.y values, but at this point seems like the random sampling approach is best given the size of the dataset and the performance I am looking to achieve. I simply need to clearly state that the QQ-plot represents a random sample. — Vince, Jul 19 '18 at 13:27
You have a problem with your installation, then. I just ran the adaptive version with 14 million points in both datasets and monitored R's RAM usage during the process. It never exceeded 5 GB. — whuber, Jul 19 '18 at 15:23
I was able to solve this by changing the last line of the function to sort(c(x[1:e], quantile(x[e:(n+1-e)], probs=quants), x[(n+1-e):n])), which excludes the tails from the sampling procedure. I now get identical plots, at least when I use appropriate values of m and e. — Vince, Aug 16 '18 at 18:07
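
For reference, the modification described in the last comment can be written out as a complete function. This is only a sketch under stated assumptions: the name quant.subsample.tails is hypothetical, the P values are simulated with runif() as a stand-in for d$P, and the variable expd replaces exp to avoid masking the base function. Flooring m and dropping the quantile names are small conveniences that are not part of the quoted code.

```r
# Sketch of the modified subsampling function from the comment above.
# Quantiles are taken from the central part of the data only; the e most
# extreme values at each end are kept as they are.
quant.subsample.tails <- function(y, m = 100, e = 1) {
  x <- sort(y)
  n <- length(x)
  m <- floor(m)  # assumption: use a whole number of quantiles
  quants <- (1 + sin(1:m / (m + 1) * pi - pi / 2)) / 2
  middle <- quantile(x[e:(n + 1 - e)], probs = quants, names = FALSE)
  sort(c(x[1:e], middle, x[(n + 1 - e):n]))
}

# Simulated P values (assumption: stand-in for the real d$P)
set.seed(1)
obs <- sort(runif(1e5))
expd <- seq_along(obs) / (length(obs) + 1)

m <- 0.001 * length(obs)
e <- floor(0.0005 * length(obs))
x <- quant.subsample.tails(expd, m, e)
y <- quant.subsample.tails(obs, m, e)
```

The resulting x and y can be passed to plot(-log10(x), -log10(y), ...) exactly as in the question. Because the e smallest P values are kept unchanged, the upper-right tail of the QQ plot should match the full plot point for point; only the bulk of the distribution is thinned.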

score 1 · Answer 1 · answered Jul 19 '18 at 13:15

Regarding question 1:

Ask yourself if the plot conveys the information you'd like your audience to receive. Is it necessary that every point be plotted in a single plot, or would perhaps intentional subsets provide more visual detail? If you want to plot a random sampling and use that for your visual, be sure to indicate that you are plotting a random sampling in your text.
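
If a random sample is what you want, one possible way to draw it is sketched below. The data are simulated, and n.tail and n.rand are hypothetical tuning values, not recommendations. The sketch samples indices of the sorted P values so that each retained point keeps the expected value computed from the full dataset, and it always keeps the smallest P values, since those are usually the ones of interest in a GWAS.

```r
# Simulated P values (assumption: stand-in for the real d$P)
set.seed(42)
p <- runif(1e6)
obs <- sort(p[!is.na(p) & p > 0])
n <- length(obs)
expd <- seq_len(n) / (n + 1)  # expected values from the FULL dataset

n.tail <- 1000   # hypothetical: number of smallest P values always kept
n.rand <- 10000  # hypothetical: size of the random sample from the rest

# Keep the tail, then sample the remaining ranks without replacement
idx <- sort(c(seq_len(n.tail), sample((n.tail + 1):n, n.rand)))
x <- expd[idx]
y <- obs[idx]
```

Plotting -log10(x) against -log10(y) then draws n.tail + n.rand points instead of all n. Sampling ranks, rather than recomputing expected values from the subsample, keeps every plotted point on the same curve as the full plot.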

A few suggestions regarding question 2 are:

Increase the memory that R can use.
memory.limit(size=4000) would set the memory limit to 4000 MB (this function is available on Windows only).

Plot directly to an image file. Note that this saves the plot as an image file to your working directory. You can do this by using the following code:

jpeg('imagename.jpg')
plot(-log10(x), -log10(y),
     pch=".", cex=4,
     xlab="x", ylab="y", main="QQ Plot")
abline(0,1,col="red",lwd=3, lty=3)
dev.off()

Thanks. I agree with point 1 that regardless of what approach I take, the method used to generate the visualization should be clearly stated. For point 2, print(memory.limit()) shows it is already at max of 16GB. Yes, plotting to file will help, but still very slow :( Basically, I wanted to see how far I can push performance and still obtain an identical looking plot. Seems like it's unlikely to work. — Vince, Jul 19 '18 at 13:23
