4

https://www.r-bloggers.com/2019/01/quantile-regression-in-r-2/

I see the above method. The regression result is a straight line. But the quantile of the real data may not be on a straight line. The following is a made-up example.

R> x=seq(from=.5, to=5, len=1000)
R> d=data.frame(x=x, y=log(rgamma(1000,shape=x)))
R> library(quantreg)
Loading required package: SparseM

Attaching package: ‘SparseM’

The following object is masked from ‘package:base’:

backsolve

R> rqfit=rq(y ~ x, data = d) R> summary(rqfit)

Call: rq(formula = y ~ x, data = d)

tau: [1] 0.5

Coefficients: coefficients lower bd upper bd (Intercept) -0.61739 -0.78560 -0.48523 x 0.48435 0.44357 0.52561 Warning message: In rq.fit.br(x, y, tau = tau, ci = TRUE, ...) : Solution may be nonunique

When the number of data points is enough, one can 1) group the data into bins along the x-axis, or 2) for each data point, look at its neighboring points in the x-axis to empirically estimate the quantiles. But this solution is not elegant, especially (1) at the bin boundaries or (2) at the left/right ends.

Here is some real data. It seems that the range is more in the middle than the left and right ends.

https://i.stack.imgur.com/8sS0H.gif

tail -c +43 8sS0H.gif > 8sS0H.tsv

Could anybody show a way better than the bin or neighboring method to solve this problem?

I tagged this post with quantile-regression. But an elegant solution may require thinking out of the box. If anybody knows better tags, please add them as appropriate.

enter image description here

EDIT: Using the method by COOLSerdash. I got the following figure (V1 is the first column, V2 is the 2nd column, f is the input data). But the lines converge to the same point at the left end. Ideally, they should not converge even when there are not many data points at the ends. The top quantile line goes too high on the right so that there are no actual points supporting it. This forfeits the purpose of using quantile.

Is there a way to make the method more robust?

plot(V2~V1, data = f, type = "n", las = 1))
points(f$V1, f$V2, cex = 0.5, col = "#0000004C")
for (tau in 1:4/5) {
  rqfit <- rq(V2 ~ ns(V1, df = 5), tau = tau, data = f)
  fit <- predict(rqfit, newdata = data.frame(V1 = sort(f$V1)))
  lines(fit~sort(f$V1), col = "red", lwd = 1.5)
}

enter image description here

EDIT2: quantsheets still does not guarantee the order of quantile curves.

Using the following data, the quantsheet result looks like this. The quantile curves still cross on the top right corner. Any solution to this problem?

wget -qO-  https://i.stack.imgur.com/qlAM2.gif | tail -c +43

enter image description here

  • 1
    How would you do this in the least squares setting (more like "regular" linear regression)? Techniques I know range from splines to deep learning. What would you do? – Dave Aug 04 '21 at 17:07
  • Regarding your edit, it can get worse than that. When you start allowing for nonlinear trends like splines allow, the quantiles can cross; that is, the predicted $q_{0.51}$ will be less than the predicted median! I have heard this called quantile crossover. – Dave Aug 04 '21 at 21:43
  • But the rudimental bin or neighboring methods do not have these problems. So there are no methods better than the bin and neighboring methods? – user1424739 Aug 04 '21 at 22:03

1 Answers1

4

Edit

OP has revealed that they want to fit quantile curves with the additional contstraint that the quantile curves do not cross. This is possible using a quantile sheet (Eilers & Schnabel 2013). This algorithm estimates all quantiles simultaneously which avoids (but does not eliminate completely) the problem of crossing quantiles. Quantile sheets are implemented in the gamlss package in R (Stasinopoulos et al. 2017). The quantile curves are estimated using B-Splines. Using the artificially generated data from the original question:

library(gamlss)
library(colorspace)
# Fit the quantile sheet
gamlss_qsheet <- quantSheets(y, x, data = d)
# Predictions for plotting
xx <- seq(0.5, 5, length.out = 5000)
preds <- predict(gamlss_qsheet, newdata = data.frame(x = xx))
# Plotting
plot(y~x, data = d, type = "n", las = 1, ylim = c(-6, 2.5))
points(d$x, d$y, cex = 0.5, col = "#0000004C")
cols <- rainbow_hcl(ncol(preds), start = 10, end = 350)
for(i in seq_len(ncol(preds))) {
  lines(preds[, i]~xx, col = cols[i], lwd = 2)
}

QuantSheet

Original answer

It seems that you want to allow for potentially nonlinear associations of the quantiles. One method that achieves that are splines and the splines package works fine with rq in R. In the following example, I use your data generating process and natural splines with 4 internal knots to fit the 0.2, 0.4, 0.6 and 0.8 percentiles.

library(quantreg)
library(splines)

set.seed(142857) # Reproducibility x <- seq(from = 0.5, to = 5, len = 1000) d <- data.frame(x = x, y = log(rgamma(1000, shape = x)))

Plots

plot(y~x, data = d, type = "n", las = 1, ylim = c(-4, 2.5)) points(d$x, d$y, cex = 0.5, col = "#0000004C")

Fit the quantile regressions

for (tau in 1:4/5) { rqfit <- rq(y ~ ns(x, df = 5), tau = tau, data = d) xx <- seq(0.5, 5, length.out = 5000) fit <- predict(rqfit, newdata = data.frame(x = xx)) lines(fit~xx, col = "red", lwd = 1.5) }

QR_splines

It's apparent that there is considerable nonlinearity present, especially for the lower percentiles.

References

Schnabel SK and Eilers PHC (2013): Simultaneous estimation of quantile curves using quantile sheets, AStA Advances in Statistical Analysis, 97, 1, pp 77-87, Springer.

Stasinopoulos MD, Rigby RA, Heller GZ, Voudouris V, De Bastiani F (2017): Flexible Regression and Smoothing. Using GAMLSS in R. CRC Press.

COOLSerdash
  • 30,198
  • How to determine the df as 5? Is it related to the number of taus used? – user1424739 Aug 04 '21 at 18:44
  • I chose that arbitrarily. Usually, you'll need no more than 3-7 knots. If you knew from theory or literature where important changes occur, set a knot there. You could also use cross validation or AIC to choose the number of knots and get the most "bang for the buck". – COOLSerdash Aug 04 '21 at 18:45
  • I don't quite get how model.matrix(y~ns(x, df = 5), data = d) works. ns(x, df = 5) is of length 5000, but y is of length 1000. How can they match each in this function? – user1424739 Aug 04 '21 at 18:48
  • @user1424739 I updated the code. It now uses the more familiar and convenient predict instead of model.matrix. – COOLSerdash Aug 04 '21 at 19:00
  • See my update about the question on the real data. – user1424739 Aug 04 '21 at 21:36
  • So cent of quantSheets has to be of multiple levels even I just need one quatile level? I tried with just one cent level, quantSheets(y, x, data = d, cent = 100 * pnorm(-4 * 2/3)). But I got an error Error in solve.default(Q + P, r) : system is computationally singular: reciprocal condition number = 3.99533e-18. – user1424739 Aug 05 '21 at 11:32
  • @user1424739 A sheet with just one quantile would be just a regular quantile regression on this quantile. Quantile sheets fit multiple quantiles simultaneously. – COOLSerdash Aug 05 '21 at 11:35
  • So to have a more robust estimate of the quartile level that I am interested, it is better to specify other levels? The extra levels used can help with making each other level more robust? – user1424739 Aug 05 '21 at 11:38
  • What exactly do you mean by "robust"? Robust against what? Quantile sheets alleviate the problems of overlapping quantiles so they are robust against that particular problem. – COOLSerdash Aug 05 '21 at 11:41
  • The lines for two different quantile levels don't cross each other only when I specify both of them for the cent parameter in one call of quantSheets? If I specify them into to separate calls of quantSheets (and subsequent calls), then the resulted lines may still cross each other? – user1424739 Aug 05 '21 at 11:44
  • "The lines for two different quantile levels don't cross each other only when I specify both of them for the cent parameter in one call of quantSheets?" Yes, that's my understanding. – COOLSerdash Aug 05 '21 at 11:45
  • quantSheets can only be used for 2-d regression (as I don't see a support of any formula)? If there are two explanatory variables and one response variable, is there an equivalent variant? – user1424739 Aug 05 '21 at 11:55
  • @user1424739 I'm not aware of one. – COOLSerdash Aug 05 '21 at 11:59