Suppose I have some sample data $x_i$. I can then estimate the quantile $Q_p(x_i)$ using, for example, the quantile() function in R.
Now suppose I add some random noise to the data: $y_i=x_i+\epsilon_i$ (keeping the $x_i$ unchanged) where $\epsilon_i$ are i.i.d. and drawn from some distribution with zero mean.
Is there anything I can say about the distribution of $Q_p(y_i)$?
I've done some numerical experiments in R in which the $x_i$ are constructed at the outset from a normal distribution and then the $\epsilon_i$ are randomly drawn from a known distribution (either uniform or normal). $Q_p(y_i)$ is calculated 500 times with different random $\epsilon_i$ to estimate its distribution.
It looks like $Q_p(y_i)$ follows a bell-shaped distribution whose mean is larger than $Q_p(x_i)$. Is there any theory on this?
R code below:
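    library(ggplot2)

    # Fixed sample, scaled so that the 95th percentile of the underlying normal is 1
    x <- rnorm(1e6, 0, 1/qnorm(0.95))
    Q_simulated <- rep(NA_real_, 500)
    for (s in 1:500) {
      # Fresh noise on every iteration; x is left unchanged
      epsilon <- rnorm(length(x), 0, 0.05)
      y <- x + epsilon
      Q_simulated[s] <- quantile(y, 0.95)
    }
    # Histogram of the simulated quantiles; the red line marks the 95th percentile of x
    ggplot(data.frame(x=Q_simulated), aes(x)) + geom_histogram() + geom_vline(xintercept=quantile(x, 0.95), colour="red")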
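A lighter-weight way to run the same experiment, as a rough sketch: it keeps $x$ fixed, redraws only the noise, and reports the difference between the mean simulated quantile and the 95th percentile of $x$ itself as a number rather than a plot. The sample size `n`, the seed, and the names `n_sim`, `sd_eps`, `shift` and `mc_se` are my own choices for illustration, not part of the script above.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

n      <- 5000   # assumed smaller sample size
n_sim  <- 500    # number of noise redraws, as in the script above
sd_eps <- 0.05   # s.d. of the added noise, as in the script above

# x is drawn once and then held fixed
x   <- rnorm(n, 0, 1 / qnorm(0.95))
Q_x <- quantile(x, 0.95, names = FALSE)

# Redraw only the noise and recompute the 95th percentile of y each time
Q_simulated <- replicate(n_sim, {
  y <- x + rnorm(n, 0, sd_eps)
  quantile(y, 0.95, names = FALSE)
})

# Average shift relative to the quantile of x, with its Monte Carlo standard error
shift <- mean(Q_simulated) - Q_x
mc_se <- sd(Q_simulated) / sqrt(n_sim)
cat("Q_0.95(x):", Q_x, "\n")
cat("mean shift:", shift, "(Monte Carlo s.e.", mc_se, ")\n")
```

The standard error here only reflects the randomness in $\epsilon$, because $x$ is never redrawn. Re-running with a different seed (and hence a different $x$) shows how much the shift depends on the particular fixed sample.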
EDIT:
Drawing a scatter plot of $x$ versus $y=x+\epsilon$ in red, superimposing the identity line $y=x$ in black and various quantile pairs $(Q_p(x),Q_p(y))$ in blue, gives the following plot:
    library(ggplot2)

    x <- rnorm(1e5, 0, 1/qnorm(0.95))
    epsilon <- rnorm(length(x), 0, 0.2)
    y <- x + epsilon
    # Probabilities from 0.001 to 0.999, denser in both tails
    p <- seq(0.1, 0.9, 0.1)
    p <- c(0.01*p, 0.1*p, p, 0.9+0.1*p, 0.99+0.01*p)
    qx <- quantile(x, p)
    qy <- quantile(y, p)
    # Red: (x, y) pairs; black: identity line; blue: quantile pairs (Q_p(x), Q_p(y))
    ggplot(data.frame(x=x, y=y), aes(x, y)) + geom_point(colour="red", alpha=0.04) + geom_abline(slope=1, intercept=0) + geom_point(data=data.frame(x=qx, y=qy), colour="blue")
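To put numbers on the blue points, here is a rough sketch that tabulates $Q_p(y)-Q_p(x)$ for a few values of $p$. Next to it is the shift one would get if $x$ were redrawn as well, in which case $y$ would be normal with variance $\sigma_x^2+\sigma_\epsilon^2$. That reference column is only a comparison under the i.i.d. assumption, which does not hold exactly in my setup because $x$ is fixed. The seed, the chosen `p` values, and the names `shifts` and `ref_shift` are illustrative.

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

sd_x   <- 1 / qnorm(0.95)
sd_eps <- 0.2

x <- rnorm(1e5, 0, sd_x)
y <- x + rnorm(length(x), 0, sd_eps)

p  <- c(0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99)
qx <- quantile(x, p, names = FALSE)
qy <- quantile(y, p, names = FALSE)

# Reference shift under the i.i.d. assumption:
# quantiles of N(0, sd_x^2 + sd_eps^2) minus quantiles of N(0, sd_x^2)
ref_shift <- qnorm(p, 0, sqrt(sd_x^2 + sd_eps^2)) - qnorm(p, 0, sd_x)

shifts <- data.frame(p = p, qx = qx, qy = qy,
                     shift = qy - qx, ref_shift = ref_shift)
print(round(shifts, 4))
```

In this run the shift is negative in the lower tail and positive in the upper tail, which is the same pattern as the blue points moving away from the identity line on both sides.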


… quantile()'s definition (Hyndman & Fan Type 7)? – Bob Mortimer Feb 27 '18 at 23:48

… x? (You don't need a sample of a million: a sample of a few thousand should do just fine.) – whuber Feb 27 '18 at 23:55

… x can vary then y is a normal with slightly increased variance and the red line=$Q$ is spot on in the centre. However in my question x is fixed in advance and I am adding noise to it, in effect $y_i$ are random variables which are not i.i.d – Bob Mortimer Feb 28 '18 at 00:05

… x is fixed in advance, it's useless to compare the mean of the quantiles in your code to $1$: you need to compare that mean to the 95th percentile of x itself. – whuber Feb 28 '18 at 14:54

… geom_vline(xintercept=quantile(x,0.95). I'm just trying to get to the bottom of why the random noise - which has pluses and minuses - seems to increase the quantile estimate. It wasn't clear from a previous comment but I see this consistently when re-running the script (so different x) or changing parameters in the script such as the s.d. of $\epsilon$ – Bob Mortimer Feb 28 '18 at 22:25