Suppose we want to get a density estimate of some data X. One way is to compute the empirical CDF,
N <- 1e5
x <- seq(min(X), max(X), length=N)
F <- ecdf(X)(x)
and then estimate the density by taking the derivative of F.
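A runnable sketch of this setup (X here is an illustrative standard-normal sample, an assumption not from the question):

```r
# Sketch: evaluate the empirical CDF on an even grid over the data range.
# X is a placeholder sample, purely for illustration.
set.seed(1)
X <- rnorm(1e4)

N <- 1e5
x <- seq(min(X), max(X), length = N)   # evenly spaced grid over the data range
F <- ecdf(X)(x)                        # empirical CDF evaluated on the grid

# F is non-decreasing from 1/length(X) up to 1, so a finite-difference
# derivative of F along x gives a (raw) density estimate.
range(F)
```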
Question 1
Suppose the density
D <- diff(F, lag=1) / diff(x, lag=1)
Then the area under the density, A <- sum(D) * diff(x)[1], will equal 1 (here assuming that x is an evenly spaced vector). However, it may be desirable to smooth the estimate by taking a lag value greater than 1, e.g.
lg <- 50
D <- diff(F, lag=lg) / diff(x, lag=lg)
What do we need to do to D such that A <- sum(D) * diff(x)[1] is equal to unity? Namely, when taking lg > 1, how do we maintain that D is a density?
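A quick numerical check of the setup (a sketch; X is again an illustrative standard-normal sample, an assumption). Because diff(F, lag=lg) telescopes, sum(diff(F, lag=lg)) is approximately lg * (F[N] - F[1]), so the area A can be inspected directly:

```r
# Sketch: inspect the area under the lagged-difference density estimate.
# X is a placeholder sample, purely for illustration.
set.seed(1)
X <- rnorm(1e4)
N <- 1e5
x <- seq(min(X), max(X), length = N)
F <- ecdf(X)(x)

lg <- 50
D <- diff(F, lag = lg) / diff(x, lag = lg)   # lagged finite difference of F
A <- sum(D) * diff(x)[1]                     # Riemann sum of D on the grid
A                                            # approx. F[N] - F[1] = 1 - 1/length(X)
```

Note that on an even grid diff(x, lag = lg) equals lg * diff(x)[1], so the division by diff(x, lag = lg) already carries a factor of 1/lg; whether an extra correction is needed depends on which spacing D is divided by.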
Question 2
Supposing x is the natural domain of our observable, we may instead want to consider a re-scaled domain. This can be done in two ways:
- Take F on the natural domain x and then graph xp vs. D, where xp is, for instance, a constant re-scaling of x: xp <- x / max(X).
- Take F on the re-scaled domain and proceed as before with density estimation, e.g. F <- ecdf(X / max(X))(x) where x <- seq(0, 1, length=N).
I am interested in the second point, but have essentially the same problem as in question 1. In point 2 we introduce a change of variables, so D should be multiplied by something to account for this. It seems to make sense on paper, but I'm not sure how to implement it numerically; besides, a constant re-scaling seems too trivial to require computing inverse functions etc. Bonus points if you can combine Q1 and Q2.
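A sketch combining the lagged difference with the re-scaled domain (assuming a positive sample, here exponential draws chosen purely for illustration, so that X / max(X) lies in (0, 1]). Dividing by diff on the re-scaled grid absorbs the Jacobian of the constant re-scaling automatically, so no inverse function is needed for this case:

```r
# Sketch: density estimation on a re-scaled domain (Q2) with lag smoothing (Q1).
# X is an illustrative positive sample (exponential draws, an assumption).
set.seed(1)
X <- rexp(1e4)

N  <- 1e5
lg <- 50
y  <- seq(0, 1, length = N)          # re-scaled domain
Fy <- ecdf(X / max(X))(y)            # ECDF of the re-scaled data

Dy <- diff(Fy, lag = lg) / diff(y, lag = lg)
Ay <- sum(Dy) * diff(y)[1]           # area on the re-scaled grid
Ay

# Change of variables: with c <- max(X), the re-scaled density relates to the
# natural-domain one by g(y) = c * f(c * y); dividing by diff(y) = diff(x) / c
# supplies exactly this factor of c.
```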
As far as density estimation goes, I don't think the in-built kernel density estimates (e.g. R's density) are very helpful at all, too much B.T.S. Histograms are O.K., but in both of these cases one needs to involve parameters (number of bins, etc.). The ECDF tells you exactly what the data represents, no more, no less. – algae Apr 21 '21 at 00:31
Hence the lag value, which smooths this out and doesn't wrongly impact on the data. – algae Apr 21 '21 at 00:33