4

I have some data n > 3000 https://drive.google.com/file/d/1gwB_U_TOX-IQHZJJDX-WeErLzrZZFoXu/view?usp=sharing (Third column) that I believe based on my physical theory should follow a Lomax distribution. $$ f(x) = \frac{ks^k}{(k + x)^{s+1}} ~~~~~x>0 $$ The parameters of the model are determined by theory and are $$ s, k = 7.82\times 10^{-6}, 2.21 $$

The goal is to twofold. First, I need to determine whether this is a good model for the data particularly near the right tail. Second, I need to compare this model with an exponential fit to the data. To that end, I first used expparetotest from LREP package which test whether data comes from a Pareto distribution and the results confirm that the distribution of data is Pareto and not exponential. I also use the "Jackson" test from the package Renext which gives me the following results.

$statistic
[1] 3.326906

$df [1] 3185

$p.value [1] 0

I then use pareto2_test_ad from agop to test if the data is Pareto II, which I have read is the same as Lomax. However, I am getting the following results.

pareto2_test_ad(my_data, 7.8e-6)
W = 467.93, p-value < 2.2e-16

Which if I understand correctly means data is not coming from the specified distribution.

Additionally, I am using result = gpdAd(lowFreq, bootstrap = TRUE, bootnum = 300) from eva and I get

$statistic
[1] 17.55134

$p.value [1] 0.003311258

$theta Scale Shape 7.517058e-07 1.478888e+00

My questions:

  1. Is my approach correct? Is AD test a good test to determine whether my data comes from Lomax distribution.
  2. Why do the estimates of the parameters are different from mine?
  3. Why does p-value change with bootnum?
User1865345
  • 8,202

2 Answers2

3

You could fit the distribution using the gamlss package in R and inspect diagnostic plots. To check whether the fitted parameters are consistent with the theoretical parameters, you could use Wald tests or confidence intervals for the parameters (see below).

Let's fit the distribution and check the diagnostics (I assume the data is loaded and named dat):

library(gamlss)
library(gamlss.dist)
# Fit Lomax distribution
mod <- gamlss(X0~1, family = PARETO2o, data = dat)

plot(mod) # Diagnostic plots wp(mod, ylim.all = c(1)) # Worm-plot (detrended Q-Q plot) dtop(mod) # Detrended Owen's plot

Diagnostics

Here are the worm plot (upper panel) and Owen's plot (lower panel):

WP

The Q-Q plot and the worm plot as well as Owen's plot show that the Lomax is not a good fit for the data, especially the right tail.

The fitted parameters can be retrieved by:

exp(coef(mod, "mu"))
exp(coef(mod, "sigma"))

So $\hat{\sigma} = 0.6768762$ (this is $k$ in your formula) and $\hat{\mu}= 5.09676\times 10^{-7}$ ($s$ in your formula). To check whether these are consistent with $\sigma = 2.21$ and $\mu = 7.82\times 10^{-6}$, let's do a joint Wald test for both coefficients:

Vmat <- vcov(mod) # Covariance matrix
coefmat <- matrix(c(coef(mod, "mu"), coef(mod, "sigma")), 2, 1) # Coefficient matrix
Lmat <- matrix(c(1, 0, 0, 1), 2, 2) # Hypothesis matrix
cmat <- matrix(c(log(2.21), log(7.82e-6)), 2, 1) # Hypothesized values
# Wald-statistic
Fval <- (1/2)*(t((Lmat%*%coefmat - cmat))%*%solve(Lmat%*%Vmat%*%t(Lmat))%*%(Lmat%*%coefmat - cmat))
# P-value
1 - pf(Fval, 2, df.residual(mod))
[1] 0

The $p$-value is practically $0$ indicating that the model is not consistent with the theoretical values.

Finally, let's calculate the confidence intervals for the parameters:

exp(confint(mod))
                         2.5 %       97.5 %
mu.(Intercept)    4.483221e-07 5.794263e-07
sigma.(Intercept) 6.359687e-01 7.204151e-01

The $95\,\%$ confidence interval for $\mu$ is $(4.48\times 10^{-7};5.79\times 10^{-7})$ and for $\sigma$ it's $(0.636; 0.720)$. The theoretical values are not part of the intervals, reaffirming our conclusion that the fitted model is not compatible with the theoretical paramter values.

COOLSerdash
  • 30,198
1

You could use a Kolmogorov-Smirnov (KS) test by plugging Maximum-Likelihood estimates into the target (Lomax) distribution function. Although the KS test should use a fully specified distribution (with no estimated parameters) using estimated parameters works fine when the number of observations is large as is the case here.

Inasmuch the interest seems to be on the tail, one can use the Peaks Over Threshold (POT) approach, which consists in fitting a distribution for the excesses over a threshold $u$ chosen to high enough. So instead of the original observations $Y_i$ we use the excesses $Y_i -u$, selecting the observations with $Y_i > u$. In this approach we favour the Generalized Pareto Distribution (GPD) for the excesses, because this distribution is "POT"-stable. The Lomax is a special case of the GPD.

Note that the data are poorly scaled which may lead to problems with estimation routines. This may explain why your estimates (which I suspect are given in the wrong order) differ from those coming from different packages. I would recommend to scale the data. It is also worth noting that if a specific order exists for the observations (e.g. records in time) these could have different "regimes". If the whole distribution was to be investigated, we would probably have to use a mixture of parametric distributions as suggested by the density of logged observations obtained with density.

> dat <- 1e5 * dat[ , 3]
> library(Renext)
> fitL <- flomax(dat)
> coL <- fitL$estimate
> coL

shape scale

0.67633153 0.05085796

> ks.test(x = dat, y = "plomax", scale = coL["scale"], shape = coL["shape"])

Warning in ks.test.default(x = dat, y = "plomax", scale = coL["scale"], : ties

should not be present for the Kolmogorov-Smirnov test

Asymptotic one-sample Kolmogorov-Smirnov test

data: dat

D = 0.055853, p-value = 4.688e-09

alternative hypothesis: two-sided

So we clearly reject the null hypothesis: the observations can not be regarded as forming a sample of a Lomax distribution.

Now let us try a POT model.

> u <- 3.0
> exc <- dat[dat > u] - u
> fitL2 <- flomax(exc)
> coL2 <- fitL2$estimate
> ks.test(x = exc, y = "plomax", scale = coL2["scale"], shape = coL2["shape"])

Warning in ks.test.default(x = exc, y = "plomax", scale = coL2["scale"], : ties

should not be present for the Kolmogorov-Smirnov test

Asymptotic one-sample Kolmogorov-Smirnov test

data: exc

D = 0.047207, p-value = 0.9009

alternative hypothesis: two-sided

The Kolmogorov-Smirnov test tells that a Lomax distribution is quite good for the chosen threshold. The same conclusion seems to arise with different choices of the threshold.

Now let us turn to the question: could the tail be exponential rather than Lomax? This is nearly a routine test in POT and we can use LRExp.test, the null hypothesis being an exponential distribution. As explained in the documentation, the Likelihood-Ratio test uses by default a numeric approximation of the test statistic. We can use a simulation method to get a more precise result as follows

LRExp.test(exc, method = "sim", nSamp = 100000)
   ## $statistic
   ## [1] 5.662787
   ## 
   ## $df
   ## [1] 146
   ## 
   ## $p.value
   ## [1] 0.00541
   ## 
   ## $method
   ## [1] "LR test of exponentiality"
   ## 
   ## $alternative
   ## [1] "lomax"

So clearly we should reject the exponentiality (of the tail) and favour the Lomax alternative. Though slightly less powerful, we could have used Jackson's test similarly (with the same conclusion).

We could have use the Renouv function of Renext to fit a POT model. The KS test would then on "jitterized" version of the observations to remove ties hence discard the warning.

Yves
  • 5,358