
From Wikipedia I have the complement of the CDF parameterized for fat-tailed distributions.

$$ \Pr[X>x] \sim x^{- \alpha}\text{ as }x \to \infty,\qquad \alpha > 0.\, $$

Here $\alpha$ is the fatness parameter. According to Taleb, $\alpha \leq 2.5$ is forecastable, but $\alpha > 2.5$ is not.

I would like to fit $\alpha$ given my data so I can mark it as forecastable or not.

I thought I would start by trying to fit my data to a linear model.

library(tidyverse)

set.seed(42)
df_tails <- tibble(y = 1 - seq(0.01, 1, 0.01),
                   norm = sort(rnorm(n = 100, 0, 1)),
                   cauchy = sort(rcauchy(n = 100, 0, 1)))
lm(log(y) ~ norm - 1, data = df_tails)
lm(log(y) ~ cauchy - 1, data = df_tails)

The problem is that I end up with many NAs, so I think I am coding something wrong.

Try 2

library(tidyverse)
set.seed(42)

df_tails_raw <- tibble(y = log(1 - seq(0.01, 1, 0.01)),
                       norm = log(sort(rnorm(n = 100, 0, 1))),
                       cauchy = log(sort(rcauchy(n = 100, 0, 1))))
df_tails <- na.omit(df_tails_raw)

df_tails |>
  ggplot() +
  geom_point(aes(x = norm, y = y), color = 'tomato', size = 2, stroke = 2, shape = 1) +
  geom_point(aes(x = cauchy, y = y), color = 'grey50', size = 2, stroke = 2, shape = 1) +
  theme_classic() +
  labs(title = 'Red is normal and Grey is Cauchy')

lm(y ~ norm, data = df_tails)
lm(y ~ cauchy - 1, data = df_tails)

[Scatterplot of the log-transformed data: red = normal, grey = Cauchy]

My error is

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'y'

Alex
  • This procedure for estimating $\alpha$ is somewhat error-prone. You may wish to read Clauset et al., Power-law distributions in empirical data. – Sycorax Nov 30 '22 at 21:41
  • You will find it instructive to plot your data. Generally, you can get substantial information by plotting the empirical survival function (CCDF) on log-log axes, because its slope should be asymptotically $-\alpha.$ The point is that you don't want the central part of the data or even the left tail to screw up your analysis of the right tail, which is what you're interested in. – whuber Nov 30 '22 at 21:59
  • To be more specific about the coding error: your y includes a zero in its range, and that gives an error when you take the logarithm and pass the result to lm. – Sextus Empiricus Nov 30 '22 at 22:15
  • Where did you get the 2.5 limit from? That sounds interesting. I haven't heard of 'forecastable' and wonder what it means. A larger $\alpha$ means less fat tails. Why would that mean less forecastable? – Sextus Empiricus Nov 30 '22 at 22:21
  • I got the alpha level from a review of forecasting from the International Journal of Forecasting 38 (2022) page 731 – Alex Nov 30 '22 at 22:29
  • Use y = log(1- seq(0.01,0.99, 0.01)) instead. This removes log(0) from your tibble. (Then adjust the number of random draws you're using.) – Sycorax Nov 30 '22 at 22:48
  • @SextusEmpiricus also see https://en.wikipedia.org/wiki/Fat-tailed_distribution#The_extreme_case:_a_power-law_distribution – Alex Nov 30 '22 at 22:53
  • @Alex does the article contain the same information as the Wikipedia page? The latter only gives definitions of fat tails, but nothing about a forecastability limit at 2.5. – Sextus Empiricus Nov 30 '22 at 22:56
  • It is not the same information, but the paper has the following logic: n = 1 is not forecastable even in well-behaved cases, and for fat tails n going to infinity is equivalent to n = 1 in the well-behaved case. It does not result in an expectation (Cauchy-like). If no expectation exists, then it is a fool's errand to forecast. – Alex Nov 30 '22 at 23:04
  • At the risk of repeating myself: plot your tibble. It will be immediately evident what the problem is that raises the error, and even more evident that a linear model isn't going to work at all. – whuber Nov 30 '22 at 23:17
  • @whuber I plotted the data and have no additional insights as to why a linear model would not work – Alex Dec 01 '22 at 00:05
  • If you used scatterplots, you would see that the points don't come anywhere near falling along any line. – whuber Dec 01 '22 at 00:23
  • I think most CDFs can be very well approximated with a polynomial of degree 6 or less – Alex Dec 01 '22 at 00:32
  • The proposed model is that $\log(y) = \alpha \log(x)$. But your plot has linear axes, not logarithmic axes. – Sycorax Dec 01 '22 at 00:51
  • @Alex I've now read the passage and understand it better. You paraphrased it a bit strongly and also inverted the inequality on the bound, which confused me. What the article says is "a conservative heuristic is to manage variables with α ≤ 2.5 as practically unpredictable". And 'predictability' relates to the standard practice of estimating the mean of a population. For heavy-tailed distributions, one can still make reasonable predictions that rely not on the mean but on quantiles/order-statistics instead (unless the tails are super-heavy); see the sketch after this comment thread. – Sextus Empiricus Dec 01 '22 at 07:31
  • @SextusEmpiricus Thanks for expounding. You are correct; in terseness I erred by stating it too strongly. For the heavy-tailed distributions I would assume that an instance of the order statistics could be a random-walk process and use the last value as the best forecast. – Alex Dec 01 '22 at 14:38
  • The comment about polynomial approximations to CDFs is mistaken in two ways. First, your applications of lm fit lines to the inverse CDF (the quantile function), which cannot possibly be remotely close to polynomial: it has vertical asymptotes at 0 and 1. Second, any polynomial approximation of a CDF of any degree is doomed to fail dramatically in the tails and, moreover, is unlikely to be monotonic. For the third time: a scatterplot will reveal all. Superimpose your lm fit on that to see just how bad it is. – whuber Dec 01 '22 at 14:43
  • Thanks for teaching me something new. Also, I would superimpose the lm fit, but I am not sure my lm fit is parameterized correctly. – Alex Dec 01 '22 at 14:47
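
As a small illustration of the point about means versus quantiles made in the comments (this sketch is an editorial addition, not from the thread); base R is enough to see it:

# With Cauchy data the sample mean has no expectation to converge to, while
# quantiles such as the median remain well behaved as the sample grows
set.seed(42)
x <- rcauchy(1e4)
sapply(c(100, 1000, 10000), function(n) mean(x[1:n]))    # running mean: need not settle anywhere
sapply(c(100, 1000, 10000), function(n) median(x[1:n]))  # running median: stabilises near the true value 0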

2 Answers


There are several issues with this question.

The error message

The simplest is the issue with the error message, which is the explicit question in the text.

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : NA/NaN/Inf in 'y'

The error says that the dependent variable in the linear model is not right and contains NA/NaN/Inf. The reason is that your $y$ variable contains a zero, and when you take the logarithm of that you get -Inf, a non-finite value. When you pass this to the lm function you get the error. (Because you pass log(y) nested inside the lm call this is not so clear, but the 'y' of the lm function is your 'log(y)' value and not your 'y' value.)
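
As a minimal illustration of this point (an editorial sketch in base R, not part of the original answer):

# The last value of 1 - seq(0.01, 1, 0.01) is zero, so its logarithm is not finite,
# and lm() refuses a non-finite response
y <- 1 - seq(0.01, 1, 0.01)
log(tail(y, 1))              # -Inf (or NaN), the source of the lm error
keep <- is.finite(log(y))    # dropping the non-finite rows, or stopping the sequence
sum(keep)                    # at 0.99 as suggested in the comments, avoids the error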

Sidenote: to fit a power law with linearisation you should use $\log(y) = a + b \cdot \log(x)$. In your code you use $\log(y) = b \cdot x$: you omit the intercept and also do not take the logarithm of $x$.
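
A rough sketch of that linearisation (an editorial addition, not the question's code), applied only to the upper tail of a Cauchy sample; as the comments note, least-squares fits like this are error-prone compared with the maximum-likelihood approach discussed below, but they show the $\log(y) = a + b \cdot \log(x)$ form:

# Linearised power-law fit on the upper tail of a Cauchy sample
set.seed(42)
x <- rcauchy(1000)
x_tail <- sort(x[x > quantile(x, 0.95)])                  # keep the largest 5% of the sample
surv <- 1 - (seq_along(x_tail) - 0.5) / length(x_tail)    # empirical tail survival probabilities
fit <- lm(log(surv) ~ log(x_tail))
coef(fit)    # the slope estimates -alpha; for the Cauchy it should come out roughly -1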

The fitting of the power law

Distributions with power-law behaviour often show this behaviour only over a limited range. In your fitting method you should fit only the part of the distribution that follows the power law.

In the log-log plot below you see that you don't have a straight line over the entire range; in addition, the points in the tail are the ones with large scatter and error. If you plot the points along with the known underlying distribution, you see that the error is not just scatter: the errors are correlated and the entire curve can be off.

[Plot: example of tail behaviour on a log-log scale, with a log-normal distribution added for comparison]

On the plot I have also added a log-normal distribution. It shows that curves that are not really straight can appear to be straight. Just fitting a straight line does not tell you that you actually have a straight line.

The article mentioned by Sycorax in the comments, Power-law distributions in empirical data, discusses this issue in more detail.
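
A rough sketch of that maximum-likelihood approach (an editorial addition, assuming the poweRlaw R package, which implements the Clauset et al. procedure, is installed):

# Clauset-style fit of a continuous power-law tail with the poweRlaw package
library(poweRlaw)
set.seed(42)
x <- abs(rcauchy(1000))     # the fit needs positive data, so use magnitudes
m <- conpl$new(x)           # continuous power-law model object
est <- estimate_xmin(m)     # jointly estimates the cutoff xmin and the exponent
m$setXmin(est)
est$xmin                    # estimated start of the power-law region
est$pars                    # estimated exponent of the density; the tail (CCDF) exponent
                            # alpha used in the question is this value minus one
# bootstrap_p(m) would give a goodness-of-fit p-value for the power-law hypothesis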

  • This is great! Can you edit this response to show how I can find the alpha of the normal and Cauchy? You say that the power law only holds over a limited range. I am curious how Taleb and other forecasters in this area handle this nuance. – Alex Dec 01 '22 at 14:43
  • @Alex the normal distribution doesn't follow a power law. Trying to fit a power law would give a false result, which would be found out once you test the goodness of fit. – Sextus Empiricus Dec 01 '22 at 14:50
  • Is the normal fat tailed or not? – Alex Dec 01 '22 at 14:50
  • All I want to do is classify distributions as fat-tailed or not. I thought that alpha being less than 2.5 or not was straightforward. It seems I was naive. How can I classify a distribution as being fat-tailed or not? – Alex Dec 01 '22 at 14:51
  • @Alex the link given by Sycorax refers to a software package poweRlaw https://aaronclauset.github.io/powerlaws/ I do not know the package, but if it is the same as the article then it will use a fitting method that is different from the least-squares method (it uses the maximum likelihood estimate), it will make an estimate of the cutoff value for the range where the distribution starts to follow a power law (for this it uses a method that optimizes the deviance between the fit and the data), and it will verify the result with a goodness-of-fit test, which can be a KS test or an LR test. – Sextus Empiricus Dec 01 '22 at 14:54

You might want to use the tailplot function in the utilities package

The standard way of examining tail behaviour of data is through a tail plot or a Hill plot (or variations of these). The tail plot shows the tails of a dataset against the empirical tail probability, each exhibited on a logarithmic scale. The plot can be generated in R using the tailplot function in the utilities package. This function allows you to plot the data in one or both tails using a chosen proportion of the dataset (by default the plot shows 5% of the data in each tail), and it will compare this with a specified power rate of decay (by default it is compared with cubic decay, which determines finiteness of the variance).

In the code below I give an example of a tail plot for a set of $n=1000$ data points generated from a standard normal distribution. The plot shows that the tails of the distribution decay substantially faster than cubic decay, which is sufficient to give a finite variance. You can compare your data with an alternative rate of decay if you prefer.

#Set some mock data
set.seed(1)
DATA <- rnorm(1000)

#Show the tail plot
library(utilities)
tailplot(DATA)

[Tail plot of the simulated standard normal data]

Note that the tailplot function also allows you to include a Hill plot and/or a De Sousa-Michailidis plot, which are used for estimating the rate of decay of the tails. To include these plots, just set hill.plot = TRUE and/or dsm.plot = TRUE.
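
For example, using the mock data from above (a short sketch; the argument names are those stated in the paragraph above):

#Tail plot together with the Hill and De Sousa-Michailidis plots
tailplot(DATA, hill.plot = TRUE, dsm.plot = TRUE)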

Ben