3

Recently, I was wondering about how to "restrict" a statistical model from making predictions beyond a certain range (Preventing Illogical Interoperations of Models?).

For example, in this video (https://www.youtube.com/watch?v=h5aPo5wXN8E&list=PLDcUM9US4XdNM4Edgs7weiyIguLSToZRI&index=3 @ 56:40), a Bayesian Model is created using the Log Normal Distribution when modelling human heights as heights can not take negative values.

After spending some more time reading about this, I came across the idea of "Truncated Probability Distributions" (https://en.wikipedia.org/wiki/Truncated_normal_distribution). As I understand it, a Truncated Probability Distribution is a Probability Distribution that is defined only on a "limited range" (i.e. "restricted"). For example, consider the Normal Distribution - we can "truncate" this distribution to the interval $[a, b]$, which gives the following density for $a \le x \le b$:

$$f(x; \mu, \sigma, a, b) = \frac{1}{\sigma} \cdot \frac{\phi\left(\frac{x-\mu}{\sigma}\right)}{\Phi\left(\frac{b-\mu}{\sigma}\right) - \Phi\left(\frac{a-\mu}{\sigma}\right)}$$

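Where $\phi$ is the standard normal density: $$\phi(x) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}x^2\right)$$ and $\Phi$ is the standard normal cumulative distribution function. The density is zero for $x$ outside $[a, b]$.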
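To make the formula concrete, here is a minimal sketch of the truncated normal density using only the Python standard library. The parameter values (`mu = 80`, `sigma = 15`, `a = 0`, `b = 122`) are made-up illustration values, not estimates from any data, and the helper names are my own. The numerical integral is just a sanity check that the renormalised density integrates to 1 over $[a, b]$.

```python
import math

def std_normal_pdf(z):
    # phi(z): standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    # Phi(z), written with erfc so it stays accurate in the tails
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def truncnorm_pdf(x, mu, sigma, a, b):
    # Density of a normal(mu, sigma) truncated to [a, b]
    if x < a or x > b:
        return 0.0
    alpha = (a - mu) / sigma
    beta = (b - mu) / sigma
    z = (x - mu) / sigma
    return std_normal_pdf(z) / (sigma * (std_normal_cdf(beta) - std_normal_cdf(alpha)))

def integrate(f, lo, hi, n=20000):
    # Simple midpoint rule, good enough for a sanity check
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

# Illustrative values only (assumptions, not fitted to anything)
mu, sigma, a, b = 80.0, 15.0, 0.0, 122.0
total = integrate(lambda x: truncnorm_pdf(x, mu, sigma, a, b), a, b)
print("integral over [a, b]:", round(total, 6))
print("density at x = 130:", truncnorm_pdf(130.0, mu, sigma, a, b))
```

Inside $[a, b]$ the truncated density is slightly higher than the untruncated one, because the probability mass that fell outside the interval has been redistributed by the normalising denominator.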

This leads me to my question: Suppose I collect some data on how long different people lived and the average yearly income they earned over their life. Suppose I am interested in modelling (e.g. regression) the effect of income on life expectancy. In this problem, it is quite likely that I would observe an upward trend, in that people with higher incomes likely had access to better quality healthcare and thus lived longer. However, it is also possible that if I use this model to predict the life expectancy of a billionaire, the predicted life expectancy might be around 200 years - and we know that in modern history, no human has ever been recorded as living that long.

Thus, suppose I found out the maximum age a human has ever reached - to avoid making such illogical predictions, could I create a GLM Regression Model based on a Truncated Normal Probability Distribution between $a = 0$ and $b$ = max_age_ever_recorded and thus address this problem of illogical predictions? Is this a statistically valid approach? Or is this illogical or unnecessary?

Thanks!

stats_noob
  • 4
    It isn't illogical, and it isn't necessary. Not sure what you mean by "statistically valid". – Galen May 01 '23 at 04:26
  • @ Galen: thank you so much for your reply! If you have time, can you please elaborate on this? Why is this approach not necessary? Thank you so much! – stats_noob May 01 '23 at 04:29
  • Re "not necessary": because you could use a distribution with non-negative support that fits the data well. The fact that the distribution allows for finite ages arbitrarily far above what is observed doesn't mean that it isn't a useful model. When the right tail decays sufficiently quickly, you will not observe stupendous values even in simulation. – Galen May 01 '23 at 14:17

2 Answers

7

There's a famous quote from George Box that

All models are wrong, but some are useful.

Sure, you can use a truncated distribution or another distribution that has a restricted range out of the box. But what would be the upper bound? If you get it wrong, your model would be wrong as well!

However, let's suppose that you didn't restrict the range, so what? Yes, your model could say that life expectancy as a function of income is 200 years for a billionaire, so what? First of all, life expectancy is not a function of income: a billionaire may die like any of us in circumstances where their wealth would not change anything. So your model is obviously wrong, as life expectancy as a function of wealth is not the "true" explanation. The model could still be useful in some scenarios, as long as you remember its limited applicability. Truncating the distribution would just be putting lipstick on a pig.

But, of course, if we have good reasons to use models using things like truncated distributions, we do so. But doing this to cover up the fact that the model does not work for some scenarios is not a good reason. In fact, it may hide the problems with the model giving you a false sense of it working properly. It would only force your linear predictions to fit the square hole.
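A small numerical sketch of that last point. It assumes one possible parameterisation, in which the linear predictor drives the location $\mu$ of the parent normal; the coefficients, $\sigma$, and the upper bound `B` are all invented for illustration. It uses the standard closed-form mean of a truncated normal, $\mu + \sigma\,\frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)}$ with $\alpha = (a-\mu)/\sigma$ and $\beta = (b-\mu)/\sigma$.

```python
import math

def phi(z):
    # standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    # standard normal CDF via erfc (accurate in the far tails)
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def truncnorm_mean(mu, sigma, a, b):
    # Mean of a normal(mu, sigma) truncated to [a, b]
    alpha = (a - mu) / sigma
    beta = (b - mu) / sigma
    return mu + sigma * (phi(alpha) - phi(beta)) / (Phi(beta) - Phi(alpha))

A, B = 0.0, 122.0   # B is a hypothetical "maximum recorded age"
SIGMA = 10.0        # hypothetical residual scale

def linear_mu(income):
    # Hypothetical linear predictor; income in arbitrary units
    return 70.0 + 2.0 * income

incomes = [1, 10, 30, 50]
means = [truncnorm_mean(linear_mu(x), SIGMA, A, B) for x in incomes]
for x, m in zip(incomes, means):
    print(f"income={x:>2}  linear mu={linear_mu(x):6.1f}  truncated mean={m:6.1f}")
```

The linear predictor still runs off to 170 for the largest income, but the reported mean is squeezed just under `B`. The prediction looks plausible while the underlying linear trend is exactly as wrong as before.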

Tim
  • 138,066
  • @ Tim: Thank you so much for your answer! If you have time, can you please think of a similar example where a regression model based on a truncated distribution might be useful? thank you so much! – stats_noob May 10 '23 at 04:37
  • @stats_noob for example, when the distribution is actually truncated, e.g. values above some level were not observed (but exist). – Tim May 10 '23 at 07:51
  • @Tim: Thank you for your reply! Maybe I am overthinking this, but I find this idea very confusing. In my question about life expectancy and salary ... I might not have "observed" any billionaires or any very old people (e.g. over 100 years) ... yet these values (i.e. high income, old age) still exist. Therefore, would this not be an argument in favor of a "regression model based on a truncated probability distribution"? Thank you so much! – stats_noob May 10 '23 at 14:20
  • @stats_noob see https://stats.stackexchange.com/questions/197628/laymans-explanation-of-censoring-in-survival-analysis/198481#198481 – Tim May 10 '23 at 14:56
  • @ Tim: thank you for your reply! I will read this link - I think I understand the concept of Censoring from Survival Analysis better ... still trying to understand when is it suitable to use Truncation. – stats_noob May 10 '23 at 15:07
  • In the present case, perhaps a more pertinent quote is "All models are wrong, but some are really wrong and not at all useful." – Ben May 12 '23 at 00:42
4

With great respect, what you are proposing is a really awful way to model the relationship of income to life expectancy. Before you even get to the choice of models/distributions, you are missing one of the most important statistical phenomena at work in this type of problem: cumulative income is likely to be strongly positively related to total lifetime primarily because it is accumulated steadily over that lifetime --- a person who has twice as much lifetime as an adult will have roughly twice as long to earn money, so they will tend to have a higher cumulative income over their lifetime. This is likely to give a strong statistical effect which will dwarf the effect of the positive statistical relationship between income and health.
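A quick simulation sketch of this accumulation effect. Everything here is an assumption made for illustration: lifespans are drawn uniformly between 40 and 95, yearly income is log-normal and generated independently of lifespan, and earning starts at age 20. Even with no relationship at all between yearly income and lifespan, cumulative income ends up clearly correlated with lifespan.

```python
import math
import random

def pearson(xs, ys):
    # Plain Pearson correlation coefficient
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 5000

# Hypothetical data-generating process: income is independent of lifespan
lifespans = [rng.uniform(40, 95) for _ in range(n)]
yearly_income = [rng.lognormvariate(math.log(40000), 0.5) for _ in range(n)]

# Cumulative income = yearly income times years spent earning (from age 20)
cumulative_income = [inc * (age - 20) for inc, age in zip(yearly_income, lifespans)]

r_yearly = pearson(yearly_income, lifespans)
r_cumulative = pearson(cumulative_income, lifespans)
print("corr(yearly income, lifespan):    ", round(r_yearly, 3))
print("corr(cumulative income, lifespan):", round(r_cumulative, 3))
```

The first correlation is near zero by construction, while the second is substantial purely because longer-lived people had more years in which to earn.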

If you want to model the relationship between health and income, you should be looking at various standard forms of survival models and you should be aiming to get longitudinal data on the income of people at each year (or other relevant interval) over their lifetime. You can then build a survival model where the conditional probability of survival in each interval is based on a regression of various income variables up to that point (e.g., present income, average or cumulative past income, etc.). If you can collect longitudinal data on health for the same people (or other relevant covariates) then you can also incorporate these variables into your survival regression.
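As a rough sketch of what such a model looks like, here is a discrete-time survival calculation in which the probability of dying in each yearly interval is a logistic function of age and of income variables observed up to that point. The logistic link, the coefficients, and the function names are all hypothetical choices for illustration; in practice the coefficients would be estimated from person-period data.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def interval_hazard(age, income_history,
                    b0=-11.0, b_age=0.09, b_now=-0.1, b_avg=-0.2):
    # Conditional probability of dying in this interval, given survival so far.
    # All coefficients are made-up illustration values.
    present = income_history[-1]
    average = sum(income_history) / len(income_history)
    x = (b0 + b_age * age
         + b_now * math.log(present / 10000.0)
         + b_avg * math.log(average / 10000.0))
    return sigmoid(x)

def survival_curve(start_age, incomes):
    # S(t) = product over intervals up to t of (1 - hazard in that interval)
    surv, curve = 1.0, []
    for t in range(len(incomes)):
        h = interval_hazard(start_age + t, incomes[: t + 1])
        surv *= 1.0 - h
        curve.append(surv)
    return curve

# Two hypothetical people followed from age 30 for 60 years
curve_low = survival_curve(30, [20000.0] * 60)
curve_high = survival_curve(30, [80000.0] * 60)
print("P(survive to 90), low income: ", round(curve_low[-1], 3))
print("P(survive to 90), high income:", round(curve_high[-1], 3))
```

Because the model works with per-interval survival probabilities, predictions are always probabilities between 0 and 1, and no artificial upper bound on age has to be imposed.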

In order to get a feel for this field of analysis, I recommend you read some introductory material on survival analysis. I have never seen a truncated normal distribution used within this field and it has some glaring problems that would make it a poor choice for almost any purpose (e.g., imposing a hard cut-off on the maximum possible age, having a strange shape with some continuity and symmetry but then hard cut-offs, etc.). There are applications for truncated normal distributions in other fields, but this is not an area where they would be fruitful. Proper survival models deal with the death of elderly people by having hazard rates that increase rapidly in old age, making it highly unlikely (but not impossible) for an elderly person to live to a much older age. If you would like a primer on how to do survival modelling that incorporates regression effects based on longitudinal variables, you can have a look at Allison (2014).
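To illustrate the point about rapidly increasing hazard rates, here is a sketch using a Gompertz hazard, $h(t) = a e^{bt}$, which is one common parametric choice for adult mortality (my choice for the example, not something taken from Allison). The values of `a` and `b` are rough illustration values, not fitted estimates.

```python
import math

def gompertz_hazard(t, a=1e-4, b=0.085):
    # Hazard rate at age t; grows exponentially with age
    return a * math.exp(b * t)

def conditional_survival(t0, t1, a=1e-4, b=0.085):
    # P(alive at t1 | alive at t0) = exp(-integral of the hazard from t0 to t1)
    return math.exp(-(a / b) * (math.exp(b * t1) - math.exp(b * t0)))

for age in (60, 80, 100):
    print(f"hazard at {age}: {gompertz_hazard(age):.4f}")

p_105 = conditional_survival(100, 105)
p_122 = conditional_survival(100, 122)
print("P(reach 105 | alive at 100):", p_105)
print("P(reach 122 | alive at 100):", p_122)
```

Very old ages get vanishingly small but non-zero probability, so the model discourages absurd predictions through the shape of the hazard instead of through a hard cut-off.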

Ben
  • 124,856
  • 1
    @ Ben: Thank you so much for your kind answer! In reality, I am just interested in understanding what kinds of situations are meant for Regression Models based on Truncated Probability Distributions - the example of life expectancy and income was an arbitrary example that I thought of. Earlier, I had posted another example where I thought that Regression Models based on Truncated Probability Distributions might be useful, but the question was closed (https://stats.stackexchange.com/questions/614128/preventing-illogical-interoperations-of-models) – stats_noob May 11 '23 at 01:26
  • 1
    Thank you so much for your time to educate me about these topics - I really appreciate it! – stats_noob May 11 '23 at 01:27