I have a sample of only 142 numbers from a distribution of 3852 numbers ranging from 0 to 53, but it is censored below 35 (the values exist, but I don't have access to them), so I have only the values in the right tail, from 35 to 53. Is it possible to estimate what kind of distribution this is from just these 142 numbers, and to calculate the mean, standard deviation, or other parameters of this curve, using SciPy, R, MATLAB, or any other method?

   sample= [53,53,51,49,49,49,48,48,48,47,47,47,47,47,47,46,46,46,46,46,45,44,44,44,43,43,43,43,43,43,43,43,43,43,43,43,42,42,42,42,42,42,42,42,42,42,41,41,41,41,41,41,41,41,41,41,41,41,40,40,40,40,40,40,40,40,40,40,40,40,39,39,39,39,39,39,39,39,39,39,39,39,38,38,38,38,38,38,38,38,38,38,38,38,38,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,36,36,36,36,36,36,36,36,36,36,36,36,36,36,36,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35]
  • Just as a check: is truncation or censoring involved? https://en.wikipedia.org/wiki/Truncation_(statistics) or https://en.wikipedia.org/wiki/Censoring_(statistics). – JimB May 17 '23 at 03:23
  • I confused the terms. In fact, it's censored. The values exist, but I don't have access. I corrected the question. – Silvio Duarte May 17 '23 at 18:42
  • @SilvioDuarte I suggest rereading the definitions in the linked Wikipedia articles a bit more carefully. If you're having trouble, you might describe how the data are sampled. For instance, if they come from an assay that has a lower limit of detection, those data are truncated. I think that's the case you're dealing with. – AdamO May 17 '23 at 20:53
  • "Is it possible to estimate what kind of distribution this is": if you have no information about the distribution, then you cannot estimate what kind of distribution it is. You need to know at least something about it. For example, if you know some parametric/functional description of the distribution, then the information about the non-censored numbers can possibly give you an idea about the entire distribution. – Sextus Empiricus May 17 '23 at 21:10
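
To make the truncation/censoring distinction raised in these comments concrete, here is a sketch of the two likelihoods. The notation is an assumption introduced for illustration, not something stated in the question: a parametric model with probability mass function $f(x;\theta)$ and distribution function $F(x;\theta)$, a cutoff $c = 35$, $n = 142$ observed values $x_i \ge c$, and $N = 3852$ values in total.

$$L_{\text{trunc}}(\theta) = \prod_{i=1}^{n} \frac{f(x_i;\theta)}{1 - F(c-1;\theta)}, \qquad L_{\text{cens}}(\theta) = F(c-1;\theta)^{\,N-n} \prod_{i=1}^{n} f(x_i;\theta).$$

Only the censored form uses the number of unseen values, $N - n = 3710$; the truncated form conditions on $X \ge c$ and ignores how many values fell below the cutoff.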

1 Answer

You can use the fitdistrplus package in R to fit some distributions.

Here is an example. I start by plotting the data as a histogram. From the shape, and because the values are integers, I guess that the data could be described by a Poisson distribution lower-truncated at 34 (that is, only values of 35 or more are retained). I then fit this using the fitdist function together with the general truncated Poisson distribution functions in the extraDistr package. Finally, plotting the fit shows that the truncated Poisson fits the existing data well.

    library(fitdistrplus)
    library(extraDistr)

    sample <- c(53,53,51,49,49,49,48,48,48,47,47,47,47,47,47,46,46,46,46,46,45,44,44,44
               ,43,43,43,43,43,43,43,43,43,43,43,43,42,42,42,42,42,42,42,42,42,42,41,41
               ,41,41,41,41,41,41,41,41,41,41,40,40,40,40,40,40,40,40,40,40,40,40,39,39
               ,39,39,39,39,39,39,39,39,39,39,38,38,38,38,38,38,38,38,38,38,38,38,38,37
               ,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,36,36,36,36,36,36,36,36,36
               ,36,36,36,36,36,36,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35)

    # One histogram bin per integer value
    hist(sample, breaks = seq(-0.5, max(sample) + 0.5, by = 1))

    fd <- fitdist(
        data = sample
        , distr = "tpois"
        , fix.arg = list(a = 34)
        , start = list(lambda = 35) # guess by inspection of histogram
        , discrete = TRUE
    )

    summary(fd)
    plot(fd)

This returns $\hat\lambda \approx 36$. If (and this is a big if, since there is no evidence to back it up) the data truly come from a Poisson distribution from which all values below 35 have been truncated, then, because a Poisson distribution has mean and variance both equal to $\lambda$, the underlying generating distribution has mean $\approx 36$ and standard deviation $\approx \sqrt{36} = 6$.
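
For readers working in Python rather than R, the same truncated fit can be sketched with the standard library alone. This is an illustrative reimplementation, not the algorithm used by fitdistrplus: it maximizes the truncated-Poisson log-likelihood with a golden-section search. The names `COUNTS`, `truncated_loglik`, and `golden_max`, as well as the search bracket, are assumptions made for this example.

```python
import math

# Tally of the 142 observed scores (score: count), taken from the question's sample.
COUNTS = {53: 2, 51: 1, 49: 3, 48: 3, 47: 6, 46: 5, 45: 1, 44: 3, 43: 12,
          42: 10, 41: 12, 40: 12, 39: 12, 38: 13, 37: 16, 36: 15, 35: 16}
CUTOFF = 35  # smallest score that can be observed


def pois_logpmf(k, lam):
    """Log of the Poisson(lam) probability mass at k."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def pois_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), by direct summation."""
    return sum(math.exp(pois_logpmf(j, lam)) for j in range(k + 1))


def truncated_loglik(lam):
    """Log-likelihood when scores below CUTOFF are treated as never sampled."""
    n = sum(COUNTS.values())
    log_tail = math.log(1.0 - pois_cdf(CUTOFF - 1, lam))  # log P(X >= CUTOFF)
    data_term = sum(c * pois_logpmf(x, lam) for x, c in COUNTS.items())
    return data_term - n * log_tail


def golden_max(f, lo, hi, tol=1e-7):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if f(c) > f(d):
            hi = d  # the maximum lies in [lo, d]
        else:
            lo = c  # the maximum lies in [c, hi]
    return (lo + hi) / 2.0


# The bracket [20, 60] is an assumption chosen to contain the maximum.
lam_trunc = golden_max(truncated_loglik, 20.0, 60.0)
print(f"truncated Poisson MLE: lambda = {lam_trunc:.2f}")
print(f"implied mean = {lam_trunc:.2f}, implied sd = {math.sqrt(lam_trunc):.2f}")
```

The estimate should land close to the value reported by fitdist. Note that this likelihood never uses the total of 3852 scores, which is the point taken up in the comments below.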

Alex J
  • In your code I cannot see the crucial information that the dataset consists of 3852 observations. Unless that is correctly incorporated in the fit, the answer must be incorrect. In this case $36$ as the mean of a distribution of numbers primarily 53 or below, where 35 is around the 97th percentile, obviously is wrong. – whuber May 17 '23 at 18:51
  • Thanks Alex J, but as JimB asked above, this is actually a censored distribution, and I think your solution is for a truncated one. That's why I thought it was strange that you didn't pass the 3852 values to fitdistrplus. Can you solve it as a censored problem? – Silvio Duarte May 17 '23 at 18:54
  • If you are certain about the distributional form, up to the specification of a small number of parameters to be estimated, then you can fit these data with maximum likelihood as described at https://stats.stackexchange.com/questions/34882. Your data consist of counts of observations within the intervals $(-\infty,34],$ $[35], [36],\ldots,[53].$ That fit should be followed by a goodness of fit test. In the end, you will only be able to draw conclusions about the tail of the distribution that you observed, because anything could happen among the censored values. – whuber May 17 '23 at 18:59
  • BTW, Silvio, the censored Poisson parameter estimate is $\hat\lambda = 25.73$ -- but the data clearly don't look anything like the tail of a Poisson distribution (or much like the tail of any standard distribution at all). Perhaps you are asking the wrong question -- why do you want to fit a distribution in the first place? – whuber May 17 '23 at 19:08
  • Thank you, whuber. This data is from the score of a national contest in the country I live in. The 142 known values are from the approved candidates. The company that organized the exams does not make the other numbers available. So I wanted to be able to estimate the average score of the student population and how this distribution occurs. The US SAT test seems to fit a normal distribution, maybe this one too, but I am not sure. – Silvio Duarte May 17 '23 at 20:17
  • @SilvioDuarte if you're unsure of the distribution, you can still use information criteria or other metrics to check the adequacy of the truncated distribution fit and select from among a panel of candidate distributions. A word of caution: with two or more model parameters, it's very likely that, even with few censored observations, the extrapolated fit in the tails will be poor. – AdamO May 17 '23 at 21:07
  • Yep I see why that's wrong - sorry @SilvioDuarte – Alex J May 17 '23 at 23:10
  • The answer could be almost anything. For instance, this national contest could be like the Putnam mathematics competition in the US and Canada, which is scored from 0 to 120 points. There have been years when the 98th percentile was under 40 and the median was 0 or 1 point. At the other extreme would be an exam that consists of many extremely easy questions and a few difficult ones, so that (in your case) the majority of scores could be bunched up just below 35. Consequently, your question is unanswerable. – whuber May 18 '23 at 13:08
  • I see... But is it possible to estimate it as a normal distribution? – Silvio Duarte May 19 '23 at 18:41
  • Sure -- but the fit would be truly terrible and the estimates of the parameters of that distribution would be hugely uncertain. Moreover, they would put a great deal of probability into the negative numbers, which wouldn't be realistic for this application. – whuber May 20 '23 at 15:30
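
The censored fit discussed in the comments can be sketched in Python using only the standard library. This is a minimal illustration under the assumption of a Poisson model, which the comments themselves argue is a poor description of these scores: the 3710 unseen scores contribute $P(X \le 34)$ each, and the 142 observed scores contribute their probability mass. The names `COUNTS`, `censored_loglik`, and `golden_max`, and the search bracket, are hypothetical choices for this example.

```python
import math

# Tally of the 142 observed scores (score: count), taken from the question's sample.
COUNTS = {53: 2, 51: 1, 49: 3, 48: 3, 47: 6, 46: 5, 45: 1, 44: 3, 43: 12,
          42: 10, 41: 12, 40: 12, 39: 12, 38: 13, 37: 16, 36: 15, 35: 16}
N_TOTAL = 3852  # total number of scores, observed or not
CUTOFF = 35     # scores below this value are censored


def pois_logpmf(k, lam):
    """Log of the Poisson(lam) probability mass at k."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def pois_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), by direct summation."""
    return sum(math.exp(pois_logpmf(j, lam)) for j in range(k + 1))


def censored_loglik(lam):
    """Log-likelihood with the unseen scores lumped into the interval (-inf, 34]."""
    n_censored = N_TOTAL - sum(COUNTS.values())  # 3710 scores below the cutoff
    ll = n_censored * math.log(pois_cdf(CUTOFF - 1, lam))
    ll += sum(c * pois_logpmf(x, lam) for x, c in COUNTS.items())
    return ll


def golden_max(f, lo, hi, tol=1e-7):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if f(c) > f(d):
            hi = d  # the maximum lies in [lo, d]
        else:
            lo = c  # the maximum lies in [c, hi]
    return (lo + hi) / 2.0


# The bracket [5, 60] is an assumption chosen to contain the maximum.
lam_hat = golden_max(censored_loglik, 5.0, 60.0)
print(f"censored Poisson MLE: lambda = {lam_hat:.2f}")

# Compare observed tail counts with the counts the fitted model expects.
expected = {x: N_TOTAL * math.exp(pois_logpmf(x, lam_hat)) for x in range(CUTOFF, 54)}
for x in (35, 40, 45, 50):
    print(f"score {x}: observed {COUNTS.get(x, 0)}, expected {expected[x]:.1f}")
```

The estimate should come out close to the $\hat\lambda = 25.73$ quoted above. Comparing the observed and expected counts gives a rough view of the mismatch described in the comments; it is not a substitute for a formal goodness-of-fit test.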