2

A variable, which should contain percents, also contains some "ratio" values, for example:

0.61
41
54
.4
.39
20
52
0.7
12
70
82

The true distribution is unknown, but I expect it to be unimodal, with most values (say, over 70%) falling between 50% and 80%. Very low values (e.g., 0.1%) are also possible.

Are there any formal or systematic approaches to determine the likely format in which each value is recorded (i.e., ratio or percent), assuming no other variables are available?

Orion
  • 4
    I'm voting to close this question as off-topic because it is impossible to definitively answer. If you don't know what the data mean, how will strangers on the internet know? – Sycorax Mar 04 '19 at 21:17
  • 2
    What the data mean != what is the (data) mean. – Nick Cox Mar 04 '19 at 21:27
  • 1
    You have 3 options: your big numbers are falsely big and need a decimal point in front; your small numbers are falsely small and need a 100x multiplier; or your data are just fine. Why don't you plot the qqnorm of all three options? – EngrStudent Mar 04 '19 at 22:07
  • 2
    There are plenty of potentially efficient ways to approach this. The choice depends on how many values are 1.0 or less and how many values exceed 1.0. Could you tell us these quantities for the problem(s) you have to deal with? @EngrStudent The interest lies in (hypothetical) situations where some of the very low values actually are percents. That can lead to exponentially many options (as a function of the dataset size) rather than just three (actually two--two of your options lead to the same solution). – whuber Mar 04 '19 at 22:46
  • 7
    I'm guessing that "ask the people who collected the data" isn't a valid option, here? – nick012000 Mar 05 '19 at 02:45
  • I have voted to close, but I do find the question interesting. There is not really a formal method for this problem (except the prescription to improve the data gathering). But some custom approach can be designed. The question lacks more precise practical information about what the problem is. – Sextus Empiricus Mar 05 '19 at 17:19
  • The data being unimodal can help and the distribution for >1% can help to place a posterior probability on whether an observed value <1.0 is a ratio or percentage. – Sextus Empiricus Mar 05 '19 at 17:26
  • @Martijn Certainly there are "formal methods" to deal with this, even nonparametric ones. How about a modification of a mixture model, for instance? – whuber Mar 05 '19 at 18:11
  • @whuber, sure you can deal with this in a formal way. I may have misinterpreted formal. I was more thinking like there is no common off-the-shelf standard cookbook method that deals with this issue. I did not think of your idea of a mixture model, which can be indeed an example case with a formal method available. But, we do not know what is in the OP hands and what he or she is trying to achieve. Is it really a mixture model or might the data be mixed up in a more complicated way? Does the OP have a distribution on which the mixture model can be based? What is the OP's objective? – Sextus Empiricus Mar 05 '19 at 19:24
  • The question "to determine the likely format in which each value is recorded" is different from the problem of fitting a (mixture) model to the ensemble of data. I also wonder how you are gonna fit the mixture model with the currently provided data/information. I see this more as craft than science. – Sextus Empiricus Mar 05 '19 at 19:27
  • @Martijn The link with mixture models is that we might view the units of measurement of each value (percent or decimal) as a binary latent value and use maximum likelihood to estimate those values. Although ML would seem to require a parametric model, if one adopts a reasonably flexible family--and in this case even a Normal family will do--it turns out the results can be quite robust to the distributional assumption. – whuber Mar 05 '19 at 21:33

2 Answers

5

Assuming

  • The only data you have is the percents/ratios (no other related explanatory variables)
  • Your percents come from a unimodal distribution $P$, and the ratios come from the same unimodal distribution $P$ scaled down by a factor of $100$ (call it $P_{100}$).
  • The percent/ratios are all between $0$ and $100$.

Then there's a single cutoff point $K$ (with $K < 1.0$ obviously) where everything under $K$ is more likely to be sampled from $P_{100}$ and everything over $K$ is more likely to be sampled from $P$.

You should be able to set up a maximum likelihood function with a binary parameter on each datapoint, plus any parameters of your chosen $P$.
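As a rough illustration of that likelihood setup, here is a minimal sketch, not a definitive implementation. It assumes $P$ is Normal on the percent scale (a choice made here for convenience, not something the question establishes), treats every value above 1 as a percent, and alternates between refitting the Normal and relabelling each ambiguous value. The names `classify_units` and `normal_logpdf` are hypothetical. A ratio is a percent divided by 100, so its log-density picks up an extra `log(100)` term.

```python
import math
from statistics import fmean, pstdev


def normal_logpdf(x, mu, sigma):
    """Log-density of a Normal(mu, sigma) distribution at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def classify_units(values, max_iter=100):
    """Label each value as ratio (True) or percent (False).

    Assumption: P is Normal on the percent scale. Values above 1
    are always treated as percents; only values <= 1 are ambiguous.
    """
    # Initial guess: every value at or below 1 was recorded as a ratio.
    is_ratio = [v <= 1.0 for v in values]
    for _ in range(max_iter):
        # Put everything on the percent scale under the current labels.
        pct = [100 * v if r else v for v, r in zip(values, is_ratio)]
        mu = fmean(pct)
        sigma = max(pstdev(pct), 1e-9)  # guard against a zero spread
        # Relabel: compare the log-likelihood of each reading.
        new_labels = [
            v <= 1.0
            and normal_logpdf(100 * v, mu, sigma) + math.log(100)
            > normal_logpdf(v, mu, sigma)
            for v in values
        ]
        if new_labels == is_ratio:
            break  # labels are stable, so the fit has converged
        is_ratio = new_labels
    return is_ratio, mu, sigma


# The sample from the question.
values = [0.61, 41, 54, 0.4, 0.39, 20, 52, 0.7, 12, 70, 82]
flags, mu, sigma = classify_units(values)
```

On the question's sample this labels the four values at or below 1 as ratios and leaves the rest as percents. With a Normal $P$ centred near 50, the `log(100)` term tends to favour the ratio reading across most of $(0, 1]$, so a genuinely tiny percent such as 0.1 would be hard to detect under this particular assumption.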

Afterwards, find $K$, the point where the densities of $P$ and $P_{100}$ intersect, and use it to clean your data.
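Under the assumptions above, the crossing point can be written out explicitly. Here $p$ denotes the density of $P$ (a symbol introduced only for this illustration). Because a ratio is a percent divided by $100$, a change of variables gives $P_{100}$ the density $100\,p(100x)$, so the cutoff is a solution in $(0, 1)$, if one exists, of

$$p(K) = 100\,p(100K).$$

A value $x$ is more plausibly a percent when $p(x) > 100\,p(100x)$, and more plausibly a ratio otherwise.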

In practice, just split your data into the ranges 0-1 and 1-100, plot a histogram of each, and experiment with candidate values of $K$.

Nick Cox
djma
  • I don't think that this addresses the question. This approach establishes two intervals $(0.0, K], (K, 1.0]$, where one is proposed to be multiplied by 100 and a second left as-is. OP is asking how to determine which values should be multiplied by 100; based on the description in the question, the "squashed" values can appear anywhere in $(0.0, 1.0]$, not solely on one side of $K$ or the other. – Sycorax Mar 06 '19 at 16:38
  • @Sycorax indeed they can appear anywhere, but without any additional information, that's the best we can do. The hope is that the output of this exercise is better than doing nothing for whatever purpose OP had in mind. E.g. if the OP needs an estimate of the mean of that dataset, s/he would be better off using the "K adjustment" than not doing so. – djma Mar 07 '19 at 20:54
0

Here's one method of determining whether your data are percents or proportions: if there are out-of-bounds values for a proportion (e.g. 52, 70, 82, 41, 54, to name a few) then they must be percents.

Therefore, your data must be percents. You're welcome.

  • 3
    The issue is that the two are mixed together. It’s not all percents or all ratios/proportions. 49 is a percentage, but 0.49 could be either. – The Laconic Mar 04 '19 at 21:29
  • 3
    If you can't assume there is a unified format for all of the rows, then the question is obviously unanswerable. In the absence of any other information, it's anyone's guess whether the 0.4 is a proportion or a percentage. I chose to answer the only possible answerable interpretation of the question. – beta1_equals_beta2 Mar 04 '19 at 21:31