I'm building a simulation that outputs data as distributions, and I need a metric for how similar they are to ground-truth data. Specifically, I output things like this distribution of overweight/obesity levels in the population:

[Image: BMI Distribution]

That image is from national data and is therefore "the truth." My simulation outputs similar distributions, and I can break it down into unique BMI buckets, or by age, etc. I need a way to determine how similar the distributions are, but don't know of good ways to do this as my background isn't in statistics. What say you, statistics experts? What options do I have?

  • You have to specify in what sense you care about 'similarity'. Do you need the means to be the same, the SDs, the tails, etc.? Without that, nothing can be said. Also, we usually measure how different they are, rather than similar, for logical reasons. You might want to read: Similarity measure between multiple distributions. – gung - Reinstate Monica Feb 28 '18 at 01:49
  • That's helpful to know: I'm in "unknown unknowns" territory as this is outside my expertise. I suppose I'm looking at overall shape. The above distribution is normal-ish; perhaps I should be testing for mean and SD similarity, then? What about non-normal distributions? – Dylan Knowles Feb 28 '18 at 02:22
  • What would constitute "determin[ing] how similar" they are? You have a plot already, you could just look at that. There are other plots that might help. Do you need a number? (For what?) Do you need a test? (For what?) Would it be OK to have several numbers? Are you trying to get a measure of the similarity? Are you trying to rule out poor outputs? Etc. – gung - Reinstate Monica Feb 28 '18 at 02:45
  • I'm trying to rule out poor outputs. Essentially, I want to produce automated tests to determine that the difference between my output distributions and the real distribution isn't "too high". Up to now I've been eyeballing the distribution (quite happy with it!) but I need something automated. Maybe I should break it down into averages for different bands? (E.g., 8% have a BMI of 25, 2% a BMI of 34, and the model can only have a percent error of 30%.) I simply don't know if there's an established way to do this. – Dylan Knowles Feb 28 '18 at 03:27
  • The BMI is defined as the body mass divided by the square of the body height, and is universally expressed in units of kg/m², resulting from mass in kilograms and height in metres. However, metabolism does not scale by body surface area. BMI is an empirical measurement with no physical relationship to how fat someone really is. If you had the original data used to create the BMI, and changed the measurement to be physically motivated, then perhaps the problem could be addressed scientifically. Failing that, any test performed is suspect. – Carl Feb 28 '18 at 04:56
  • ... yes, Carl, but that doesn't help answer the question. BMI is simply one example output. – Dylan Knowles Feb 28 '18 at 23:10
  • Maybe the Kullback-Leibler divergence; see https://stats.stackexchange.com/questions/188903/intuition-on-the-kullback-leibler-kl-divergence/189758#189758 – kjetil b halvorsen Apr 08 '18 at 09:16
  • Bin the two distributions, then calculate the KL divergence between the two binned distributions. – lynnjohn Apr 08 '18 at 13:32
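
The binning suggestion in the last two comments could be sketched roughly as follows. This is only an illustration, not an established recipe: the function name `binned_kl_divergence`, the smoothing constant `eps`, the bin edges, and the synthetic "BMI" samples are all assumptions made up for the example, and any pass/fail threshold for an automated test would have to be chosen by you.

```python
import numpy as np

def binned_kl_divergence(true_samples, sim_samples, bin_edges, eps=1e-9):
    """Return KL(true || sim) after binning both samples on shared edges.

    eps is a small assumed smoothing constant that keeps empty bins from
    producing log(0) or division by zero. Values outside bin_edges are
    silently dropped by np.histogram.
    """
    p, _ = np.histogram(true_samples, bins=bin_edges)
    q, _ = np.histogram(sim_samples, bins=bin_edges)
    # Smooth, then normalise the counts into probabilities.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical data standing in for national data and simulation output.
rng = np.random.default_rng(0)
truth = rng.normal(27.0, 5.0, 10_000)
close_sim = rng.normal(27.0, 5.0, 10_000)   # same shape as the truth
shifted_sim = rng.normal(32.0, 5.0, 10_000) # mean is off by one SD
edges = np.arange(10.0, 60.1, 2.5)          # assumed BMI buckets

kl_close = binned_kl_divergence(truth, close_sim, edges)
kl_shifted = binned_kl_divergence(truth, shifted_sim, edges)
```

A divergence of zero means the binned distributions are identical, and larger values mean the simulation is further from the reference. Note that the result depends on the choice of bins and is not symmetric in its two arguments, so the reference data should always be passed in the same position.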

0 Answers