
Suppose I have the minimum, mean, and maximum of some data set, say, 10, 20, and 25. Is there a way to:

  1. create a distribution from these data, and

  2. know what percentage of the population likely lies above or below the mean?

Edit:

As per Glen's suggestion, suppose we have a sample size of 200.

user132053
  • (1) is easy, because there are many solutions. (2) is best done in the context of some assumptions about the distributional shape, for otherwise all you can obtain are mathematical bounds. – whuber Sep 22 '16 at 20:51
  • You're being taken literally here in comments and answers so far, but a necessary caution (tacit, I think, in @whuber's remarks) is that there are so many distributions compatible with such information that you should not infer that you have enough information to do this at all well or reliably. In particular, if you don't even know the sample size, you can't do much even to think about uncertainty. – Nick Cox Sep 22 '16 at 22:17
  • When you ask about the proportion of the population that "lies above or below the mean" ... are you asking relative to the sample mean or population mean there? Are we talking about continuous or discrete variables? Do we know sample size? – Glen_b Sep 22 '16 at 23:54

3 Answers


I have the minimum, mean, and maximum of some data set, say, 10, 20, and 25. Is there a way to:

create a distribution from these data, and

There are an infinite number of possible distributions that would be consistent with those sample quantities.

know what percentage of the population likely lies above or below the mean

In the absence of some likely unjustified assumptions, not in general - at least not with much sense that it will be meaningful. The results will depend largely on your assumptions (there's not much information in the values themselves, though some particular arrangements do impart some useful information - see below).

It's not hard to come up with situations where the answers on the proportion question may be very different. When there are very different possible answers consistent with the information how would you know which situation you're in?
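To make that concrete, here is a minimal sketch using the sample size of 200 from the question's edit. The helper names `three_point_sample` and `share_below` are made up for this illustration, and the three-point construction is just one convenient way to hit the stated min, mean, and max; it is not a claim about what real data look like.

```python
from fractions import Fraction


def three_point_sample(n, lo, mean, hi, n_hi):
    """Build a sample with one value at `lo`, `n_hi` values at `hi`,
    and the remaining values at a common filler chosen so that the
    sample mean is exactly `mean`. (Hypothetical helper.)"""
    n_fill = n - 1 - n_hi
    if n_hi < 1 or n_fill < 1:
        raise ValueError("need at least one value at hi and one filler value")
    # Exact rational arithmetic avoids floating-point rounding in the mean.
    filler = (Fraction(n) * mean - lo - Fraction(n_hi) * hi) / n_fill
    if not lo <= filler <= hi:
        raise ValueError("filler would fall outside [lo, hi]")
    return [Fraction(lo)] + [filler] * n_fill + [Fraction(hi)] * n_hi


def share_below(sample, mean):
    """Fraction of sample values strictly below `mean`."""
    return Fraction(sum(x < mean for x in sample), len(sample))


# Two samples of size 200, both with min 10, mean 20, max 25.
a = three_point_sample(200, 10, 20, 25, n_hi=1)  # filler slightly above 20
b = three_point_sample(200, 10, 20, 25, n_hi=3)  # filler slightly below 20

print(share_below(a, 20))  # 1/200
print(share_below(b, 20))  # 197/200
```

Both samples reproduce the three reported numbers exactly, yet the share of values below the mean is 0.5% in one and 98.5% in the other. Without further assumptions, the summary alone cannot distinguish between them.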

More details may give helpful clues but as it stands (without even a sample size, though it's presumably at least 2, or 3 if the mean isn't halfway between the endpoints*) you won't necessarily get much of value on that question. You can try to get bounds, but in many cases they won't narrow things down a lot.

* Actually, if the mean is close to one endpoint you can get some lower bound on sample size. For example, if instead of 10, 20, 25 for your min/mean/max you had 10, 24, 25, then $n$ would have to be at least 15, and it would also suggest that most of the population was above 24; that's something. But if it were, say, 10, 18, 25, it's much harder to get a useful idea of what the sample size might be, let alone the proportion below the mean.
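The bound in the footnote can be checked with a short calculation. With one observation at the minimum and every other observation at most the maximum, the mean cannot exceed $(\min + (n-1)\max)/n$, which rearranges to $n \ge (\max - \min)/(\max - \text{mean})$; the symmetric argument gives $n \ge (\max - \min)/(\text{mean} - \min)$. The sketch below assumes real-valued data and uses a made-up function name, `min_sample_size`.

```python
from fractions import Fraction
from math import ceil


def min_sample_size(lo, mean, hi):
    """Smallest n for which a sample with minimum `lo`, maximum `hi`
    and the given mean can exist (real-valued data assumed).
    Hypothetical helper for illustration."""
    lo, mean, hi = Fraction(lo), Fraction(mean), Fraction(hi)
    if lo == hi:
        if mean != lo:
            raise ValueError("mean must equal the common value")
        return 1
    if not lo < mean < hi:
        raise ValueError("with min < max the mean must lie strictly between them")
    # One value sits at lo and the rest are at most hi:
    #   n * mean <= lo + (n - 1) * hi
    n_from_max = ceil((hi - lo) / (hi - mean))
    # One value sits at hi and the rest are at least lo:
    #   n * mean >= hi + (n - 1) * lo
    n_from_min = ceil((hi - lo) / (mean - lo))
    return max(2, n_from_max, n_from_min)


for triple in [(10, 20, 25), (10, 24, 25), (10, 18, 25)]:
    print(triple, min_sample_size(*triple))
```

For the three cases discussed this gives 3, 15, and 3 respectively, and it gives 2 only when the mean is exactly halfway between the endpoints. It is only a lower bound: any larger sample size is equally compatible with the same three numbers.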

Glen_b
  • "An infinity of ways?" Isn't this more than a little hyperbolic and condescending? – user78229 Sep 23 '16 at 00:04
  • @DJohnson I don't think it's hyperbolic -- it's quite literally true (though our ability to actually list them might fail after a few thousand and our ability to care to continue listing them might fail after a few dozen, it doesn't mean there are no other sets of assumptions we could operate under). There was no intent of condescension in my phrasing - it's deliberately chosen to actually indicate the true breadth of possible sets of assumptions. What would you like me to write? – Glen_b Sep 23 '16 at 00:19
  • "More than a few ways?" "Lots of possibilities?" "A plethora of options?" My objection to "an infinity of ways" -- and here we'll have to agree to disagree -- is that it can't be "literally true," as you would have us believe. – user78229 Sep 23 '16 at 01:10
  • It's a direct reference to the number of possible assumptions, as I have indicated above. Do you believe that there's literally only a finite number of different assumptions that could be made, beyond which one cannot pass? – Glen_b Sep 23 '16 at 01:21
  • Absolutely! The number of possible one- or two-parameter distributions has to have an upper limit. I believe that even Johnson and Kotz's four volume series would confirm that. – user78229 Sep 23 '16 at 01:25
  • 1. What is a reason to restrict the possibilities to two parameters at most? What if the data were drawn from a three parameter lognormal, for example? In many cases we can't estimate all the parameters from the data, but that's part of the problem I am trying to motivate there (it relates to the discussion of assumptions). 2. Johnson and Kotz is a subset of what distributions people have named/worked with, not remotely a bound on what assumptions are possible. I've invented numerous distributions that are not in Johnson and Kotz, and ... ctd – Glen_b Sep 23 '16 at 01:34
  • ctd ... I'm pretty sure that they're not all ruled out here. Even with no unspecified parameters, there's an infinity of possible cdfs, a non-finite subset of which would not be ruled out by the specified information. – Glen_b Sep 23 '16 at 01:34
  • You're the expert. I'll defer to you on this but if I made a statement like yours (re "infinity") I can only imagine the pushback it would get. – user78229 Sep 23 '16 at 01:40
  • @Djohnson Whatever the extent of any remaining disagreement, I appreciate your helpful comments. I will consider whether to at least more clearly indicate what I am really saying (my actual claim is capable of proof, were it needed, but perhaps I can at least state it clearly), and whether it should be differently phrased there. – Glen_b Sep 23 '16 at 05:07
  • @DJohnson Take two different distributions fulfilling the conditions: any mixture of the two will still satisfy the said conditions. That's literally an infinity: a non-enumerable one. – Elvis Sep 23 '16 at 05:56
  • @glen_b I, too, learned something from this thread. One additional comment: classification is a fundamental and universal cultural process. There are two cognitive approaches to this: splitters and lumpers. Splitters tend to amplify small differences and find many partitions. Lumpers, on the other hand, tend to gloss over those small differences and find broader groupings. One approach isn't better than another; they are just different. ctd – user78229 Sep 23 '16 at 11:54
  • ctd As a "lumper" and using Tim's graphs as an example, I might classify them into higher groupings, e.g., Weibull or uniform or triangle, blurring the minute differences. – user78229 Sep 23 '16 at 11:55
  • @DJohnson This is a common behavior for people (we all do a bit of both), but as with many things statistical (and many others less statistical), our instincts may sometimes lead us into doing things we may not enjoy the consequences of. One point I attempted to make in my post is there can be differences that are consequential for the result we're interested in. (This can happen even if drawings of the densities or of the cdfs may seem conceptually fairly similar -- or indeed visually indistinguishable) – Glen_b Sep 24 '16 at 01:33