7

I have a table with two columns, X and Y. Each row holds aggregate statistics for one instance. I introduce a new column Z = X / Y, which is another important piece of information about the instance. Now I want to present overall statistics across the instances (i.e. the mean).

My question: which should I use to represent the mean of Z, Mean(X / Y) or Mean(X) / Mean(Y)? The obvious choice would be Mean(X / Y), simply because Z = X / Y.

However, I have two concerns:

  • Mean(Y) * Mean(Z) != Mean(X), which makes it hard for people to trust the numbers.
  • The difference between Mean(X/Y) and Mean(X)/Mean(Y) is large. Does the difference itself say something statistically meaningful?
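
A minimal sketch of the discrepancy, assuming three made-up rows (the values below are hypothetical and not taken from the actual table):

```python
from statistics import mean

# Hypothetical (X, Y) rows; not from the real table.
rows = [(1, 1), (1, 4), (8, 2)]

xs = [x for x, _ in rows]
ys = [y for _, y in rows]
zs = [x / y for x, y in rows]          # Z = X / Y for each row

mean_of_ratios = mean(zs)              # Mean(X / Y)
ratio_of_means = mean(xs) / mean(ys)   # Mean(X) / Mean(Y)

print("Mean(X/Y)         =", mean_of_ratios)
print("Mean(X)/Mean(Y)   =", ratio_of_means)
print("Mean(Y) * Mean(Z) =", mean(ys) * mean(zs))
print("Mean(X)           =", mean(xs))
```

With these rows Mean(X/Y) is 1.75 while Mean(X)/Mean(Y) is 10/7 (about 1.43), so Mean(Y) * Mean(Z) does not recover Mean(X).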

Update: here is my concrete case. (Note that Z is now defined as Y/X rather than X/Y.)

The table keeps the user records on a system. Users can upload data to it.

  • X: the number of uploads
  • Y: the volume of uploads
  • Z: Y/X; volume per upload

What I want to do is simulate such a system with workloads that are similar to the real ones.

I simply create N instances of users (N cannot be too large) with X' = Mean(X) and Z' = Mean(Z).

So during the simulation each user uploads data of total volume: (X') * (Z').

Then when I aggregate the simulation results, I end up with: Mean(Y') != Mean(Y).
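
The mismatch can be reproduced with a few made-up users. This is only a sketch: the three (X, Y) pairs are hypothetical, and the second variant, which sets Z' to Mean(Y) / Mean(X), is included purely for comparison.

```python
from statistics import mean

# Hypothetical users as (X = number of uploads, Y = volume of uploads).
users = [(1, 10), (2, 4), (10, 10)]

xs = [x for x, _ in users]
ys = [y for _, y in users]
zs = [y / x for x, y in users]         # Z = Y / X, volume per upload

x_prime = mean(xs)                     # X' = Mean(X)

# Variant 1: Z' = Mean(Z). Every simulated user uploads X' * Z'.
naive_mean_y = x_prime * mean(zs)

# Variant 2 (for comparison): Z' = Mean(Y) / Mean(X).
consistent_mean_y = x_prime * (mean(ys) / mean(xs))

print("Mean(Y) in the table    =", mean(ys))
print("Mean(Y') with Mean(Z)   =", naive_mean_y)
print("Mean(Y') with ratio     =", consistent_mean_y)
```

Because all simulated users are identical, Mean(Y') is just X' * Z'. With Z' = Mean(Z) it differs from Mean(Y) whenever X and Z are correlated across users.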

syko
  • 183
  • 2
    See "ratio estimators", e.g.: http://stats.stackexchange.com/questions/164738/confidence-interval-of-ratio-estimator/164745#164745 –  Sep 06 '16 at 14:21
  • @fcop Hmm, when does ratio estimation help? When I want to run a simulation with N instances (each with Mean(X), Mean(Y) and Mean(Z) characteristics) based on the statistics? Can I use the ratio estimator instead of Mean(Z)? – syko Sep 06 '16 at 15:03
  • Can you say more about the context, e.g. what could be X, what could be Y? –  Sep 06 '16 at 15:09
  • @fcop I updated the post. Could you take a look at it? – syko Sep 06 '16 at 15:29
  • @fcop What's the best option for conveying both pieces of information? Not using the mean? – syko Sep 06 '16 at 15:33
  • 1
    I am on the train now; I'll answer in the evening. –  Sep 06 '16 at 15:34
  • 1
    Have you examined the distribution of volume per upload, or of uploads per user, not just mean values? For simulation you probably should be sampling from the distributions rather than just using mean values, in any event. – EdM Sep 06 '16 at 15:41
  • @EdM Agree. Using mean values does not work considering highly skewed distributions of X, Y and Z. Where can I start with sampling? Randomly selecting N different users from the table? The number of samples I need will depend on the distribution. Right? (Time to study...) – syko Sep 06 '16 at 16:04
  • 1
    Much good advice here, but I often find that a mean is unsuitable for summarizing such a ratio even if both quantities are strictly positive. The region $X < Y$ is mapped to $0 < (X / Y) < 1$ and the region $X > Y$ is mapped to $1 < (X / Y) < \infty$, which is quite asymmetric. The resulting distribution is often highly skewed, which alone can make means awkward or problematic. The remedy is often to work with the logarithm of the ratio and/or (equivalently) geometric means. – Nick Cox Sep 06 '16 at 17:45
  • The nature and numbers of the sampling you need to do might depend on what you are trying to accomplish with your simulation, which is not completely clear from the present version of your question. The more information about your goals that you provide, the more someone might be able to help. – EdM Sep 06 '16 at 17:48
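
The point about logarithms and geometric means in the comments above can be illustrated with a small sketch. The ratios below are hypothetical, and `geometric_mean` is a helper defined here for illustration only:

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive values, computed via logs."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical ratios X / Y, chosen to be symmetric on the log scale.
ratios = [0.25, 0.5, 2.0, 4.0]
inverse_ratios = [1 / r for r in ratios]   # the same data expressed as Y / X

arith = sum(ratios) / len(ratios)
arith_inv = sum(inverse_ratios) / len(inverse_ratios)
geo = geometric_mean(ratios)
geo_inv = geometric_mean(inverse_ratios)

print("arithmetic mean of X/Y:", arith, " of Y/X:", arith_inv)
print("geometric mean of X/Y: ", geo, " of Y/X:", geo_inv)
```

The arithmetic mean suggests that X exceeds Y on average and, at the same time, that Y exceeds X on average (both means are 1.6875). The geometric mean is consistent under inversion: the geometric mean of Y/X is the reciprocal of the geometric mean of X/Y.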

2 Answers

9

You should present Mean(X/Y) if X/Y is a useful measure and a mean is a useful way to summarize it. By Jensen's Inequality we know that the ratio of the means is not equal to the mean of the ratios except under some special circumstances.
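
One way to see this is a short worked inequality. It is a sketch under stated assumptions, not part of the original argument: $Y > 0$, $\text{E}[X] \ge 0$, and $X$ independent of $Y$. Since $y \mapsto 1/y$ is convex on $y > 0$, Jensen's Inequality gives

$$\text{E}\left[\frac{1}{Y}\right] \ge \frac{1}{\text{E}[Y]}, \qquad \text{so} \qquad \text{E}\left[\frac{X}{Y}\right] = \text{E}[X]\,\text{E}\left[\frac{1}{Y}\right] \ge \frac{\text{E}[X]}{\text{E}[Y]},$$

with equality in the first inequality only when $Y$ is constant. Without independence the two quantities can differ in either direction.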

AdamO
  • 62,637
  • Thanks, I think your answer is correct. However, as I mentioned in one of my concerns, people (who will not care what Jensen's Inequality is about) may tend not to believe the numbers, because Mean(Y) * Mean(Z) != Mean(X), which contradicts intuition. What would be the best way to explain it? – syko Sep 06 '16 at 14:47
  • 2
    @syko that's an epistemological problem. Be sure to carefully explain that they are distinct quantities. I don't think your example contradicts intuition. Take Y=-X , X = -1 or 1 with equal probability. – AdamO Sep 06 '16 at 14:57
  • @AdamO I think you've got an error in your calculation; in this case $E(1/Y) > 1$ since (with probability 1) $1/Y>1$. – Richard Rast Sep 06 '16 at 19:21
  • @R.M. What do you mean "limit of two independent random variables"? – AdamO Sep 06 '16 at 19:51
  • @AdamO What I meant was that you have two independent variables, and you construct an infinite number of pairs from samples of the two. -- I realize now that I was wrong about my comment, though, in that I neglected the complexity of the reciprocal. While Mean(X)*Mean(1/Y) = Mean(X/Y) for completely independent variables, you can't say that Mean(X)/Mean(Y) = Mean(X/Y), unless you have a rare distribution of Y such that Mean(1/Y) = 1/Mean(Y). So your intuition is assuming 1. the two variables are independent (uncorrelated) and 2. Mean(1/Y) = 1/Mean(Y), which isn't correct in general. – R.M. Sep 06 '16 at 22:59
5

$Z=Y/X$ may be meaningful for individual users as their individual average volume per upload, but $\text{Mean}(Y/X)$ does not look meaningful in aggregate as some users use the system more than others.

If you took a weighted mean of $Z=Y/X$ to account for this, the natural weights would be the numbers of uploads $X$, and the resulting weighted mean would turn out to be $$\text{Weighted Mean}(Z)=\frac{\text{Sum}(X \times Y/X)}{\text{Sum}(X)}=\frac{\text{Sum}(Y)}{\text{Sum}(X)}=\frac{\text{Mean}(Y)}{\text{Mean}(X)}$$ which would also be the aggregate average volume per upload across the system.

Your concerns are justified: it would probably be better to use the ratio of the means, $\text{Mean}(Y)/\text{Mean}(X)$.
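
The identity can be checked numerically. The following is a minimal sketch with hypothetical users; the (X, Y) values are invented for illustration:

```python
import math
from statistics import mean

# Hypothetical users as (X = number of uploads, Y = volume of uploads).
users = [(1, 10), (2, 4), (10, 10)]

xs = [x for x, _ in users]
ys = [y for _, y in users]
zs = [y / x for x, y in users]                 # Z = Y / X per user

# Weighted mean of Z, using the number of uploads X as weights.
weighted_mean_z = sum(x * z for x, z in zip(xs, zs)) / sum(xs)

ratio_of_sums = sum(ys) / sum(xs)              # Sum(Y) / Sum(X)
ratio_of_means = mean(ys) / mean(xs)           # Mean(Y) / Mean(X)
unweighted_mean_z = mean(zs)                   # Mean(Y / X)

print(weighted_mean_z, ratio_of_sums, ratio_of_means, unweighted_mean_z)
```

Here the first three quantities all equal 24/13 (about 1.85), while the unweighted Mean(Y/X) is 13/3 (about 4.33), pulled up by the single user with one large upload.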

Henry
  • 39,459