I have a table with two columns, X and Y. Each row holds aggregate statistics for one instance. I introduce a new column Z = X / Y, which is another important piece of information about the instance. Now I want to present overall statistics of the instances (i.e. the mean).
Here is my question: which of Mean(X / Y) and Mean(X) / Mean(Y) should I use to represent the mean of Z? Naively, it should be Mean(X / Y), simply because Z = X / Y.
However, I have two concerns:
- Mean(Y) * Mean(Z) != Mean(X), which makes it hard for people to trust the numbers.
- The difference between Mean(X/Y) and Mean(X)/Mean(Y) is large in my data. Does the difference itself tell us something statistically meaningful? (See the toy example after this list.)
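To make the second concern concrete, here is a toy sketch with made-up numbers (none of these values come from my real table); it also shows that Mean(X)/Mean(Y) is just the Y-weighted mean of the per-row ratios:

```python
import numpy as np

# Made-up values, purely for illustration (not my real data).
X = np.array([10.0, 1.0, 4.0])
Y = np.array([2.0, 5.0, 4.0])
Z = X / Y                          # per-row ratio

print(Z.mean())                    # Mean(X/Y)        ~ 2.07
print(X.mean() / Y.mean())         # Mean(X)/Mean(Y)  ~ 1.36

# Mean(X)/Mean(Y) equals the Y-weighted mean of the per-row ratios:
print(np.average(Z, weights=Y))    # ~ 1.36, identical to Mean(X)/Mean(Y)
```

So the gap between the two numbers is exactly a weighting effect: rows with large Y pull Mean(X)/Mean(Y) toward their own ratio, while Mean(X/Y) gives every row equal weight.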
Update: here is my concrete case.
The table keeps per-user records for a system to which users can upload data (note that in this concrete case the ratio of interest is Y/X rather than X/Y, but the question is the same). The columns are:
- X: the number of uploads
- Y: the volume of uploads
- Z: Y/X; volume per upload
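For instance, with two made-up users (numbers invented just for illustration), one making X = 1 upload of Y = 100 GB and another making X = 10 uploads totaling Y = 10 GB: Mean(Z) = (100 + 1) / 2 = 50.5 GB per upload, while Mean(Y) / Mean(X) = 55 / 5.5 = 10 GB per upload. The rare but huge upload dominates the first number; the frequent small uploads dominate the second.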
What I want to do is simulate such a system under workloads similar to the real ones.
I simply create N simulated users (N cannot be too large), each with X' = Mean(X) and Z' = Mean(Z).
So during the simulation each user uploads a total volume of X' * Z'.
But when I aggregate the simulation results, I end up with Mean(Y') != Mean(Y).
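Here is a minimal sketch of this mismatch, with synthetic data standing in for my real table (the distributions, the negative correlation between upload count and per-upload volume, and all the constants are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_real = 1000

# Synthetic "real" users; distributions and constants are made up purely for
# illustration, with a deliberate negative correlation: users who upload more
# often upload smaller files.
real_X = rng.integers(1, 50, size=n_real).astype(float)                    # uploads per user
real_Z = rng.lognormal(mean=0.0, sigma=0.5, size=n_real) * 500.0 / real_X  # volume per upload
real_Y = real_X * real_Z                                                   # total volume per user

# Simulated users, built the way I described: X' = Mean(X), Z' = Mean(Z).
N = 100
sim_X = np.full(N, real_X.mean())
sim_Z = np.full(N, real_Z.mean())
sim_Y = sim_X * sim_Z                        # each simulated user uploads X' * Z' in total

print("Mean(Y)  =", real_Y.mean())           # mean total volume of the real users
print("Mean(Y') =", sim_Y.mean())            # = Mean(X) * Mean(Z); not equal to Mean(Y) here

# Setting the per-upload volume to Mean(Y) / Mean(X) instead reproduces Mean(Y):
sim_Y_alt = sim_X * (real_Y.mean() / real_X.mean())
print("Mean(Y'')=", sim_Y_alt.mean())        # equals Mean(Y)
```

In the sketch, Mean(Y') comes out as Mean(X) * Mean(Z), which coincides with Mean(Y) only when X and Z are (close to) uncorrelated; replacing Z' with Mean(Y) / Mean(X) reproduces Mean(Y), but then each simulated upload no longer has the per-upload volume of a typical user.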