Why is the coefficient of variation not valid when using data with positive and negative values?

Question

I can't seem to find a definitive answer to my question.

My data consists of several plots with measured means varying from 0.27 to 0.57. In my case, all data values are positive, but the measurement itself is based on a ratio of reflectance values that can range from -1 to +1. The plots represent values of the NDVI, a remotely derived indicator of vegetation "productivity".

My intention was to compare the variability of values at each plot, but since each plot has a different mean, I opted for using the CV to gauge the relative dispersion of NDVI values per plot.

From what I understand, taking the CV of these plots is not kosher because each plot can have both positive and negative values. Why is it not appropriate to use the CV in such instances? What would be some viable alternatives (e.g., a similar measure of relative dispersion, data transformations, etc.)?

What is the purpose of comparing the variability? Why don't you just compare measures of actual variability, like SD, MAD, range, or whatever, instead of a relative measure like the CV (which makes no sense here)? — whuber, Apr 17 '13 at 21:57
I'm using CV to account for differences in the means between plots. Does it not make sense because the values range between -1 and +1 in all plots? i.e., the "actual variability" would be more indicative of differences between plots? — Prophet60091, Apr 17 '13 at 22:13
CV is a relative measure of variation, by definition. It gives nonsensical results for any negative mean (you can't interpret a negative amount of dispersion or spread). For positive means, it makes a given amount of spread look much larger when the mean is small. When this is wanted, what you're doing is effectively equivalent to comparing your data on a logarithmic scale--and that makes no sense whenever any of the data could be zero or negative. It's possible your data might need some kind of re-expression to allow good comparisons of variability; it depends on how they are generated. — whuber, Apr 17 '13 at 22:16
+1 for explanation. While the means of my plots are all positive, there could be negative values within each plot. Based on the above, and Peter's answer below, it would appear using the CV is not warranted. I'll look at potentially rescaling the values and/or using measures of actual variability. — Prophet60091, Apr 17 '13 at 22:25
If you can sensibly rescale your data by adding a constant, then that would also mean CV is not a good idea. This is because adding a constant will change the CV but not change variation. — Peter Flom, Apr 18 '13 at 11:19
Does anyone have a solution to transform the data so that it can be compared to other sets? Is it possible to convert the data in a set to positives to find a clean CV? — , Jan 17 '14 at 05:07
Welcome to the site, @John. This isn't an answer to the OP's question. Please only use the "Your Answer" field to provide answers. If you have your own question, click the [ASK QUESTION] at the top of the page & ask it there. Then we can help you properly. Since you are new here, you may want to read our tour page, which contains information for new users. — gung - Reinstate Monica, Jan 17 '14 at 06:31

Peter Flom · Accepted Answer · 2013-04-19T10:55:00.917

12

Think about what the CV is: the ratio of the standard deviation to the mean. But if the variable can take both positive and negative values, the mean could be very close to 0. The CV then no longer does what it is supposed to do, that is, give a sense of how big the SD is compared to the mean.
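To make that concrete, here is a minimal sketch (not part of the original answer) that keeps the spread fixed and moves the mean toward and past zero. The names `cv`, `noise`, and `mus` and the chosen values are illustrative assumptions only.

```r
# Hypothetical illustration: identical spread, different means.
cv <- function(v) sd(v) / mean(v)

set.seed(1)
noise <- rnorm(1000, mean = 0, sd = 0.2)
noise <- noise - mean(noise)   # center exactly, so each sample mean equals mu

mus <- c(0.5, 0.1, 0.01, -0.1) # means approaching zero, then negative
sds <- sapply(mus, function(mu) sd(mu + noise))
cvs <- sapply(mus, function(mu) cv(mu + noise))

# The sd column is constant; the cv column grows as the mean nears 0
# and turns negative once the mean is negative.
data.frame(mean = mus, sd = sds, cv = cvs)
```

The spread never changes here, so any difference in the CV column comes entirely from the denominator.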

EDIT: In a comment, I said that if you could sensibly add a constant to the variable, CV wasn't good. Here is an example:

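set.seed(239920)
x <- rnorm(100, 10, 2)        # 100 draws with mean 10 and SD 2
min(x)                        # check that none are negative
(CVX <- sd(x) / mean(x))      # CV of the original data
x2 <- x + 10                  # shift by a constant; the spread is unchanged
(CVX2 <- sd(x2) / mean(x2))   # CV of the shifted data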

x2 is simply x + 10. I think it's intuitively clear that they are equally variable; but CV is different.

A real-life example would be x as a temperature in degrees Celsius and x2 as the same temperature in kelvins, where the shift is 273.15 rather than 10 (although there one could argue that the Kelvin scale is the proper one, since it has a true zero).
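The absolute measures suggested in the comments (SD, MAD, range) can be checked against the same kind of shift. This is a rough sketch rather than a recommendation of any one measure; the helper `spread` is a hypothetical name introduced here, and the IQR is added only as one more absolute measure.

```r
set.seed(239920)
x  <- rnorm(100, 10, 2)   # same toy setup as the example in this answer
x2 <- x + 10              # shifted copy

# Hypothetical helper: several absolute measures of spread, plus the CV.
spread <- function(v) {
  c(sd    = sd(v),
    mad   = mad(v),            # median absolute deviation (scaled by 1.4826)
    iqr   = IQR(v),
    range = diff(range(v)),
    cv    = sd(v) / mean(v))
}

# The first four columns agree between the rows; only cv differs.
rbind(x = spread(x), x2 = spread(x2))
```

Because SD, MAD, IQR, and range ignore the location of the data, they give the same answer on either scale, which is the property the CV lacks.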

edited Apr 19 '13 at 10:55

answered Apr 17 '13 at 20:12

Peter Flom


thx! So the concern is more about having a mean near zero, and not necessarily having positive and negative values in your data. If so, how close to a mean of zero is considered "very close"? In my case, I would say I'm far from having my means near zero. Is there a definitive way of determining this? – Prophet60091 Apr 17 '13 at 21:52
No, the concern is that the CV no longer does what it is supposed to do, even if there is only 1 negative value. If you have negative values, don't use CV. Also, if your values are on an arbitrary scale, don't use CV. – Peter Flom Apr 17 '13 at 22:13
For completeness, could you provide a little more explanation as to why using an arbitrary scale invalidates the use of the CV? Thx! – Prophet60091 Apr 18 '13 at 14:58
In all fairness, I think @whuber wasn't advocating the comparison of transformed vs. the untransformed data, but your point is still taken: scaling will affect the CV, when one might think the results should remain the same. +1 for toy R code! – Prophet60091 Apr 19 '13 at 14:48
I have no argument with @whuber 's comments on this thread. – Peter Flom Apr 19 '13 at 14:54

score 0 · Answer 2 · answered Aug 14 '14 at 20:43

I think of these as different models of variation. There are statistical models where the CV is constant. Where those work one may report a CV. There are models where the standard deviation is a power function of the mean. There are models where the standard deviation is constant. As a rule a constant-CV model is a better initial guess than a constant SD model, for ratio scale variables. You can speculate on why that would be true, perhaps based on prevalence of multiplicative rather than additive interactions.

Constant-CV modeling is often associated with logarithmic transformation. (An important exception is a nonnegative response that is sometimes zero.) There are a couple ways to look at that. First, if the CV is constant then logs are the conventional variance-stabilizing transformation. Alternatively, if your error model is lognormal with SD constant in the log scale, then the CV is a simple transformation of that SD. CV is about equal to log-scale SD when both are small.
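For the lognormal case, the relationship can be written down exactly: if log(Y) has standard deviation sigma, then the CV of Y is sqrt(exp(sigma^2) - 1). A minimal sketch of that formula follows; the function name `lognormal_cv`, the grid of sigma values, the seed, and the sample size are all arbitrary choices for illustration.

```r
# Exact CV of a lognormal variable whose log has standard deviation sigma.
lognormal_cv <- function(sigma) sqrt(exp(sigma^2) - 1)

sigma <- c(0.05, 0.1, 0.25, 0.5, 1)
# The ratio cv / sigma is close to 1 for small sigma and grows with sigma.
data.frame(sigma = sigma,
           cv    = lognormal_cv(sigma),
           ratio = lognormal_cv(sigma) / sigma)

# Simulation check for one small value of sigma.
set.seed(42)
y <- rlnorm(1e5, meanlog = 0, sdlog = 0.1)
c(sample_cv = sd(y) / mean(y), log_sd = sd(log(y)))
```

For small sigma the two quantities nearly coincide, while for sigma near 1 the CV is noticeably larger than the log-scale SD.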

There are two ways to apply stats 101 methods like a standard deviation: to the data as you got them, or (especially if they are on a ratio scale) to their logs. You make the best first guess you can, knowing that nature could be rather more complicated and that further study may be in order. Do take into account what folks have previously found productive with your kind of data.

Here's a case where this stuff is important. Chemical concentrations are sometimes summarized with a CV or modeled on a log scale. However, pH is already a log concentration, so it would not be logged again or summarized with a CV.

Thank you for your contribution, and welcome to our site! Could you make it clearer how your answer addresses the question about the validity of using a CV at all to characterize data that can have negative values? That situation would seem not to be covered by any of your remarks. — whuber, Aug 14 '14 at 21:16
