Background
Compositional data ($x_i>0, \sum_i x_i=c$) are usually analyzed using some kind of log-transformation (alr/clr/ilr), which naturally accounts for the fact that, in the presence of the sum constraint, only the relative scaling of the data values matters (see here, here and also this answer). A particular obstacle to this approach is data containing zeros, which must be imputed using one of the available imputation strategies - see here. A less inhibiting difficulty is having to work on a simplex, with its non-intuitive geometry (although I suppose the intuition comes with practice).
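To make the zero problem concrete, here is a minimal sketch of the centered log-ratio (clr) transform in plain Python. The function name `clr` and the example values are illustrative assumptions, not taken from any particular library. It shows that rescaling the total $c$ leaves the clr coordinates unchanged, and that a single zero part makes the transform undefined.

```python
import math

def clr(x):
    """Centered log-ratio transform of a strictly positive composition."""
    if any(v <= 0 for v in x):
        # log(0) is undefined, which is why zeros must be imputed first
        raise ValueError("clr is undefined for zero or negative parts")
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log is the same as taking the log of each part
    # divided by the geometric mean of all parts
    return [l - mean_log for l in logs]

a = [0.2, 0.3, 0.5]     # composition with total c = 1
b = [20.0, 30.0, 50.0]  # same composition with total c = 100
clr_a = clr(a)
clr_b = clr(b)          # equal to clr_a up to rounding: only ratios matter
```

Calling `clr([0.0, 0.5, 0.5])` raises `ValueError`, which is the situation that imputation strategies are meant to avoid.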

Possible alternative (Hellinger distance)
A possible alternative is to perform a square root transformation, $y_i=\sqrt{x_i}$, which confines the data points to a sphere, $\sum_i y_i^2=c$. One can then define the distance as the cosine distance (or the geodesic distance), which permits the application of other statistical techniques. The advantages are:

  • spherical geometry (particularly the distances in this geometry) is more intuitive
  • there is no need for imputation of zeros
  • the fact that $y_i^2 \geq 0$ holds automatically may facilitate the application of methods/algorithms (statistical or numerical) that do not keep $y_i$ positive.

The particular disadvantage is not taking directly into account the scaling nature of the data.
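The square root approach can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names are hypothetical, each composition is first normalized to total $c=1$, and the Hellinger distance is taken with the usual $1/\sqrt{2}$ factor so that it lies in $[0,1]$. Zeros need no special treatment, and both distances are bounded from above.

```python
import math

def sqrt_transform(x):
    """Map a composition to the unit sphere: y_i = sqrt(x_i / sum(x))."""
    total = sum(x)
    return [math.sqrt(v / total) for v in x]

def geodesic_distance(x, z):
    """Angle (great-circle distance) between two square-root-transformed points."""
    y, w = sqrt_transform(x), sqrt_transform(z)
    cos_angle = sum(p * q for p, q in zip(y, w))
    # Clamp to [-1, 1] to guard against rounding before calling acos
    cos_angle = min(1.0, max(-1.0, cos_angle))
    return math.acos(cos_angle)

def hellinger_distance(x, z):
    """Hellinger distance: Euclidean distance of the square roots, scaled by 1/sqrt(2)."""
    y, w = sqrt_transform(x), sqrt_transform(z)
    return math.sqrt(0.5 * sum((p - q) ** 2 for p, q in zip(y, w)))

# Two compositions with no parts in common (zeros are allowed)
x = [1.0, 0.0, 0.0]
z = [0.0, 1.0, 0.0]
d_geo = geodesic_distance(x, z)    # the largest possible angle, pi / 2
d_hel = hellinger_distance(x, z)   # the largest possible Hellinger distance, 1
```

Because all $y_i$ are non-negative, the points lie on a single orthant of the sphere, so the geodesic distance can never exceed $\pi/2$.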

Question
What are the particular pitfalls that I may be missing here (but which explain why this approach does not seem broadly used)? Are there alternative ways of dealing with compositional data containing zeros?

Update
Another particularity of this approach is that the distance is bounded from above - this could limit application of some statistical techniques, where the variables should be able to span the whole real axis.

Example
The particular application that I have in mind is the gene count data originating from metagenomic analyses of bacterial species. These typically come in the form of count tables, giving a number of times a gene was detected in a sample (more precisely, the number of sequencing reads that mapped onto this gene). The sum over all the genes is referred to as sequencing depth. The zeros may appear either because the corresponding genes are absent from the sample (i.e., the species carrying these genes are absent) or because the genes (species) are present in a very low concentration, undetected at the given sequencing depth.

Update 2
Table 1 of this article presents a range of distance measures used for the analysis of microbial data. In particular, alongside the Aitchison distance it provides hypersphere-based measures, such as the Bhattacharyya and Hellinger distances.

Update 3
Several transformations that allow working with zeros are listed in this thread.

Roger V.
  • 1
    Michael Greenacre's book https://www.routledge.com/Compositional-Data-Analysis-in-Practice/Greenacre/p/book/9781138316430 includes gentle propaganda for correspondence analysis as a useful method that doesn't imply the absence of zero. This doesn't solve all problems, but it can be useful. – Nick Cox Apr 12 '21 at 07:49
  • 1
    Have you seen https://stats.stackexchange.com/a/259223/919? – whuber Oct 10 '22 at 16:42
  • @whuber this very informative answer is already mentioned in the OP (in the "Background" section.) – Roger V. Oct 12 '22 at 11:06
  • I missed that because the characterization in that section is incorrect: this is not "some kind of log transformation:" it's a power transformation, which means that for positive powers it will work with zeros. – whuber Oct 12 '22 at 12:15

1 Answer

A few thoughts.

  1. The log transforms you cited are really log transforms of the ratios which you see all through statistics, e.g. logit $=\log(\frac{p}{1-p})$. The log of a ratio linearizes the computation because $\log(\frac{a}{b}) = \log(a) - \log(b)$. This has nice computational properties. If it were only a Box-Cox transform of the raw data (not ratios), then a $\mathrm{sqrt}$ could be just as good as a $\log$, or even better in the presence of zeros.

  2. The log transform takes numbers and ratios on $(0, \infty)$ to $(-\infty, \infty)$. Again, this has nice statistical properties: the results are on the same support as the Normal distribution, and it is easier for solvers to find solutions to unconstrained optimizations. The $\mathrm{sqrt}$ keeps the ratios on $(0, \infty)$.

  3. The problem of zeros in compositional data sometimes involves a Minimum Detectable Level. You will see statisticians overwrite recorded values below the minimum detectable level as "0" or even "MDL" in their analysis. Neither is correct. The best method is to treat those values as censored so that any likelihood calculation incorporates the uncertainty in the actual experimental procedure.
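The censoring idea in point 3 can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the measurements are modeled as log-normal, the function names and example values are hypothetical, and the detection limit `mdl` is treated as known. Detected values contribute a density term, while each value below the limit contributes the probability of falling below it, rather than being replaced by "0" or by the limit itself.

```python
import math

def normal_logpdf(z, mu, sigma):
    """Log-density of a Normal(mu, sigma) distribution at z."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)

def normal_cdf(z, mu, sigma):
    """Cumulative distribution function of Normal(mu, sigma) at z."""
    return 0.5 * (1.0 + math.erf((z - mu) / (sigma * math.sqrt(2.0))))

def censored_loglik(observed, n_censored, mdl, mu, sigma):
    """Log-likelihood of log-values that are left-censored at log(mdl).

    The density is that of the log-values; the Jacobian of the log
    transform is omitted because it does not depend on mu or sigma.
    """
    # Detected values: ordinary density contribution
    ll = sum(normal_logpdf(math.log(v), mu, sigma) for v in observed)
    # Non-detects: we only know that each value lies below the limit
    ll += n_censored * math.log(normal_cdf(math.log(mdl), mu, sigma))
    return ll

observed = [1.2, 0.8, 2.5, 1.7]  # hypothetical detected values
ll = censored_loglik(observed, n_censored=2, mdl=0.5, mu=0.0, sigma=1.0)
```

Maximizing `censored_loglik` over `mu` and `sigma` (for example with a numerical optimizer) would then give estimates that account for the non-detects without imputing them.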

Nick Cox
R Carnell