Distance matrix for multiple compositional data

Question

Each data point in my data set consists of multiple compositions (I'm not sure whether this is the correct term for this kind of data). For example:

Data point $X_i = \{a_1, a_2, a_3, b_1, b_2, b_3, c_1, c_2, c_3\}$

$a_1 + a_2 + a_3 = 1,\ b_1 + b_2 + b_3 = 1,\ c_1 + c_2 + c_3 = 1$

$a_j, b_j, c_j > 0\ \forall j \in \{1, 2, 3\}$

I want to do hierarchical clustering on this data, but I don't know how to define the distance matrix for it. The literature on compositional data seems to deal with single-composition data only (e.g., the compositions package in R).

Could anyone suggest a solution, please?

EDIT: Some additional details on my case:

I'm analyzing the behavior of optimization algorithms on a set of instances. For every pair (algorithm A, instance I), a composition of three components (as ratios) is generated. This composition represents the behavior of algorithm A on instance I. I would now like to cluster the algorithms based on their behavior across the instance set, using hierarchical clustering with average linkage.

May I just calculate the distance matrix for each composition based on Aitchison's distance, and sum them up for every algorithm pair?
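
To make the proposal concrete, here is a minimal Python sketch of the "sum of per-instance Aitchison distances" idea. It uses only the standard library; the function names and the example numbers are made up for illustration and do not come from the compositions package. It assumes every part is strictly positive, as stated above.

```python
import math

def aitchison_distance(x, y):
    """Aitchison distance: Euclidean distance between the CLR vectors of x and y."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    mean_x = sum(log_x) / len(log_x)  # log of the geometric mean of x
    mean_y = sum(log_y) / len(log_y)  # log of the geometric mean of y
    return math.sqrt(
        sum(((a - mean_x) - (b - mean_y)) ** 2 for a, b in zip(log_x, log_y))
    )

def summed_distance(alg_p, alg_q):
    """Sum the Aitchison distances over all instances (one composition per instance)."""
    return sum(aitchison_distance(x, y) for x, y in zip(alg_p, alg_q))

# Hypothetical data: two algorithms, each with one 3-part composition per instance.
alg_1 = [(0.2, 0.3, 0.5), (0.1, 0.1, 0.8), (0.4, 0.4, 0.2)]
alg_2 = [(0.3, 0.3, 0.4), (0.2, 0.2, 0.6), (0.5, 0.3, 0.2)]

d_12 = summed_distance(alg_1, alg_2)
```

A sum of distances is itself non-negative, symmetric, and satisfies the triangle inequality, so the result can be used as a dissimilarity for clustering; whether it is a meaningful one for these data is the open question.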

It depends on the purpose of your analysis, because there are many ways to combine distances on single compositions into distances on vectors of compositions. If you could add some information to your post to explain more of the context and your objectives, then people will be more likely to provide useful and appropriate answers. — whuber, Nov 09 '15 at 21:52
Thank you. I've added a more detailed description. Hopefully it is clear now. You said that there are many ways to combine them; could you please give me some examples? (Sorry if my question is trivial; I'm a beginner in this field.) — ndang, Nov 10 '15 at 00:33
So, for each case (algorithm) you have three triplets of components (as proportions), right? My immediate thought is to drop any one component from each triplet. Since a1+a2+a3=1, one of the 3 is redundant. You are left with a vector of length 6 (say, a1,a2,b1,b2,c1,c2) for each case. Now compute, between cases, any distance you find reasonable. Euclidean seems OK to me. Or the dot product, if you want a similarity. If you don't like the suggestion, please say why. — ttnphns, Nov 10 '15 at 08:16
As you suggested, you can calculate the Aitchison distance for each composition. Whether summing the resulting 3 distances makes sense depends on what the 3 compositions represent. For example, if you had the weight, height and age of 2 people, how would you define the distance between them? Would you define it as {difference in weight} + {difference in height} + {difference in age}? Perhaps you could standardize each composition (function scale in the compositions package) before calculating the distances. — marc1s, Nov 10 '15 at 12:24

score 1 · Answer 1 · edited Mar 29 '16 at 13:50

I believe you could apply the CLR (centered log-ratio) transform to each of the independent compositions $\{a, b, c\}$ and then calculate the Euclidean distance on the CLR-transformed data, e.g.,

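$a^* = (\operatorname{clr}(a_1), \operatorname{clr}(a_2), \operatorname{clr}(a_3))$ with $\operatorname{clr}(a_k) = \ln\big(a_k / g(a)\big)$, where the geometric mean $g(a) = (a_1 a_2 a_3)^{1/3}$ is taken over the $a$ composition only (and likewise for $b$ and $c$).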

I would propose using the following distance:

$dist(X_i,X_j) = \sqrt{(a_{i,1}^*-a_{j,1}^*)^2+(a_{i,2}^*-a_{j,2}^*)^2+\dots + (b_{i,1}^*-b_{j,1}^*)^2+\dots}$
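
As a rough illustration of the formula above, the following Python sketch CLR-transforms each block separately, concatenates the results, and takes the Euclidean distance. It uses only the standard library; the helper names (`clr`, `clr_concat`, `dist`, `distance_matrix`) and the three example points are hypothetical, and all parts are assumed to be strictly positive.

```python
import math

def clr(composition):
    """Centered log-ratio transform of a single composition (all parts > 0)."""
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean of this block
    return [value - mean_log for value in logs]

def clr_concat(point):
    """CLR-transform each block (a, b, c, ...) on its own, then concatenate."""
    return [value for block in point for value in clr(block)]

def dist(point_i, point_j):
    """Euclidean distance between the concatenated CLR vectors of two points."""
    u, v = clr_concat(point_i), clr_concat(point_j)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def distance_matrix(points):
    """Full symmetric matrix of pairwise distances."""
    n = len(points)
    return [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]

# Hypothetical data: each point is (a, b, c), and each block sums to 1.
points = [
    [(0.2, 0.3, 0.5), (0.1, 0.1, 0.8), (0.4, 0.4, 0.2)],
    [(0.3, 0.3, 0.4), (0.2, 0.2, 0.6), (0.5, 0.3, 0.2)],
    [(0.6, 0.2, 0.2), (0.7, 0.2, 0.1), (0.1, 0.1, 0.8)],
]
D = distance_matrix(points)
```

Since the Aitchison distance between two compositions is the Euclidean distance between their CLR vectors, this `dist` equals the square root of the sum of the squared per-block Aitchison distances, rather than their plain sum. The resulting matrix could then be handed to a hierarchical clustering routine with average linkage, for example `hclust(as.dist(D), method = "average")` in R.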
