I have demographic data across different districts/neighbourhoods and would like to find, for a given district, its most similar peer district across multiple variables such as size (total population), race, and nationality. The idea is similar to the Chicago Federal Reserve’s Peer City Identification Tool (PCIT), which performs hierarchical cluster analysis on cities, using Ward’s method to cluster across multiple variables.
However, I realised that unlike the PCIT, my data is compositional for some variables, and non-compositional for others. For example, these are my variables of interest for each district:
- Total population (count, e.g. 75,000)
- Share of locals among population (e.g. 55% local, 45% foreigner)
- Share of race among locals (e.g. 70% Caucasian, 20% Asian, 10% Others)
- Share of stay duration among foreigners (e.g. 50% long-term stay, 30% medium-term, 20% short-term)
- Share of nationality among foreigners (e.g. 40% USA, 30% China, 20% Germany, 10% Others)
Given the above, is my understanding below correct?
- If I would like to avoid compositional data and log-ratio transformations entirely, I would take only one component from each compositional variable, so that my variables would be (1) total population; (2) share of locals; (3) share of Caucasian among locals, ignoring other races; (4) share of long-term stay among foreigners, ignoring other stay durations; (5) share of USA among foreigners, ignoring other nationalities. Under this approach, variables 2-5 would each be a continuous variable within [0,1], and I could proceed by computing Euclidean distances and then performing hierarchical clustering.
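To make this simpler approach concrete, here is a minimal sketch of what I have in mind. The district names and figures are made up purely for illustration, and log-transforming population and z-scoring each column are my own assumptions (otherwise population would dominate the distance), not something I have taken from the PCIT. It only finds the nearest peer by Euclidean distance; the hierarchical clustering step would run on the same standardised variables.

```python
import math
from statistics import mean, pstdev

# Hypothetical districts: (population, share local, share Caucasian among
# locals, share long-term among foreigners, share USA among foreigners).
districts = {
    "A": (75_000, 0.55, 0.70, 0.50, 0.40),
    "B": (80_000, 0.60, 0.65, 0.45, 0.35),
    "C": (20_000, 0.90, 0.30, 0.20, 0.10),
    "D": (150_000, 0.40, 0.50, 0.70, 0.60),
}

def standardise(rows):
    """Log-transform population (column 0), then z-score every column."""
    rows = [(math.log(r[0]),) + tuple(r[1:]) for r in rows]
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]  # assumes no column is constant
    return [tuple((x - m) / s for x, m, s in zip(r, mus, sds)) for r in rows]

names = list(districts)
z = dict(zip(names, standardise(list(districts.values()))))

def nearest_peer(name):
    """Return the other district with the smallest Euclidean distance."""
    others = [n for n in names if n != name]
    return min(others, key=lambda n: math.dist(z[name], z[n]))

print(nearest_peer("A"))
```
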
However, if I cannot afford to ignore the other components (% Asian, % medium/short-term stay, % China/Germany are also important), then I must deal with the compositional data. To do so, I could either:
- Use the Aitchison distance instead of the Euclidean distance for hierarchical clustering. However, would this be valid when one or more of my variables (here, total population) are non-compositional?
- Or, convert the compositional variables 2-5 using a log-ratio transformation (additive, centred, or isometric). I would then use these log-ratio-transformed variables alongside the non-compositional variable (total population) and calculate Euclidean distances for hierarchical clustering. This seems to be the approach proposed here.
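For the log-ratio route, here is a minimal sketch of how I imagine combining the two kinds of variable. It uses the centred log-ratio (clr), because the Euclidean distance between clr-transformed compositions equals the Aitchison distance, which is why the two options above seem closely related to me. The compositions are made-up values, the function names are my own, zeros are assumed to have been replaced beforehand (the log is undefined at zero), and simply concatenating the clr blocks with log population, with no weighting or rescaling, is an assumption I am unsure about.

```python
import math

def clr(parts):
    """Centred log-ratio: log of each part minus the mean log (parts > 0)."""
    logs = [math.log(p) for p in parts]
    centre = sum(logs) / len(logs)
    return [v - centre for v in logs]

def aitchison_distance(x, y):
    """Aitchison distance computed from all pairwise log-ratios."""
    d = len(x)
    total = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            total += (math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
    return math.sqrt(total / d)

def features(population, *compositions):
    """Log population followed by the clr coordinates of each composition."""
    vec = [math.log(population)]
    for comp in compositions:
        vec.extend(clr(comp))
    return vec

# Hypothetical race compositions for two districts.
race_a = [0.70, 0.20, 0.10]
race_b = [0.60, 0.25, 0.15]

# Euclidean distance on clr coordinates matches the Aitchison distance.
print(math.dist(clr(race_a), clr(race_b)), aitchison_distance(race_a, race_b))

# Hypothetical full feature vectors: population plus four compositions.
a = features(75_000, [0.55, 0.45], race_a, [0.50, 0.30, 0.20],
             [0.40, 0.30, 0.20, 0.10])
b = features(60_000, [0.60, 0.40], race_b, [0.40, 0.35, 0.25],
             [0.30, 0.30, 0.25, 0.15])
print(math.dist(a, b))  # mixed Euclidean distance between the two districts
```

Rows built this way could then be passed to a hierarchical clustering routine such as `scipy.cluster.hierarchy.linkage` with `method="ward"`. What I do not know is whether the blocks should be rescaled first, since a composition with more parts, or population itself, could otherwise dominate the distance.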
Any pointers on dealing with compositional data analysis alongside non-compositional data in multivariate analysis would be most welcome. Thank you!