
I am clustering customers by the time they spend on our websites. When I use only one variable, time, for K-Means clustering with 10 clusters, the customers are unevenly distributed across the clusters.

However, when I use 5 variables (time plus 4 categorical variables), K-Means tends to assign a roughly equal number of customers to each cluster. I'd like to know why this happens: why do the clusters become more evenly sized when I add more variables, and is an evenly or an unevenly distributed result the better clustering? By "(un)evenly" I mean the "(un)even" number of points assigned to each cluster.
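
To make "(un)even" concrete, here is a minimal sketch that counts the points per cluster and reports the ratio of the largest to the smallest cluster. The helper name `cluster_size_summary` and the label lists are hypothetical, made up for illustration; they are not my real data or results.

```python
from collections import Counter


def cluster_size_summary(labels):
    """Return per-cluster counts and the largest/smallest size ratio."""
    counts = Counter(labels)
    sizes = sorted(counts.values())
    # A ratio near 1 means evenly sized clusters; a large ratio means uneven.
    imbalance = sizes[-1] / sizes[0]
    return dict(counts), imbalance


# Hypothetical cluster labels, invented for illustration only.
uneven_labels = [0] * 70 + [1] * 20 + [2] * 10
even_labels = [0] * 34 + [1] * 33 + [2] * 33

uneven_counts, uneven_ratio = cluster_size_summary(uneven_labels)
even_counts, even_ratio = cluster_size_summary(even_labels)

print(uneven_counts, uneven_ratio)
print(even_counts, round(even_ratio, 2))
```

Note that a cluster with no assigned points never appears in the labels, so it would not be counted here.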

Here is a picture of the clustering result with the time variable only. My apologies for the low quality of the drawing.

enter image description here

Here is another picture of the clustering result, using 3 principal components derived from the 5 variables (time and 4 categorical variables).

enter image description here

xabzakabecd
  • It is unclear in what sense you use the word "(un)evenly" here. Can you provide pictures of your results? – ttnphns Apr 08 '22 at 22:51
  • By "(un)evenly" I mean the number of data points assigned to each cluster. I have edited my question with a few modifications. I hope it makes more sense now. – xabzakabecd Apr 09 '22 at 13:02
  • I marked it as a duplicate of the question that regards the same topic. Check the link above for the answer. – Tim Apr 09 '22 at 13:41
  • @Tim My question might be different from the one you think it duplicates. My question asks why the clustering results look different when I add categorical variables, and what effect categorical variables have on K-Means clustering. – xabzakabecd Apr 09 '22 at 13:51
  • @oceanus Adding or removing any variable can change any clustering solution; nothing surprising here. Keep in mind that k-means is sensitive to scaling, and binary variables probably differ in scale from the rest. – Tim Apr 09 '22 at 13:56
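
The point about scaling can be illustrated with a small sketch. K-means assigns points by squared Euclidean distance, so the variable with the largest numeric range dominates. The three customers below (stay time in seconds plus one binary flag) and the helper names are hypothetical, chosen only to show the effect; min-max scaling is one common choice assumed here, not necessarily what was used in the question.

```python
def sq_dist(a, b):
    """Squared Euclidean distance, the quantity k-means minimizes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


# Hypothetical customers: (stay time in seconds, binary category flag).
customers = [(30.0, 0.0), (35.0, 1.0), (600.0, 0.0)]

# Raw data: time dominates, the flag barely matters.
raw_ab = sq_dist(customers[0], customers[1])  # 5**2 + 1**2
raw_ac = sq_dist(customers[0], customers[2])  # 570**2 + 0

# After scaling: a difference in the flag costs as much as the full time range.
scaled = min_max_scale(customers)
scaled_ab = sq_dist(scaled[0], scaled[1])
scaled_ac = sq_dist(scaled[0], scaled[2])

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```

On the raw values, the first customer is far closer to the second than to the third. After scaling, a mismatch on a single binary flag outweighs almost the entire range of stay times, so a clustering on scaled data with several binary variables may group customers mostly by category combination rather than by time.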
