First of all, the data I'm using can be found here.
My code is:
library(readxl)
prouni <- read_excel("C:/Users/e270860661/Downloads/prouni.xlsx",
  sheet = "Formatados_para_rodar")
View(prouni)
str(prouni)
coercivo <- data.frame(prouni$mensalidade, prouni$Medicina,
  prouni$nota_integral_ampla)
summary(coercivo)
coercivo$dropador <- complete.cases(coercivo)
summary(coercivo$dropador)
final <- coercivo[coercivo$dropador == TRUE,]
final$dropador <- NULL
set.seed(100)
analise <- kmeans(final, centers = 4)
str(analise)
plot(final, col = analise$cluster)
For context, "mensalidade" means "tuition", "Medicina" is a dummy variable (1 if the program is Medicine, 0 otherwise), and "nota_integral_ampla" is the equivalent of the SAT score required for admission to the program.
My problem is that the clustering doesn't seem to be working in a multivariate way. The algorithm seems to have picked tuition thresholds and classified observations using only those thresholds. Is my intuition right, or is kmeans supposed to do this? Is there a coding error?
I'm an economist by training so this is all very new to me, sorry if it's a poorly elaborated question.
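A possible reading of the pattern described above: kmeans minimises squared Euclidean distance, so a column measured in thousands of reais can swamp a 0/1 dummy and a score column. The sketch below is only an illustration on made-up data. The column names echo the question, but the values, the group size `n`, and the `truth` vector are assumptions, not the prouni data.

```r
# Synthetic data: values are invented, only the column names echo the question.
set.seed(100)
n <- 100                                  # observations per group (assumed)
truth <- rep(c(0, 1), each = n)           # hypothetical "true" grouping
toy <- data.frame(
  mensalidade = rnorm(2 * n, mean = 3000, sd = 1000),  # unrelated to truth
  Medicina = truth,
  nota_integral_ampla = rnorm(2 * n, mean = ifelse(truth == 1, 750, 550), sd = 20)
)

# Column variances: the largest one dominates squared Euclidean distance.
print(sapply(toy, var))

# k-means on raw columns versus standardised columns (mean 0, sd 1).
raw_fit <- kmeans(toy, centers = 2, nstart = 25)
scaled_fit <- kmeans(scale(toy), centers = 2, nstart = 25)

# Cross-tabulate each clustering against the hypothetical grouping.
print(table(truth, raw = raw_fit$cluster))
print(table(truth, scaled = scaled_fit$cluster))
```

On the question's own data frame, the analogous change would be `kmeans(scale(final), centers = 4)`, assuming all three columns of `final` are numeric.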

Is changing mensalidade to standard deviations from mean mensalidade something to consider as far as clustering techniques go? I mean, it does make sense to the economist in me. I'm more interested in relative tuition costs (thus, prices), not in the exact amount of Brazilian reais someone spends on a given program.
Anyway, I'll check dbscan out. Thanks for the tips!
– Pedro Cavalcante Jul 17 '18 at 03:28
I've run this code:
And this is the plot. I know some topology, so $\epsilon$-radius neighborhoods are not new to me, but interpreting this plot surely is. Also, what are PC1 and PC2?
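Axes labelled PC1 and PC2 conventionally denote the first two principal components, i.e. the two orthogonal directions that carry the most variance in the (usually standardised) data. Which function produced the plot in this thread is not shown, so the following is only a sketch of how such a projection can be built with base R's `prcomp`, on invented data whose column names mirror the question.

```r
# Invented data; only the column names mirror the question.
set.seed(100)
toy <- data.frame(
  mensalidade = rnorm(200, mean = 3000, sd = 1000),
  nota_integral_ampla = rnorm(200, mean = 650, sd = 60),
  Medicina = rbinom(200, size = 1, prob = 0.2)
)

# Principal components of the centred and standardised columns.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Loadings: how each original variable contributes to each component.
print(pca$rotation)

# Share of the total variance carried by each component.
print(summary(pca)$importance["Proportion of Variance", ])

# Scores on the first two components: the coordinates a PC1/PC2 plot displays.
scores <- pca$x[, 1:2]
clusters <- kmeans(scale(toy), centers = 3, nstart = 25)$cluster
plot(scores, col = clusters, xlab = "PC1", ylab = "PC2")
```

Each point keeps its cluster label; only its position is a two-dimensional summary of the three original variables, so nearby points in the plot are similar along the directions of greatest variance.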
– Pedro Cavalcante Jul 17 '18 at 08:16
That's why we normalise our sample almost by convention: the most common distance (i.e. Euclidean) usually gives coherent similarity scores for points coming from a normalised sample. That is not always the case, which is why there are many different distance metrics. See, for example, the "Gower distance" (Gower 1971, Biometrics), which works natively with both categorical and continuous data.
– usεr11852 Jul 17 '18 at 18:50
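As a rough illustration of the Gower distance mentioned above, the sketch below uses `daisy()` from the `cluster` package (a recommended package normally shipped with R; its availability is assumed here). The data are invented. Because k-means needs coordinates rather than a dissimilarity matrix, the example swaps in `pam()` (partitioning around medoids) instead of kmeans.

```r
library(cluster)  # assumed to be installed; provides daisy() and pam()

# Invented data; only the column names mirror the question.
set.seed(100)
toy <- data.frame(
  mensalidade = rnorm(200, mean = 3000, sd = 1000),
  nota_integral_ampla = rnorm(200, mean = 650, sd = 60),
  Medicina = factor(rbinom(200, size = 1, prob = 0.2))  # categorical, not numeric
)

# Gower dissimilarity: numeric columns are scaled by their range, factors use
# simple matching (0 if equal, 1 otherwise), and the contributions are averaged.
d <- daisy(toy, metric = "gower")

# Medoid-based clustering on the dissimilarity object (not k-means).
fit <- pam(d, k = 3)
print(table(cluster = fit$clustering, Medicina = toy$Medicina))
```

The number of clusters `k = 3` is arbitrary here; it is not a recommendation for the prouni data.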