4

If you are running K-medians, and your distance metric is the L1 norm, how do you derive that the center of each centroid is the median of the data points assigned to it?

Second, how do you compute the geometric median?

Third, are there any implementations of k-medians algorithm?

lars
  • 297
  • 3
  • 10
  • check out the pam and clara algorithms in the R cluster package. There is no definition for a multidimensional median, these 2 algorithms are two implementations that are probably closest to what you want. – bdeonovic Mar 12 '14 at 01:15
  • @Benjamin. Change median to geometric median. You mean to tell me that no one has ever implemented K-medians algorithm? – lars Mar 12 '14 at 01:21
  • I guess I misspoke, the package flexclust in R implements k-medians http://cran.r-project.org/web/packages/flexclust/index.html – bdeonovic Mar 12 '14 at 12:27
  • Also on CRAN: Gmedian has k-Gmedian clustering (Gmedian is geometric median) – kjetil b halvorsen Apr 02 '17 at 19:47

2 Answers2

3

The definition of the geometric median is that of the $L_1$ optimum.

There seem to be two common approximations in use:

  • component-wise medians, optimizing each dimension independently
  • medoids, taking only the data samples into account

It's not clear to me why the component-wise median is not the same as the geometric median.

  • 1
    It is. Could you elaborate on the difference between medoids and component-wise medians? – lars Mar 18 '14 at 15:17
  • 1
    Read the description of the "partitioning around medoids" algortihm. It's not based on the median (which might not be part of your data), but it's the most central of the cluster elements. – Has QUIT--Anony-Mousse Mar 19 '14 at 02:09
2

How do you compute the geometric median? By solving an optimization problem, it would be very optimistic to expect some closed formula.

Below some R code, solving it by hand, using the optim function:

library(Matrix)  
Norm  <-  function(X, x) {
    X  <-  sweep(X, 1, x)
    n  <-  NROW(X)
    sum  <-  0
    for (i in 1:n)  sum  <-  sum + Matrix::norm(X[i, , 
                               drop=FALSE], "F") 
    sum
}

set.seed(7*11*13)
X  <-  matrix(rnorm(10000, 5, 3), 1000, 10)
x  <-  rep(4, 10)

Norm(X, x)
[1] 9791.955

Then some code for finding the geometric median using optim, using the componentwise median as initialization:

> optim(apply(X, 2, FUN=median), function(x) Norm(X, x), method="BFGS")
$par
 [1] 5.063750 5.036455 5.075232 5.047033 4.827953 5.052046 4.926166 5.021678
 [9] 5.052985 5.036226

$value [1] 9277.035

$counts function gradient 33 9

$convergence [1] 0

$message NULL

There is also a package on CRAN implementing the geometric median, let us try that:

> library(Gmedian) Gmedian(X, init=apply(X, 2, FUN=median)) >
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [1,] 5.015206 5.167069 5.028594 5.125111 4.911885 4.984446 5.088421 5.012249 [,9] [,10] [1,] 5.13452 4.954543

This is certainly faster, but the results show some differences ...