9

I have some points that I need to divide into groups based on clusters of points:

x <- c(-1.482156, -1.482318, -1.482129, -1.482880, -1.485735, -1.485770, -1.485913, -1.484275, -1.485866)
y <- c(54.90083, 54.90078, 54.90077, 54.90011, 54.89936, 54.89935, 54.89935, 54.89879, 54.89902)

library(sp)
xy <- SpatialPoints(matrix(c(x,y), ncol=2))
proj4string(xy) <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84")
xyTransformed <- spTransform(xy, CRS("+init=epsg:27700 +datum=WGS84"))

plot(xyTransformed)

enter image description here

Each group should be defined using this rule: all points in the group should be within 40m of each other.

Based on the points in code above, I believe I should end up with 3-4 groups.

Does anybody know how to do this in R?

PolyGeo
  • 65,136
  • 29
  • 109
  • 338
luciano
  • 1,157
  • 1
  • 12
  • 21
  • There are 45 posts related to clustering points in R. I am sure you will find many solutions by searching our site for them. – whuber Jun 25 '13 at 18:39
  • Thanks Whuber. Indeed I searched the site before I posted. i didnt look at every post, but of the post I did check, I couldn't find one that contained reproducible R code – luciano Jun 25 '13 at 19:01

1 Answers1

12

You can use a hierarchical clustering approach. By applying hclust and cutree you can derive clusters that are within a specified distance. Another way is to use the spdep package and calculate a distance matrix using dnearneigh. If you would like code for the dnearneigh approach let me know and I will post it.

require(sp)
require(rgdal)
d=40  # Distance threshold         

# Create example data and transform into projected coordinate system
x <- c(-1.482156, -1.482318, -1.482129, -1.482880, -1.485735, -1.485770, -1.485913, -1.484275, -1.485866)
y <- c(54.90083, 54.90078, 54.90077, 54.90011, 54.89936, 54.89935, 54.89935, 54.89879, 54.89902)

xy <- SpatialPointsDataFrame(matrix(c(x,y), ncol=2), data.frame(ID=seq(1:length(x))),
                               proj4string=CRS("+proj=longlat +ellps=WGS84 +datum=WGS84"))

xy <- spTransform(xy, CRS("+init=epsg:27700 +datum=WGS84"))

chc <- hclust(dist(data.frame(rownames=rownames(xy@data), x=coordinates(xy)[,1],
              y=coordinates(xy)[,2])), method="complete")

# Distance with a 40m threshold  
chc.d40 <- cutree(chc, h=d) 

# Join results to meuse sp points
xy@data <- data.frame(xy@data, Clust=chc.d40)

# Plot results
plot(xy, col=factor(xy@data$Clust), pch=19)
     box(col="black")
       title(main="Clustering")
        legend("topleft", legend=paste("Cluster", 1:4,sep=""),
                   col=palette()[1:4], pch=rep(19,4), bg="white")
Jeffrey Evans
  • 31,750
  • 2
  • 47
  • 94
  • does h=200 in the cutree function definitely refer to 200m? – luciano Jun 25 '13 at 19:03
  • yes, as stated in the cutree help "h is a numeric scalar or vector with heights where the tree should be cut". – Jeffrey Evans Jun 25 '13 at 19:06
  • if I replace coordinates(meuse) with x and y, as defined in my opening post, I need to use h=0.0009 to get 4 groups. So that is a 0.09cm distance threshold. So it doesn't like h=200 is in metres – luciano Jun 25 '13 at 19:29
  • I made my example specific to your data. Your transform was not working. Even though you can call spTransform from "sp" it is actually an "rgdal" function. For some reason "sp" is not adding it as a dependence so you have to explicitly add the "rgdal" package. The above code works with the 40m threshold and returns 4 clusters at this distance. – Jeffrey Evans Jun 25 '13 at 21:45
  • @JeffreyEvans: If you don't know how many clusters develop based on your chosen d value - will this code still group and colour them accordingly? I ask because I see col(palette()[1:4]. – val Apr 10 '18 at 20:44
  • 1
    @val, look at the object resulting from cutree (chc.d40), there are 4 unique values. I am not telling hclust to make 4 clusters, this is just the result when I apply a 40m distance constraint using cutree. If you want an explicit number of clusters with a distance condition then you have to implement a different approach such as kNN or a rule based model using a distance matrix. – Jeffrey Evans Apr 10 '18 at 20:55
  • @JeffreyEvans: thanks - i see i can just do max(chc.d40) to get the number of clusters - in this specific case 4. – val Apr 10 '18 at 21:16
  • 1
    @val Yes, but it would be safer to use something like: length(unique(x)) to return number of clusters. – Jeffrey Evans Apr 10 '18 at 21:57
  • @JeffreyEvans interesting approach. However when I try to plot I get this error Error in as.double(y) : cannot coerce type 'S4' to vector of type 'double' – Herman Toothrot Dec 31 '19 at 16:41
  • @JeffreyEvans Not working for me. My points are in +proj=longlat +ellps=WGS84 +datum=WGS84, and if I try for example d=40 the clustering doesn't make sense, some points that are closer to each other are assigned in separate clusters compared to two points that are farther away. – Herman Toothrot Dec 31 '19 at 17:16
  • 1
    @HermanToothrot this is because d represents a distance in meters and your distance units would be in decimal degrees. You can either project your data to a distance based projection or figure out the appropriate decimal degree value representing your desired distance. – Jeffrey Evans Dec 31 '19 at 19:08
  • @JeffreyEvans I have a similar problem to HermanToothrot. I projected my SpatialPointsDataFrame in the same way you did, using a metric projection (e.g. first coordinates value [183872.5353 -773835.2]). However the final plot shows that some points that are closer to each other are assigned in separate clusters compared to two points that are farther away. – FAmorim Jul 03 '21 at 07:12