I have a number of very large rasters that need to be randomly sampled, with the return value being a matrix of x, y, and value. The raster package's sampleRandom(raster, n, na.rm=TRUE, xy=TRUE) will do this just fine most of the time. When working correctly, this function returns a matrix of non-NA values for n coordinate pairs. When NA values come up in the sample, they are dropped and replaced by non-NA values.
However, for my rasters (the smallest being 4e7 cells, some with a high percentage of NA values), sampleRandom() returns a matrix substantially smaller than n coordinate pairs. Presumably this is due to sampled NA values not being replaced after they are dropped.
Why does sampleRandom() return incomplete results on the real-world data example?
As @Radar correctly pointed out, the documentation for the raster package states:
With argument na.rm=TRUE, the returned sample may be smaller than requested
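A small self-contained sketch (my own construction, not from the original question) shows how the returned sample can fall short of n when most cells are NA; the exact shortfall depends on the raster version and the NA ratio, so I make no claim about the precise count returned:

```r
library(raster)

set.seed(42)
r <- raster(ncol = 100, nrow = 100)      # 10,000 cells
r[] <- NA                                # start fully NA
r[sample(ncell(r), 500)] <- runif(500)   # only 5% of cells carry values

# With na.rm=TRUE the result may have fewer than the requested 400 rows,
# because sampled NA cells are dropped and not always replaced.
s <- sampleRandom(r, 400, na.rm = TRUE, xy = TRUE)
nrow(s)  # may be < 400 on a sparse raster
```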
With this, my question becomes: how do I work around this and efficiently draw a random sample of n coordinate pairs?
Example 1: This works correctly, retrieving a random sample of n from a larger raster that is cropped and masked by spatial polygons. It returns a matrix of 2000 coordinate pairs.
require(sp)
require(raster)
region1 <- rbind(c(0,0), c(50,0), c(50,50), c(20,20), c(0,0))
region2 <- rbind(c(50,0), c(80,0), c(100,50), c(60,40), c(80,20), c(50,0))
polys <- SpatialPolygons(list(Polygons(list(Polygon(region1)), "region1"),
Polygons(list(Polygon(region2)), "region2")))
r <- raster(ncol=1000, nrow=1000)
r[] <- runif(ncell(r),0,1)
extent(r) <- matrix(c(0, 0, 1000, 1000), nrow=2)
r_crop <- crop(r, extent(polys), snap="out", progress='text')
r_mask <- mask(r_crop, polys)
plot(r_mask)
plot(polys, add=TRUE)
x <- sampleRandom(r_mask,2000, na.rm=TRUE, xy=TRUE)
nrow(x)
>[1] 2000

Example 2: The next example uses real data: a universal raster (geo.r) of 2e8 cells and a subset of spatial polygons (geo.poly) containing 1200 polygons, with a smaller extent than geo.r. This code incorrectly returns a matrix of far fewer than n rows, depending on the random sample; a few runs produced between 3 and 117 non-NA coordinate pairs.
require(maptools)
require(raster)
Prj <- "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"
modeling_areas_SHP <- "C:/.../modeling_areas_dissolve.shp"
geo.polys <- readShapePoly(modeling_areas_SHP, IDvar="area_ID", proj4string=CRS(Prj))
geo.poly <- geo.polys[geo.polys$area_ID == i,] # subset the shapefile (i = the area_ID of interest)
geo.r <- raster("C:/.../cost_raster")
geo.r_crop <- crop(geo.r, extent(geo.poly), snap="out", progress='text')
geo.r_mask <- mask(geo.r_crop, geo.poly, progress='text')
plot(geo.r_mask)
plot(geo.poly, add=TRUE)
x <- sampleRandom(geo.r_mask,2000, na.rm=TRUE, xy=TRUE)
nrow(x)
>[1] 117

To me at least, the above examples are identical except for the overall size of the rasters and the complexity of the polygons; two very important factors. Obviously I cannot provide the real-world data because of the file sizes, but I hope the code will suffice.
How do I fix this?
I used this workaround hack, but it is not very efficient. It was, however, more efficient than using spsample() from the sp package.
micro_sample <- 50000
tmp_rand_smple <- data.frame(x = numeric(0), y = numeric(0), layer = numeric(0))
while(nrow(tmp_rand_smple) < micro_sample){
  # 10k is an arbitrary chunk size; loop until at least micro_sample rows accumulate
  tmp_smple <- data.frame(sampleRandom(geo.r_mask, 10000, na.rm=TRUE, xy=TRUE))
  tmp_rand_smple <- rbind(tmp_rand_smple, tmp_smple)
  tmp_rand_smple <- unique(tmp_rand_smple[c("x", "y", "layer")]) # drop duplicate coordinate pairs
}
tmp_rand_smple <- tmp_rand_smple[1:micro_sample,] # trim to exactly micro_sample rows
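An alternative I would try (a sketch, not tested on the real data): enumerate the non-NA cell numbers once with Which(), then sample the cell indices directly without replacement, avoiding both the loop and the duplicate removal. This assumes the vector of non-NA cell numbers fits in memory. The synthetic raster below is a hypothetical stand-in for geo.r_mask:

```r
library(raster)

# Synthetic stand-in for geo.r_mask: a mostly-NA raster
set.seed(1)
geo.r_mask <- raster(ncol = 500, nrow = 500)
geo.r_mask[] <- NA
geo.r_mask[sample(ncell(geo.r_mask), 5000)] <- runif(5000)

micro_sample <- 2000

# Enumerate non-NA cell numbers once, then sample indices directly
non_na_cells <- Which(!is.na(geo.r_mask), cells = TRUE)
n <- min(micro_sample, length(non_na_cells))   # guard against masks with too few cells
idx <- sample(non_na_cells, n)                 # without replacement: no duplicates
rand_smple <- data.frame(xyFromCell(geo.r_mask, idx), layer = geo.r_mask[idx])
nrow(rand_smple)  # exactly n rows, all with non-NA values
```

Because the sample is drawn only from cells known to hold values, the result always has exactly n rows, regardless of how sparse the mask is.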
Example 3: Here is a reproducible version of the above code using the linked shapefile. On my computer this fails to return the required number of random samples: https://www.dropbox.com/s/7poaqcxju808arw/riverine_region_1.zip
require(maptools)
require(raster)
Prj <- "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"
geo.poly <- readShapePoly("FILE LOCATION:/riverine_region_1", IDvar="area_ID", proj4string=CRS(Prj)) ## set file location
r <- raster(ncol=5202, nrow=8182)
r[] <- runif(ncell(r),0,1)
extent(r) <- matrix(c( 1533500, 592219.7, 1447689, 537662.6), nrow=2)
r_crop <- crop(r, extent(geo.poly), snap="out", progress='text')
r_mask <- mask(r_crop, geo.poly, progress='text')
plot(r_mask)
plot(geo.poly, add=TRUE)
x <- sampleRandom(r_mask,2000, na.rm=TRUE, xy=TRUE)
nrow(x)
Comments:

sampleRandom(..., 2000, na.rm=TRUE, xy=TRUE) produces a 2000 by 3 matrix with no NA values in it. (R version 2.15.2.) – whuber Dec 19 '13 at 16:20

sampleRandom() has a raster ncell() size limit, or an NA to non-NA value ratio limit, after which na.rm=TRUE only returns the list of cells with values and does not dig back in to replace the NA values it sampled. Hence it returns a 117 by 3 matrix with no NA values when sampleRandom(geo.r_mask, 2000, na.rm=TRUE, xy=TRUE) is called, geo.r_mask being a 5202 by 8182 raster (42562764 cells; nrow, ncol, ncell) with only 637506 non-NA cells. – Mr.ecos Dec 19 '13 at 16:46