I have a number of very large rasters that need to be randomly sampled, with the return value being a matrix of x, y, and value. The raster package's sampleRandom(raster, n, na.rm=TRUE, xy=TRUE) will do this just fine most of the time. When working correctly, this function returns a matrix of non-NA values for n coordinate pairs. When NA values come up in the sample, they are dropped and replaced by non-NA values.
However, for my rasters (the smallest being 4e7 cells, some with a high percentage of NA values), sampleRandom() returns a matrix substantially smaller than n coordinate pairs. Presumably this is due to sampled NA values not being replaced after they are dropped.
Why does sampleRandom() return incomplete results on the real-world data example?
As @Radar correctly pointed out, the documentation for the raster package states:
With argument na.rm=TRUE, the returned sample may be smaller than requested
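A small self-contained sketch (my own construction, not from the original question) shows how the returned sample can fall short of n when most cells are NA; the exact shortfall depends on the raster version and the NA ratio, so I make no claim about the precise count returned:

```r
library(raster)

set.seed(42)
r <- raster(ncol = 100, nrow = 100)      # 10,000 cells
r[] <- NA                                # start fully NA
r[sample(ncell(r), 500)] <- runif(500)   # only 5% of cells carry values

# With na.rm=TRUE the result may have fewer than the requested 400 rows,
# because sampled NA cells are dropped and not always replaced.
s <- sampleRandom(r, 400, na.rm = TRUE, xy = TRUE)
nrow(s)  # may be < 400 on a sparse raster
```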
With this, my question becomes: how do I work around this and efficiently draw a random sample of n coordinate pairs?
Example 1: This works correctly, retrieving a random sample of n from a larger raster that is cropped and masked by spatial polygons. It returns a matrix of 2000 coordinate pairs.
require(sp)
require(raster)
region1 <- rbind(c(0,0), c(50,0), c(50,50), c(20,20), c(0,0))
region2 <- rbind(c(50,0), c(80,0), c(100,50), c(60,40), c(80,20), c(50,0))
polys <- SpatialPolygons(list(Polygons(list(Polygon(region1)), "region1"),
Polygons(list(Polygon(region2)), "region2")))
r <- raster(ncol=1000, nrow=1000)
r[] <- runif(ncell(r),0,1)
extent(r) <- matrix(c(0, 0, 1000, 1000), nrow=2)
r_crop <- crop(r, extent(polys), snap="out", progress='text')
r_mask <- mask(r_crop, polys)
plot(r_mask)
plot(polys, add=TRUE)
x <- sampleRandom(r_mask,2000, na.rm=TRUE, xy=TRUE)
nrow(x)
>[1] 2000

Example 2: The next example uses real data: a universal raster (geo.r) of 2e8 cells and a subset of spatial polygons (geo.poly) containing 1200 polygons, with a smaller extent than geo.r. This code incorrectly returns a matrix of far fewer than n rows, depending on the random sample; a few runs produced between 3 and 117 non-NA coordinate pairs.
require(maptools)
require(raster)
Prj <- "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"
modeling_areas_SHP <- "C:/.../modeling_areas_dissolve.shp"
geo.polys <- readShapePoly(modeling_areas_SHP, IDvar="area_ID", proj4string=CRS(Prj))
geo.poly <- geo.polys[geo.polys$area_ID == i,] # subset the shapefile (i = the area_ID of interest)
geo.r <- raster("C:/.../cost_raster")
geo.r_crop <- crop(geo.r, extent(geo.poly), snap="out", progress='text')
geo.r_mask <- mask(geo.r_crop, geo.poly, progress='text')
plot(geo.r_mask)
plot(geo.poly, add=TRUE)
x <- sampleRandom(geo.r_mask,2000, na.rm=TRUE, xy=TRUE)
nrow(x)
>[1] 117

To me at least, the above examples are identical except for the overall size of the rasters and the complexity of the polygons; two very important factors. Obviously I cannot provide the real-world data because of the file sizes, but I hope the code will suffice.
How do I fix this?
I used this workaround hack, but it is not very efficient. It was, however, more efficient than using spsample() from the sp package.
micro_sample <- 50000
tmp_rand_smple <- data.frame(x = numeric(0), y = numeric(0), layer = numeric(0))
while(nrow(tmp_rand_smple) < micro_sample){
  # 10k is an arbitrary chunk size; loop until at least micro_sample rows accumulate
  tmp_smple <- data.frame(sampleRandom(geo.r_mask, 10000, na.rm=TRUE, xy=TRUE))
  tmp_rand_smple <- rbind(tmp_rand_smple, tmp_smple)
  tmp_rand_smple <- unique(tmp_rand_smple[c("x", "y", "layer")]) # drop duplicate coordinate pairs
}
tmp_rand_smple <- tmp_rand_smple[1:micro_sample,] # trim to exactly micro_sample rows
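An alternative I would try (a sketch, not tested on the real data): enumerate the non-NA cell numbers once with Which(), then sample the cell indices directly without replacement, avoiding both the loop and the duplicate removal. This assumes the vector of non-NA cell numbers fits in memory. The synthetic raster below is a hypothetical stand-in for geo.r_mask:

```r
library(raster)

# Synthetic stand-in for geo.r_mask: a mostly-NA raster
set.seed(1)
geo.r_mask <- raster(ncol = 500, nrow = 500)
geo.r_mask[] <- NA
geo.r_mask[sample(ncell(geo.r_mask), 5000)] <- runif(5000)

micro_sample <- 2000

# Enumerate non-NA cell numbers once, then sample indices directly
non_na_cells <- Which(!is.na(geo.r_mask), cells = TRUE)
n <- min(micro_sample, length(non_na_cells))   # guard against masks with too few cells
idx <- sample(non_na_cells, n)                 # without replacement: no duplicates
rand_smple <- data.frame(xyFromCell(geo.r_mask, idx), layer = geo.r_mask[idx])
nrow(rand_smple)  # exactly n rows, all with non-NA values
```

Because the sample is drawn only from cells known to hold values, the result always has exactly n rows, regardless of how sparse the mask is.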
Example 3: Here is a reproducible version of the above code using the linked shapefile. On my computer this fails to return the required number of random samples: https://www.dropbox.com/s/7poaqcxju808arw/riverine_region_1.zip
require(maptools)
require(raster)
Prj <- "+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"
geo.poly <- readShapePoly("FILE LOCATION:/riverine_region_1", IDvar="area_ID", proj4string=CRS(Prj)) ## set file location
r <- raster(ncol=5202, nrow=8182)
r[] <- runif(ncell(r),0,1)
extent(r) <- matrix(c( 1533500, 592219.7, 1447689, 537662.6), nrow=2)
r_crop <- crop(r, extent(geo.poly), snap="out", progress='text')
r_mask <- mask(r_crop, geo.poly, progress='text')
plot(r_mask)
plot(geo.poly, add=TRUE)
x <- sampleRandom(r_mask,2000, na.rm=TRUE, xy=TRUE)
nrow(x)
Comments:

sampleRandom(..., 2000, na.rm=TRUE, xy=TRUE) produces a 2000 by 3 matrix with no NA values in it. (R version 2.15.2.) – whuber Dec 19 '13 at 16:20

sampleRandom() has a raster ncell() size limit, or an NA to non-NA value ratio limit, after which na.rm=TRUE only returns the list of cells with values and does not dig back in to replace the NA values it sampled. Hence it returns a 117 by 3 matrix with no NA values when sampleRandom(geo.r_mask, 2000, na.rm=TRUE, xy=TRUE) is called, geo.r_mask being a 5202 by 8182 raster (42562764 cells; nrow, ncol, ncell) with only 637506 non-NA cells. – Mr.ecos Dec 19 '13 at 16:46