I would like to iterate over the columns of a data frame and remove every column in which more than 50% of the entries are NA. So far I have something like this, but it doesn't work:

for (i in names(df_r)) {
    if (sum(is.na(df_r[,i]))/length(df_r) > 0.5) {
        df_r <- df_r[, -i]
        }
    }

I am more of a Python guy and I am still learning R, so I might be mixing up syntax here.

Blazej Kowalski
  • just `df_r[colMeans(is.na(df_r)) < 0.5]` – Jaap Feb 27 '18 at 10:16
  • also: please see how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610); that makes it a lot easier for others to answer – Jaap Feb 27 '18 at 10:18

5 Answers

Explicit for loops over columns are rarely needed in R and are usually slower and more verbose than a vectorized alternative. In this case, you can use dplyr to keep it fast and tidy:

library(dplyr)

# keep only the columns in which at most 50% of the values are NA
df_r <- df_r %>% 
  select_if(function(x) mean(is.na(x)) <= 0.5)
clemens

You are much better off using vectorized calculations instead of a literal for loop.

na50 <- sapply(df_r, function(x) sum(is.na(x))) / nrow(df_r)
df_r[na50 > 0.5] <- NULL
# or
df_r <- df_r[na50 <= 0.5]
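
As a minimal sketch of the same idea on a made-up data frame (the name `toy` and its columns are hypothetical, not from the question), `colMeans(is.na(...))` is an alternative to the `sapply` call that gives the NA share of each column directly:

```r
# Hypothetical toy data: column b is 75% NA, column c is exactly 50% NA
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, "x"),
  c = c(NA, NA, 1, 2)
)

# is.na() on a data frame returns a logical matrix; colMeans() then
# gives the fraction of NA values in each column
na_share <- colMeans(is.na(toy))

# Keep the columns where at most half of the entries are NA
toy_clean <- toy[na_share <= 0.5]
names(toy_clean)  # "a" "c"
```

Column `c` survives because exactly 50% NA is not "bigger than 50%"; use `<` instead of `<=` if such columns should be dropped as well.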
r2evans

I would use lapply to loop over the data.frame columns:

DF <- data.frame(x = c(1, NA, 2), y = c("a", NA, NA))
DF[] <- lapply(DF, function(x) if (mean(is.na(x)) <= 0.5) x else NULL)
#   x
#1  1
#2 NA
#3  2
Roland

Check this:

## for loop solution
for(i in names(dt))
{
    len <- nrow(dt)
    if(sum(is.na(dt[[i]])) > (len/2)) dt[[i]] <- NULL
    else next
}

## non for loop solution
cols <- colSums(is.na(dt))
cols <- names(cols[cols > (nrow(dt)/2)])
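# drop the selected columns (assumes dt is a plain data.frame);
# [[ ]] would only work if exactly one column were selected
dt <- dt[setdiff(names(dt), cols)]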
YOLO

It's basically one line:

df_r <- df_r[, apply(df_r, MARGIN = 2, FUN = function(x) sum(is.na(x))/length(x) <= 0.5), drop = FALSE]

apply applies the function (specified after FUN =) to each column (specified by MARGIN = 2). The function checks whether the proportion of NAs in a column is smaller than or equal to 0.5, so apply returns a logical vector with one entry per column. This vector then selects only the columns of df_r with a small enough NA proportion.
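
To make the intermediate logical vector visible, here is a small sketch on a made-up data frame (`toy`, `keep` and `toy_clean` are hypothetical names). It passes `drop = FALSE` so that a single surviving column is still returned as a data frame rather than a bare vector:

```r
# Hypothetical toy data: column b is 75% NA, column a has no NAs
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 5)
)

# One TRUE/FALSE per column, named after the columns
keep <- apply(toy, MARGIN = 2, FUN = function(x) mean(is.na(x)) <= 0.5)

# Select the columns flagged TRUE; drop = FALSE preserves the data frame
toy_clean <- toy[, keep, drop = FALSE]
is.data.frame(toy_clean)  # TRUE
```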

kath