For loop over columns in R

Question

I would like to iterate over the columns of a data frame and remove every column in which more than 50% of the entries are NA. So far I have something like this, but it doesn't work:

for (i in names(df_r)) {
    if (sum(is.na(df_r[,i]))/length(df_r) > 0.5) {
        df_r <- df_r[, -i]
        }
    }

I am more of a Python guy and I am still learning R, so I might be mixing syntax here.

also: please see how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610); that makes it a lot easier for others to answer – Jaap, Feb 27 '18 at 10:18

score 3 · Answer 1 · answered Feb 27 '18 at 10:21

Explicit for loops in R are often slower and more verbose than vectorised alternatives, so they are best avoided where a vectorised solution exists. In this case, you can use dplyr to keep the code fast and tidy:

library(dplyr)

# Keep only the columns where at most 50% of the values are NA
df_r <- df_r %>%
  select_if(function(x) mean(is.na(x)) <= 0.5)
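
In more recent dplyr releases, `select_if()` is superseded by `select()` combined with `where()`. A minimal sketch of that form, assuming dplyr 1.0.0 or later is installed; the data frame `toy` and its columns are made up for illustration and are not from the question:

```r
library(dplyr)

# Hypothetical toy data: column b is 75% NA, column c is exactly 50% NA
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, "x"),
  c = c(NA, NA, 1, 2)
)

# where() applies the predicate to each column and keeps those returning TRUE
toy_kept <- toy %>%
  select(where(function(x) mean(is.na(x)) <= 0.5))

names(toy_kept)
```

Here `mean(is.na(x))` is the fraction of NA values in a column, so the threshold can be read directly from the predicate.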

answered Feb 27 '18 at 10:21

clemens

r2evans · Accepted Answer · 2018-02-27T10:32:17.193

2

You are much better off using vectorised calculations rather than a literal for loop.

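# Fraction of NA values in each column
na50 <- sapply(df_r, function(x) sum(is.na(x))) / nrow(df_r)

# Either drop the columns that are more than 50% NA in place ...
df_r[na50 > 0.5] <- NULL
# ... or, as an alternative (do not run both), keep the columns that are at most 50% NA
df_r <- df_r[na50 <= 0.5]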
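
To see the idea on concrete values, here is a small self-contained sketch. The data frame `toy` is hypothetical, and `colMeans(is.na(...))` is swapped in as a shorter way to get the same per-column NA fraction:

```r
# Hypothetical toy data: column b is 75% NA, column c is exactly 50% NA
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, "x"),
  c = c(NA, NA, 1, 2)
)

# is.na() on a data frame returns a logical matrix; colMeans() then
# gives the fraction of NA values per column
na_frac <- colMeans(is.na(toy))

# Keep only the columns whose NA fraction is at most 50%
toy_kept <- toy[na_frac <= 0.5]
names(toy_kept)
```

Column `c` sits exactly on the threshold and is kept, because the question asks to remove only columns with more than 50% NAs.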

edited Feb 27 '18 at 10:32

answered Feb 27 '18 at 10:14

r2evans

Hmm i modified your solution to: na 0.5) df_r2 – Blazej Kowalski Feb 27 '18 at 10:40

score 2 · Answer 3 · answered Feb 27 '18 at 10:17

I would use lapply to loop over the data.frame columns:

DF <- data.frame(x = c(1, NA, 2), y = c("a", NA, NA))
DF[] <- lapply(DF, function(x) if (mean(is.na(x)) <= 0.5) x else NULL)
#   x
#1  1
#2 NA
#3  2
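
A closely related base R option is `Filter()`, which keeps the elements of a list (here, the columns of the data frame) for which a predicate returns TRUE. This is a sketch of an alternative, not the approach used in this answer; `DF_kept` is a name introduced only for illustration:

```r
# Same example data as above
DF <- data.frame(x = c(1, NA, 2), y = c("a", NA, NA))

# Keep the columns whose NA fraction is at most 50%
DF_kept <- Filter(function(col) mean(is.na(col)) <= 0.5, DF)
names(DF_kept)
```

Because `Filter()` subsets rather than assigns NULL, the original `DF` is left untouched.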

answered Feb 27 '18 at 10:17

Roland

score 0 · Answer 4 · answered Feb 27 '18 at 10:14

Check this:

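## for loop solution
half <- nrow(dt) / 2
for (i in names(dt)) {
    # drop the column if more than half of its values are NA
    if (sum(is.na(dt[[i]])) > half) dt[[i]] <- NULL
}

## non for loop solution
na_counts <- colSums(is.na(dt))
cols <- names(na_counts[na_counts > nrow(dt) / 2])
# single brackets are needed here: `[[` can only address one column at a time
dt[cols] <- NULL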

score 0 · Answer 5 · answered Feb 27 '18 at 10:18

It's basically one line:

df_r <- df_r[, apply(df_r, MARGIN = 2, FUN = function(x) sum(is.na(x))/length(x) <= 0.5)]

apply applies the function (specified after FUN =) to each column (specified by MARGIN = 2). The function checks whether the proportion of NAs in a column is smaller than or equal to 0.5, so apply returns a logical vector with one element per column. This vector then selects only the columns of df_r that have a sufficiently small NA proportion.
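
Two caveats may be worth keeping in mind with this one-liner: `apply()` first converts the data frame to a matrix, and `df[, cols]` returns a plain vector when only one column is selected. A hedged sketch of a variant that avoids both, using `vapply()` and `drop = FALSE` on made-up data (`toy` is hypothetical):

```r
# Hypothetical data where only one column survives the filter
toy <- data.frame(a = c(1, 2, NA), b = c(NA, NA, "z"))

# vapply() visits each column as it is (no matrix conversion) and
# guarantees exactly one logical value per column
keep <- vapply(toy, function(col) mean(is.na(col)) <= 0.5, logical(1))

# drop = FALSE keeps the result a data frame even with a single column
toy_kept <- toy[, keep, drop = FALSE]
```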
