0

I have a fairly large data frame that has multiple "-" which represent missing data. The data frame consisted of multiple Excel files, which could not use the "na.strings =" or alternative function, so I had to import them with the "-" representation.

How can I replace all "-" in the data frame with NA / missing values? The data frame consists of 200 columns of characters, factors, and integers.

So far I have tried:

sum(df %in c("-"))
returns: [1] 0

df[df=="-"] <-NA #does not do anything

library(plyr)
df <- revalue(df, c("-",NA))
returns: Error in revalue(tmp, c("-", NA)) : 
  x is not a factor or a character vector.

library(anchors)
df <- replace.value(df,colnames(df),"-",as.character(NA))
Error in charToDate(x) : 
  character string is not in a standard unambiguous format

The data frame consists of 200 columns of characters, factors, and integers, so I can see why the last two do not work correctly. Any help would be appreciated.

Starbucks
  • 1,335
  • 2
  • 18
  • 35
  • 2
    Can you post a subset of your data as a sample so that this is reproducible on your data? Thanks :) – mysteRious Mar 23 '18 at 20:36
  • How many rows does your data have? – bala83 Mar 23 '18 at 20:41
  • ≈500,000 rows.. – Starbucks Mar 23 '18 at 20:42
  • I think your best option would be to clean up the data row-wise and writing it out to a file. Then you can read your cleaned up file :) – bala83 Mar 23 '18 at 20:43
  • I've tried that. When I re-import the data (the cleaned version with NA's), it turns all cleaned columns in to logical classes... :-/ – Starbucks Mar 23 '18 at 20:44
  • If an entry has a "-" in it then it would have been read as a factor variable by default. You will (probably) need to learn how to convert a factor to a character and then use `is.na – IRTFM Mar 23 '18 at 20:45
  • Oh there is an option when you're reading data using `read.table()` or `scan` to prevent co-ercing by explicitly specifying colClasses and na.strings. Did you try with these options set ? – bala83 Mar 23 '18 at 20:48

3 Answers3

9

Since you're already using tidyverse functions, you can easily use na_if from dplyr within your pipes.

For example, I have a dataset where 999 is used to fill in a non-answer:

df <- tibble(
    alpha = c("a", "b", "c", "d", "e"), 
    val1 = c(1, 999, 3, 8, 999), 
    val2 = c(2, 8, 999, 1, 2))

If I wanted to change val1 so 999 is NA, I could do:

df %>% 
    mutate(val1 = na_if(val1, 999))

In your case, it sounds like you want to replace a value across multiple variables, so using mutate_at or mutate_if would be more appropriate:

df %>%
    mutate_at(vars(val1, val2), na_if, 999)

replaces all instances of 999 in both val1 and val2 with NA and now looks like this:

# A tibble: 5 x 3
  alpha  val1  val2
  <chr> <dbl> <dbl>
1 a        1.    2.
2 b       NA     8.
3 c        3.   NA 
4 d        8.    1.
5 e       NA     2.
camille
  • 15,634
  • 17
  • 33
  • 53
4

I believe the simplest solution is with base R function is.na<-. It's meant to solve precisely that issue.

First, make up some data. Then set the required values to NA.

set.seed(247)    # make the results reproducible

df <- data.frame(X = 1:10, Y = sample(c("-", letters[1:2]), 10, TRUE))

is.na(df) <- df == "-"
df
#    X    Y
#1   1    a
#2   2    b
#3   3    b
#4   4    a
#5   5 <NA>
#6   6    b
#7   7    a
#8   8 <NA>
#9   9    b
#10 10    a
Rui Barradas
  • 57,195
  • 8
  • 29
  • 57
1

Here's a solution that will do it:

> library(dplyr)
> test <- tibble(x = c('100', '20.56', '0.003', '-', '  -'),  y = 5:1)
> makeNA <- function(x) str_replace(x,'-',NA_character_)
> mutate_all(test, funs(makeNA))
# A tibble: 5 x 2
  x     y    
  <chr> <chr>
1 100   5    
2 20.56 4    
3 0.003 3    
4 NA    2    
5 NA    1  
mysteRious
  • 3,840
  • 2
  • 13
  • 29