
I have a list of data frames and I am trying to use lapply to remove anomalies from my data. I want the code to be as robust as possible, because the input data will be different every time.

I am trying to use:

newdata <- lapply(ChaseSubSet, function(){
  anomalies <- 0.02 > ChaseSubSet[,1] > 0.03
  anomalies = na
})

However, a) this doesn't work, and b) I'm thinking it would be more robust to remove values that are more than 0.1 away from the mean. I would need to apply different rules to each column, but have them applied across all the data frames in the list. I want to use lapply so that the result is still a list.

My data is as follows:

[image: data]

I would like to go through all 13 data frames in the list, which all look like this image. If there are anomalous values, I would like them to be replaced with NA; my thinking is that this will cause the fewest issues later on with columns of different lengths.

I am still very new so I apologise if any of this is incorrect.

Celso Wellington

1 Answer


If the list of data frames is ChaseSubSet, define the function no_anomalies first and then run the lapply call below it. Note the argument offset: it sets how far from the mean of each vector a value may be before it is treated as an anomaly (outlier?) and replaced with NA. The default is 0.1.

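no_anomalies <- function(x, offset = 0.1, na.rm = TRUE){
  # mean of the vector, ignoring existing NA values by default
  x.bar <- mean(x, na.rm = na.rm)
  # TRUE for values more than `offset` away from the mean
  away <- x < (x.bar - offset) | x > (x.bar + offset)
  # set only those positions to NA; the other values are kept
  is.na(x) <- which(away)
  x
}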

newdata <- lapply(ChaseSubSet, function(DF){
  is_num <- sapply(DF, is.numeric)
  DF[is_num] <- lapply(DF[is_num], no_anomalies, offset = 0.1)
  DF
})
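
To check what the code above does to individual values, here is a small self-contained sketch. The list toy_list and its column names id and value are made up for illustration; they only stand in for ChaseSubSet, whose real structure is not shown in the question.

```r
# Definition repeated from above so the example runs on its own
no_anomalies <- function(x, offset = 0.1, na.rm = TRUE){
  x.bar <- mean(x, na.rm = na.rm)
  away <- x < (x.bar - offset) | x > (x.bar + offset)
  is.na(x) <- which(away)
  x
}

# Hypothetical stand-in for ChaseSubSet: a list of two small data frames
toy_list <- list(
  a = data.frame(id = paste0("s", 1:6),
                 value = c(0.50, 0.52, 0.48, 0.51, 0.49, 0.95)),
  b = data.frame(id = paste0("t", 1:3),
                 value = c(1.00, 1.02, 0.98))
)

cleaned <- lapply(toy_list, function(DF){
  is_num <- sapply(DF, is.numeric)   # only numeric columns are touched
  DF[is_num] <- lapply(DF[is_num], no_anomalies, offset = 0.1)
  DF
})

# The mean of a$value is 0.575, so only 0.95 lies more than 0.1 away
cleaned$a$value   # 0.50 0.52 0.48 0.51 0.49 NA
cleaned$b$value   # 1.00 1.02 0.98 (nothing replaced)
```

Only the single offending value becomes NA and the column keeps its length. A column can still end up mostly NA if most of its values lie more than offset away from the mean, which can happen when the spread of the data is wide compared with 0.1.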
Rui Barradas
  • Thank you @RuiBarradas this is very helpful. Just two questions if you have time: – Natasha Jones Feb 14 '20 at 23:31
  • 1) The first column is still treated as non-numeric despite this line in your code, as its values are in E notation – Natasha Jones Feb 14 '20 at 23:33
  • 2) This code is deleting an entire column if one of the numbers is off instead of deleting just that number – Natasha Jones Feb 14 '20 at 23:34
  • 1) Are you sure that column is numeric? What does `class(df$thatcol)` return? See [this SO post](https://stackoverflow.com/questions/44725001/convert-scientific-notation-to-numeric-preserving-decimals) and [this other SO post](https://stackoverflow.com/questions/40655701/scientific-notation-wrongly-converts-to-numbers). – Rui Barradas Feb 15 '20 at 06:23
  • 2) Maybe you should use another criterion to decide what *is off*. Outlier removal is not easy because the concept of an outlier is very, very vague. There is a package [outliers](https://CRAN.R-project.org/package=outliers) that tries to identify outliers. Maybe you should take a look and definitely rethink what you are trying to do. – Rui Barradas Feb 15 '20 at 06:27
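
A minimal sketch of the class check suggested in the first reply, assuming the first column was read in as character strings in E notation. That is only a guess at the cause, and the data frame df and its column names conc and value are hypothetical.

```r
# Hypothetical data frame whose first column was read in as text
df <- data.frame(conc = c("1.5E-02", "2.0E-02", "2.5E-02"),
                 value = c(0.50, 0.52, 0.48),
                 stringsAsFactors = FALSE)

class(df$conc)        # "character"
is.numeric(df$conc)   # FALSE, so the lapply in the answer skips this column

# as.numeric() parses E notation, so the column can be converted directly
df$conc <- as.numeric(df$conc)
class(df$conc)        # "numeric"
```

If class() returns "factor" instead, converting with as.numeric(as.character(x)) avoids getting the internal integer codes of the factor.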