How to get rid of anomalies using lapply in R

Question

I have a list of data frames and I am trying to use lapply to get rid of anomalies in my data. I want the code to be as robust as possible, because the input data will be different every time.

I am trying to use:

newdata <- lapply(ChaseSubSet, function(){
  anomalies <- 0.02 > ChaseSubSet[,1] > 0.03
  anomalies = na
})

However a) this doesn't work and b) I'm thinking it would be more robust to get rid of values more than 0.1 away from the mean. I would have to apply different rules to each column of the data but have it apply through all the data.frames in the list. I want to use lapply to result in a list at the end.

My data is as follows:

I would like to sort through all 13 data frames in the list, which all look like this image. If there are anomalous values, I would like each of those values to be replaced with NA; my thinking is that this will cause the fewest issues later on with columns of different lengths.

I am still very new so I apologise if any of this is incorrect.

lapply(ChaseSubSet, function(x){subset(x,x[,1] - mean(x[,1]) < 0.1)}) — StupidWolf, Feb 14 '20 at 22:07
something like this? Hey not very sure what ChaseSubSet looks like, and also how you wanna filter — StupidWolf, Feb 14 '20 at 22:07
It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. — MrFlick, Feb 14 '20 at 22:08
Thank you @StupidWolf, can I ask a question: which part of that code gets rid of the values that meet those requirements? Is that what the subset function does? Will it replace them with NA? — Natasha Jones, Feb 14 '20 at 22:35
The code I have above, doesn't replace them with NAs. Can you provide more details, like what is the data and which columns do you want to replace with NAs? — StupidWolf, Feb 14 '20 at 22:37
Is `ChaseSubSet` a list of dataframes or one of the dataframes in that list? — Rui Barradas, Feb 14 '20 at 22:56
@StupidWolf I updated the question with some more information you asked for — Natasha Jones, Feb 14 '20 at 23:07
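
To illustrate the distinction raised in the comments above, here is a minimal sketch. The data frame `df` and its values are made up for illustration and are not the asker's data. It also uses `abs()` so that values far below the mean are caught as well, which the one-liner in the comment does not do.

```r
# Hypothetical toy data, not taken from the question
df <- data.frame(value = c(0.50, 0.52, 0.48, 0.90, 0.10),
                 id    = 1:5)

centre <- mean(df$value)
# TRUE where a value is more than 0.1 away from the mean, in either direction
far <- abs(df$value - centre) > 0.1

# subset() keeps only the rows that pass the test, so the data frame shrinks
dropped <- subset(df, !far)

# Assigning NA keeps every row and blanks only the offending values
replaced <- df
replaced$value[far] <- NA

nrow(dropped)   # 3
nrow(replaced)  # 5
```

Which of the two is preferable depends on whether later steps need the rows to stay aligned across columns; replacing with NA keeps the shape of each data frame unchanged.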

score 3 · Answer 1 · answered Feb 14 '20 at 23:10

If the list of data frames is ChaseSubSet, define the function no_anomalies first and then run the lapply call that follows it. Note the argument offset, which sets how far from the mean of each vector a value may lie before it is treated as an anomaly (outlier?); the default is 0.1.

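no_anomalies <- function(x, offset = 0.1, na.rm = TRUE){
  # mean of the vector, ignoring NAs by default
  x.bar <- mean(x, na.rm = na.rm)
  # TRUE where a value lies more than `offset` below or above the mean
  away <- x < (x.bar - offset) | x > (x.bar + offset)
  # set only those positions to NA; all other values are kept
  is.na(x) <- which(away)
  x
}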

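newdata <- lapply(ChaseSubSet, function(DF){
  # apply no_anomalies to the numeric columns only
  is_num <- sapply(DF, is.numeric)
  DF[is_num] <- lapply(DF[is_num], no_anomalies, offset = 0.1)
  # return the modified data frame, so the result is again a list of data frames
  DF
})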
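
A small usage sketch of the code above, assuming a made-up list `toy` in place of ChaseSubSet (the names `toy`, `a`, `b`, `value` and `label` are hypothetical). It repeats no_anomalies so that it runs on its own.

```r
no_anomalies <- function(x, offset = 0.1, na.rm = TRUE){
  x.bar <- mean(x, na.rm = na.rm)
  away <- x < (x.bar - offset) | x > (x.bar + offset)
  is.na(x) <- which(away)
  x
}

# Hypothetical stand-in for ChaseSubSet: a list of two small data frames
toy <- list(
  a = data.frame(value = c(1.00, 1.02, 0.98, 1.00, 1.30),
                 label = c("p", "q", "r", "s", "t")),
  b = data.frame(value = c(2.00, 2.05, 1.95, 2.00, 2.00),
                 label = c("p", "q", "r", "s", "t"))
)

newdata <- lapply(toy, function(DF){
  is_num <- sapply(DF, is.numeric)
  DF[is_num] <- lapply(DF[is_num], no_anomalies, offset = 0.1)
  DF
})

# In `a` only the last value (1.30) is more than 0.1 from the mean (1.06)
newdata$a$value
# In `b` no value is more than 0.1 from the mean, so nothing changes
newdata$b$value
```

The non-numeric `label` columns are passed through untouched, and each data frame keeps its original number of rows.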

answered Feb 14 '20 at 23:10

Rui Barradas

Thank you @RuiBarradas this is very helpful. Just two questions if you have time: – Natasha Jones Feb 14 '20 at 23:31
1) The first column is still not being treated as numeric, despite this line in your code, as its values are in E notation – Natasha Jones Feb 14 '20 at 23:33
2) This code is deleting an entire column if one of the numbers is off instead of deleting just that number – Natasha Jones Feb 14 '20 at 23:34
1) Are you sure that column is numeric? What does `class(df$thatcol)` return? See [this SO post](https://stackoverflow.com/questions/44725001/convert-scientific-notation-to-numeric-preserving-decimals) and [this other SO post](https://stackoverflow.com/questions/40655701/scientific-notation-wrongly-converts-to-numbers). – Rui Barradas Feb 15 '20 at 06:23
2) Maybe you should use another criterion to decide what *is off*. Outlier removal is not easy because the concept of an outlier is very, very vague. There is a package [outliers](https://CRAN.R-project.org/package=outliers) that tries to identify outliers. Maybe you should take a look at it, and definitely rethink what you are trying to do. – Rui Barradas Feb 15 '20 at 06:27
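
A rough sketch touching on both points in the comments above, under stated assumptions. It assumes the E-notation column was read in as character, which is a guess and not something the thread confirms. The helpers `as_numeric_if_possible` and `no_anomalies_mad` are hypothetical names, and the median/MAD rule is just one possible alternative criterion, not the method from the answer or from the outliers package. One possible cause of a whole column turning into NA is a column whose scale is much larger than the fixed offset, or a single extreme value that drags the mean away from the rest of the data.

```r
# Hypothetical helper: convert a character column to numeric when every
# non-missing entry parses as a number (E notation such as "1.5E-02" parses too)
as_numeric_if_possible <- function(col){
  if (!is.character(col)) return(col)
  converted <- suppressWarnings(as.numeric(col))
  # Keep the original column if any non-missing entry failed to parse
  if (any(is.na(converted) & !is.na(col))) col else converted
}

# Hypothetical alternative to a fixed offset: flag values more than k scaled
# median absolute deviations from the median, so the threshold follows the
# scale of each column
no_anomalies_mad <- function(x, k = 3, na.rm = TRUE){
  centre <- median(x, na.rm = na.rm)
  spread <- mad(x, center = centre, na.rm = na.rm)
  away <- abs(x - centre) > k * spread
  is.na(x) <- which(away)
  x
}

# Made-up data: readings stored as text in E notation
DF <- data.frame(reading = c("1.00E+03", "1.02E+03", "9.80E+02",
                             "1.00E+03", "5.00E+03"),
                 site    = c("a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)

class(DF$reading)                           # "character"
DF[] <- lapply(DF, as_numeric_if_possible)  # `site` stays character
class(DF$reading)                           # "numeric"

# Only the last reading (5000) is replaced with NA
cleaned <- no_anomalies_mad(DF$reading)
```

With this toy column, a fixed offset of 0.1 around the mean (1800) would flag every value, whereas the MAD-based threshold flags only the 5000 reading. The choice of `k = 3` is arbitrary here and would need to be justified for real data.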
