I am new to R and am having issues trying to work with a large dataset. I have a variable called DifferenceMonths and I would like to create a subset of my large dataset with only observations where the variable DifferenceMonths is less than 3.

It is coded into R as a factor so I have tried multiple times to convert it to a numeric. It finally showed up as numeric in my Global Environment, but then I checked using str() and it still shows up as a factor variable.

Log:

DifferenceMonths<-as.numeric(levels(DifferenceMonths))[DifferenceMonths]

Warning message:
NAs introduced by coercion 

KRASDiff<-subset(KRASMCCDataset_final,DifferenceMonths<=2)

Warning message:
In Ops.factor(DifferenceMonths, 2) : ‘<=’ not meaningful for factors

str(KRASMCCDataset_final)

'data.frame':   7831 obs. of  25 variables:
 $ Age                : Factor w/ 69 levels "","21","24","25",..: 29 29 29 29 29 29 29 29 29 29 ...
 $ Alive.Dead         : Factor w/ 4 levels "","A","D","S": 2 2 2 2 2 2 2 2 2 2 ...
 $ Status             : Factor w/ 5 levels "","ambiguous",..: 4 4 5 5 4 5 5 5 4 5 ...
 $ DifferenceMonths   : Factor w/ 75 levels "","#NUM!","0",..: 14 14 14 14 14 14 14 14 14 14 ...

Thank you!

divibisan
    When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. A `str()` isn't as helpful as a `dput()` for testing possible solutions. It also looks like you clearly have values that are not numeric in there like "#NUM!" – MrFlick Mar 28 '18 at 19:21

1 Answer

It's ugly, but you want:

as.numeric(as.character(DifferenceMonths))

The problem here, which you may have discovered, is that as.numeric() applied to a factor gives you the internal integer codes, not the values. The values are stored as text in the levels. If you run as.numeric(levels(DifferenceMonths)) you do get those values, but only one per level, in the order of levels(DifferenceMonths), rather than one per observation. The way around this is to coerce to character first and get away from the internal integer codes altogether.
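To make the difference concrete, here is a minimal sketch on a made-up factor (the vector `x` and its values are invented for illustration and only loosely mimic DifferenceMonths, including a non-numeric "#NUM!" entry and a blank):

```r
# Toy factor standing in for DifferenceMonths (values are made up)
x <- factor(c("10", "2", "#NUM!", "2", ""))

# Internal integer codes: positions in levels(x), not the original values
codes <- as.numeric(x)

# One number per level (4 values here), not one per observation (5 values)
per_level <- suppressWarnings(as.numeric(levels(x)))

# One number per observation; "#NUM!" and "" cannot be parsed and become NA,
# which is where the "NAs introduced by coercion" warning comes from
vals <- as.numeric(as.character(x))

print(codes)
print(per_level)
print(vals)
```

The NA warning is expected whenever the column holds entries that are not numbers, so it is worth checking which rows turned into NA before subsetting.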

EDIT: I learned something today. See this answer

as.numeric(levels(DifferenceMonths))[DifferenceMonths]

Is the more efficient and preferred way, in particular if length(levels(DifferenceMonths)) is less than length(DifferenceMonths).
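As a quick check that the two idioms agree, the sketch below runs both on an invented factor (the name `x` and its values are assumptions for illustration only). Indexing with a factor uses its integer codes, so each observation picks up the converted value of its own level:

```r
# Made-up factor with repeated values
x <- factor(c("10", "2", "2", "0.5", "10"))

# Convert every observation via character
via_character <- as.numeric(as.character(x))

# Convert each level once, then look up by the factor's integer codes
via_levels <- as.numeric(levels(x))[x]

print(identical(via_character, via_levels))
```

The second form converts only the distinct levels, which is why it can be faster when there are far fewer levels than observations.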

EDIT 2: on review after @MrFlick's comment, and some initial testing, x <- as.numeric(levels(x))[x] can behave strangely. Try assigning it to a new variable name. Let me see if I can figure out how and when this behavior occurs.
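One possible reading of the log in the question, offered as an assumption rather than a confirmed diagnosis: the converted vector was stored as a separate object in the workspace, while the column inside KRASMCCDataset_final stayed a factor, and subset() looks the name up in the data frame first. A minimal sketch of converting the column in place and then subsetting, using a small invented data frame `df` in place of the real dataset:

```r
# Hypothetical stand-in for KRASMCCDataset_final
df <- data.frame(
  id = 1:6,
  DifferenceMonths = factor(c("0", "2", "#NUM!", "14", "", "1"))
)

# Overwrite the column inside the data frame, not a separate workspace object;
# non-numeric entries such as "#NUM!" become NA (warning suppressed here)
df$DifferenceMonths <- suppressWarnings(
  as.numeric(as.character(df$DifferenceMonths))
)

str(df$DifferenceMonths)

# subset() treats NA in the condition as FALSE, so NA rows are dropped
small <- subset(df, DifferenceMonths <= 2)
print(small)
```

If the file is read again, passing stringsAsFactors = FALSE to read.csv() avoids creating factors in the first place, although the non-numeric entries still need handling.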

De Novo