0

I have a data.frame in R that consists in a number of columns with numerical values. Like this:

   A       B      C
0.6057  0.1644  6.93
0.5723  0.117   6.59
0.5614  0.1552  7.02
0.4102  0.1059  5.24
0.4945  0.0857  6.64
0.5157  0.0747  7.06
0.7785  0.1394  5.21
0.5492  0.1557  6.06
0.5411  0.1884  5.68
0.6622  0.148   6.1

For each of these columns, I want to create a new column containing the quartile values. I have no problem doing it over one column at a time using this formula:

tableOne <- within(data, quartile <-
                    as.integer(cut(A, quantile(A, probs=0:5/5,na.rm=T))))

But as I have 100 columns with different names, I wanted to loop over each column separately.

I tried a loop without success:

for(i in names(data)){
  tableOne <- within(data, quarti <- as.integer(cut(i, quantile(i, probs=0:5/5,na.rm=T))))
}

I get the following error:

Error in cut.default(i, quantile(i, probs = 0:5/5, na.rm = T)) : 
  'x' must be numeric

I also tried apply function:

df.two <- lapply(df, function(x) within(data, quartile <- as.integer(cut(x, quantile(x, probs=0:5/5,na.rm=T)))))

with no success:

Error during wrapup: argument "obj" is missing, with no default
Error during wrapup: target context is not on the stack

Any advices on how to iterate my functions over all the columns and get all results in the same data.frame?

Thanks a lot

smci
  • 29,564
  • 18
  • 109
  • 144
user3188922
  • 309
  • 1
  • 3
  • 16

1 Answers1

5

See end of answer for a better approach, this one is for easy understanding of the steps.

I'm unsure what you're willing to do, but maybe this:

df2<- as.data.frame( lapply( df, function(x){
  as.integer( cut(x, quantile(x, probs=(0:5)/5, na.rm=T)))
}))
colnames(df2) <- paste0("quartile_",colnames(df))
df3 <- cbind(df,df2)

Which gives:

        A      B    C quartile_A quartile_B quartile_C
1  0.6057 0.1644 6.93          4          5          4
2  0.5723 0.1170 6.59          4          2          3
3  0.5614 0.1552 7.02          3          4          5
4  0.4102 0.1059 5.24         NA          2          1
5  0.4945 0.0857 6.64          1          1          4
6  0.5157 0.0747 7.06          2         NA          5
7  0.7785 0.1394 5.21          5          3         NA
8  0.5492 0.1557 6.06          3          4          2
9  0.5411 0.1884 5.68          2          5          2
10 0.6622 0.1480 6.10          5          3          3

Datas used:

> dput(df)
structure(list(A = c(0.6057, 0.5723, 0.5614, 0.4102, 0.4945, 
0.5157, 0.7785, 0.5492, 0.5411, 0.6622), B = c(0.1644, 0.117, 
0.1552, 0.1059, 0.0857, 0.0747, 0.1394, 0.1557, 0.1884, 0.148
), C = c(6.93, 6.59, 7.02, 5.24, 6.64, 7.06, 5.21, 6.06, 5.68, 
6.1)), .Names = c("A", "B", "C"), class = "data.frame", row.names = c(NA, 
-10L))

As per @DavidArenburg comment below a better way to achieve the same result is:

df[paste0("quartile_",colnames(df))] <- lapply(df, function(x) as.integer(cut(x, quantile(x, probs=(0:5)/5, na.rm = TRUE))))

This avoid creating a new dataframe and copying it over at end.

Tensibai
  • 15,304
  • 1
  • 37
  • 55
  • You could do this in one go using `df[paste0("quartile_",colnames(df))] – David Arenburg Dec 15 '15 at 17:58
  • @David I kept it in 3 steps to ease the understanding (and been too lazy to shorten it on 1 pass), that said you should add it as another answer for future readers ;) – Tensibai Dec 15 '15 at 18:01
  • 1
    I think you could just edit... I dont see a reason to convert to `data.frame` again, create a new object and then copy everything again using `cbind`. The above line isn't just shorter but is much more efficient I would think. – David Arenburg Dec 15 '15 at 18:03
  • @David and as it's yours, I don't wish to steal it ;) It's perfectly valid to have multiple answers, and here it's a different approach so it should stand on it's own to allow other to vote for the most useful/performant one to guide readers on their choose (In my humble opinion) – Tensibai Dec 15 '15 at 18:06