
I have been using data frames in R for quite some time, and I feel that I have a pretty good handle on what they can and can't do. Recently I have become interested in data tables because of their much more efficient lookups, but I have run into a bit of an issue right out of the gate.

Typically with a data frame I will assign rownames and use those for indexing later. The nice thing about doing this is that the rownames need not be a column in the data. So suppose I read in a CSV file of the form:

Name, val1, val2, …, valN

where Name is a (unique) string and the vals are numbers. Then I will set `rownames(x) = x[,1]` and remove the first column. Now I have an entirely numeric data frame that I can add, subtract, etc., without worrying about math operations on string fields. For example, `apply(x, 1, mean)` works with no problems.
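For concreteness, the data frame workflow described above might look like the following sketch. The data is a made-up stand-in for the CSV file, and the column names and values are assumptions for illustration only.

```r
# Hypothetical stand-in for the CSV described above (names and values are made up).
x <- data.frame(Name = c("a", "b", "c"),
                val1 = c(1, 2, 3),
                val2 = c(4, 5, 6),
                stringsAsFactors = FALSE)

rownames(x) <- x[, 1]   # use the unique Name column as row names
x <- x[, -1]            # drop the Name column; x is now entirely numeric

row_means <- apply(x, 1, mean)   # one mean per row, named by row name
b_row <- x["b", ]                # look up a single row by its name
```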

However, it seems that in data table world I would do something like this:

DT = as.data.table(x); setkey(DT, Name)

But now the character column sticks around. So suppose I want to take an average of each row. Do I now have to constantly tell it to only act on columns 2:ncol?

I assume there is a way around this, but my googling has come up empty.
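One possible way to avoid repeating `2:ncol` is data.table's `.SDcols` argument, which restricts `.SD` to a chosen set of columns. This is only a sketch under assumptions: the data is made up, and `num_cols` is a helper name introduced here, not something from the question.

```r
library(data.table)

# Hypothetical data in the same shape as the question (values are made up).
x <- data.frame(Name = c("b", "a", "c"),
                val1 = c(2, 1, 3),
                val2 = c(5, 4, 6))

DT <- as.data.table(x)
setkey(DT, Name)   # sorts by Name and enables keyed lookups

# Name the numeric columns once instead of writing 2:ncol everywhere.
num_cols <- setdiff(names(DT), "Name")

# Row means over the numeric columns only; the key column is left out of .SD.
row_means <- DT[, rowMeans(.SD), .SDcols = num_cols]

# Keyed lookup by Name, returning one numeric column.
b_val1 <- DT["b", val1]
```

The character column still exists in `DT`, but it only has to be excluded in one place.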

Konrad Rudolph
FSU79
  • 3
    In your first instance, you should probably be working with matrices rather than data.frames. `apply` converts a data.frame to a matrix anyway, and so this would avoid an additional copy. Also, you can take advantage of the super fast matrix operations like `crossprod` and `rowSums`. – lmo Aug 04 '17 at 13:30
  • 3
    If you are that concerned with row-wise operations, I'd say you probably shouldn't use either of these data structures. The appropriate data structure would appear to be a matrix. data.frames and data.tables are optimized for columnwise operations. – Roland Aug 04 '17 at 13:31
  • Obviously, you could also melt the data.table and use the "by" syntax. – Roland Aug 04 '17 at 13:32
  • These are fair comments. My example was contrived to illustrate a point. In reality I spend a lot of time filtering data and thought data.table could help with this. Here is a more realistic example of what I might do. Suppose X is a structure of two columns, both strings. Y is a structure like I said above - String, Num, Num, Num. Suppose for each pair of strings in X, I want to compute the vector difference. The way I would do this is Y[X[,1],] - Y[X[,2],]. This works fine using data.frames and setting rownames on Y. Is there a better way to achieve this? – FSU79 Aug 04 '17 at 13:37
  • Re your new example in the comments, if you build a real example with reproducible input and desired output, that might be easier to answer. Generally, this site is for concrete programming problems. See https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 – Frank Aug 04 '17 at 14:28
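
The indexing expression from FSU79's comment, `Y[X[,1],] - Y[X[,2],]`, also works on a plain matrix with row names, which is the structure lmo and Roland suggest. A minimal sketch with made-up data (the contents of `Y` and `X` are assumptions for illustration):

```r
# Hypothetical numeric data with names stored as row names, not as a column.
Y <- matrix(c(1, 2, 3,
              4, 5, 6,
              7, 8, 9),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("val1", "val2", "val3")))

# Hypothetical pairs of names, one pair per row.
X <- cbind(c("a", "c"), c("b", "a"))

# Vector difference for each pair; drop = FALSE keeps a matrix even for one pair.
diffs <- Y[X[, 1], , drop = FALSE] - Y[X[, 2], , drop = FALSE]

# Row-wise summaries use the fast matrix helpers mentioned in the comments.
row_means <- rowMeans(Y)
```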

0 Answers