When I remove columns, I always do something like:

DT[, Tax:=NULL]

Sometimes, to make a backup, I do something like:

DT2 <- DT

But just a second ago this happened:

library(data.table)
DT <- structure(list(Province = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 
3), Tax = c(2000, 3000, 1500, 3200, 2000, 1500, 4000, 2000, 2000, 
1000, 2000, 1500), year = c(2000, 2000, 2000, 2001, 2001, 2001, 
2002, 2002, 2002, 2003, 2003, 2003)), row.names = c(NA, -12L), class = c("tbl_df", 
"tbl", "data.frame"))

DT2 <- structure(list(Province = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 
3), Tax = c(2000, 3000, 1500, 3200, 2000, 1500, 4000, 2000, 2000, 
1000, 2000, 1500), year = c(2000, 2000, 2000, 2001, 2001, 2001, 
2002, 2002, 2002, 2003, 2003, 2003)), row.names = c(NA, -12L), class = c("tbl_df", 
"tbl", "data.frame"))

setDT(DT) 
setDT(DT2)
DT2 <- DT

# Removes Tax in BOTH datasets !!
DT2[, Tax:=NULL]

I remember reading something about this when I started learning data.table, but this behavior is not what I want here.

What is the proper way to deal with this without accidentally deleting columns?

Tom

1 Answer

(Moved from comments.)

data.table uses reference semantics: := modifies the table in place, rather than copying on modification like most of R. Your assignment DT2 <- DT therefore leaves both variables pointing to the same data, and an in-place change made through one name is visible through the other. This is one of the gotchas with "memory-efficient operations" that rely on in-place work: if you goof, you lose it. Any approach that protects you against this kind of mistake will be memory-inefficient, because it keeps one (or more) copies of the data sitting around.
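To make the contrast concrete, here is a minimal sketch comparing a plain data.frame with a data.table. The object names (df, df2, dt, dt2) and the toy columns are made up for illustration and are not from the question.

```r
library(data.table)

# Base R data.frame: modifying df2 triggers a copy, so df is untouched.
df  <- data.frame(a = 1:3, b = 4:6)
df2 <- df
df2$b <- NULL
names(df)   # "a" "b"

# data.table: := works in place, and dt2 is just another name for dt.
dt  <- data.table(a = 1:3, b = 4:6)
dt2 <- dt
dt2[, b := NULL]
names(dt)   # "a"
```

The only difference between the two halves is the in-place operator: plain assignment behaves the same in both cases, but := never makes the copy that base R would make on modification.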

If you need DT2 to be a different dataset, then use

DT2 <- copy(DT)

after which DT2[,Tax:=NULL] will not affect DT.
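As a rough check, data.table's address() can show whether two names refer to the same object. The sketch below uses a shortened version of the question's data; the names alias and backup are hypothetical.

```r
library(data.table)

DT <- data.table(Province = c(1, 2, 3),
                 Tax      = c(2000, 3000, 1500),
                 year     = c(2000, 2000, 2000))

alias  <- DT         # second name bound to the same object
backup <- copy(DT)   # independent copy with its own memory

# Compare memory addresses: alias shares DT's, backup does not.
same_as_alias  <- identical(address(alias),  address(DT))   # TRUE
same_as_backup <- identical(address(backup), address(DT))   # FALSE

# Dropping the column through the alias also drops it from DT ...
alias[, Tax := NULL]
names(DT)       # "Province" "year"

# ... while the copy keeps it.
names(backup)   # "Province" "Tax" "year"
```

The copy costs extra memory, which is the trade-off described above, so it makes sense to call copy() only when you really need a second, independent table.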

I find Matt Dowle's answer here informative (though that question asked explicitly about copy, not just the behavior you mentioned).

r2evans