Accidental column deletion - What is the proper way of creating data.table backups and deleting data.table columns

Question

When I used to remove columns, I would always do something like:

DT[, Tax:=NULL]

Sometimes, to make a backup, I would do something like:

DT2 <- DT

But just a second ago this happened:

library(data.table)
DT <- structure(list(Province = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 
3), Tax = c(2000, 3000, 1500, 3200, 2000, 1500, 4000, 2000, 2000, 
1000, 2000, 1500), year = c(2000, 2000, 2000, 2001, 2001, 2001, 
2002, 2002, 2002, 2003, 2003, 2003)), row.names = c(NA, -12L), class = c("tbl_df", 
"tbl", "data.frame"))

DT2 <- structure(list(Province = c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 
3), Tax = c(2000, 3000, 1500, 3200, 2000, 1500, 4000, 2000, 2000, 
1000, 2000, 1500), year = c(2000, 2000, 2000, 2001, 2001, 2001, 
2002, 2002, 2002, 2003, 2003, 2003)), row.names = c(NA, -12L), class = c("tbl_df", 
"tbl", "data.frame"))

setDT(DT) 
setDT(DT2)
DT2 <- DT

# Removes Tax in BOTH datasets !!
DT2[, Tax:=NULL]

I remember something about this when starting to learn about data.table, but obviously this is not really desirable (for me at least).

What is the proper way to deal with this without accidentally deleting columns?

Since `data.table` uses referential semantics (*in-place*, not copy-on-write like most of R), then your assignment `DT2 — r2evans, Dec 03 '20 at 14:01
You can find your answer here: https://stackoverflow.com/a/10226454/3768871 — OmG, Dec 03 '20 at 14:03
You can also read FAQ. @r2evans please make an answer from your comment. — jangorecki, Dec 03 '20 at 18:30

score 2 · Accepted Answer · answered Dec 03 '20 at 18:57

(Moved from comments.)

data.table uses reference semantics: := modifies a table in place, rather than copy-on-write like most of R. Your assignment DT2 <- DT therefore does not create a backup; both variables point to the same data, so deleting a column through one name deletes it for the other as well. This is one of the gotchas with "memory-efficient operations" that rely on in-place work: if you goof, you lose it. Any protection against this kind of mistake costs memory, because it means keeping one (or more) copies of the data around.
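To see the sharing directly, here is a minimal sketch using a small made-up table (not the data from the question). It relies on data.table's address() helper to compare where the two names point; the variable same_object is just an illustrative name.

```r
library(data.table)

# Small illustrative table (values are made up for this sketch)
DT <- data.table(Province = c(1, 2, 3), Tax = c(2000, 3000, 1500))

# Plain assignment: no copy is made, DT2 is another name for the same table
DT2 <- DT

# address() reports where an object lives in memory
same_object <- address(DT) == address(DT2)
print(same_object)

# := works in place on the shared table, so the column is gone
# when accessed through either name
DT2[, Tax := NULL]
print(names(DT))
```

Comparing addresses like this can be a quick check before running := on something you believe is a backup.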

If you need DT2 to be a different dataset, then use

DT2 <- copy(DT)

after which DT2[,Tax:=NULL] will not affect DT.
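As a rough sketch on a small made-up table (not the data from the question), the copy() approach looks like the first option below. The second option is an addition of mine rather than part of the original comment: selecting all columns except Tax with ! returns a new data.table and leaves the original untouched, which may be enough if you never needed to modify in place. DT3 is just an illustrative name.

```r
library(data.table)

# Small illustrative table (values are made up for this sketch)
DT <- data.table(Province = c(1, 2, 3), Tax = c(2000, 3000, 1500))

# Option 1: take a real copy, then delete by reference on the copy only
DT2 <- copy(DT)
DT2[, Tax := NULL]

# Option 2: select everything except Tax; this returns a new data.table
# and does not modify DT
DT3 <- DT[, !c("Tax")]

print(names(DT))   # the original still has Tax
print(names(DT2))
print(names(DT3))
```

Both options hold a second copy of the remaining columns in memory, which is the trade-off described above.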

I find Matt Dowle's answer here to be informative and helpful (though that question explicitly asked about copy, not just the behavior you mentioned).

Thank you very much. So weird that it took so long for it to go wrong. — Tom, Dec 04 '20 at 08:00
