25

I frequently need to recode some (not all!) values in a data frame column based off of a look-up table. I'm not satisfied by the ways I know of to solve the problem. I'd like to be able to do it in a clear, stable, and efficient way. Before I write my own function, I'd want to make sure I'm not duplicating something standard that's already out there.

## Toy example
data = data.frame(
  id = 1:7,
  x = c("A", "A", "B", "C", "D", "AA", ".")
)

lookup = data.frame(
  old = c("A", "D", "."),
  new = c("a", "d", "!")
)

## desired result
#   id  x
# 1  1  a
# 2  2  a
# 3  3  B
# 4  4  C
# 5  5  d
# 6  6 AA
# 7  7  !

I can do it with a join, coalesce, unselect as below, but this isn't as clear as I'd like - too many steps.

## This works, but is more steps than I want
library(dplyr)
data %>%
  left_join(lookup, by = c("x" = "old")) %>%
  mutate(x = coalesce(new, x)) %>%
  select(-new)

It can also be done with dplyr::recode, as below, converting the lookup table to a named lookup vector. I prefer lookup as a data frame, but I'm okay with the named vector solution. My concern here is that recode is the Questioning lifecycle phase, so I'm worried that this method isn't stable.

lookup_v = pull(lookup, new) %>% setNames(lookup$old)
data %>%
  mutate(x = recode(x, !!!lookup_v))

It could also be done with, say, stringr::str_replace, but using regex for whole-string matching isn't efficient. I suppose there is forcats::fct_recode is a stable version of recode, but I don't want a factor output (though mutate(x = as.character(fct_recode(x, !!!lookup_v))) is perhaps my favorite option so far...).

I had hoped that the new-ish rows_update() family of dplyr functions would work, but it is strict about column names, and I don't think it can update the column it's joining on. (And it's Experimental, so doesn't yet meet my stability requirement.)

Summary of my requirements:

  • A single data column is updated based off of a lookup data frame (preferably) or named vector (allowable)
  • Not all values in the data are included in the lookup--the ones that are not present are not modified
  • Must work on character class input. Working more generally is a nice-to-have.
  • No dependencies outside of base R and tidyverse packages (though I'd also be interested in seeing a data.table solution)
  • No functions used that are in lifecycle phases like superseded or questioning. Please note any experimental lifecycle functions, as they have future potential.
  • Concise, clear code
  • I don't need extreme optimization, but nothing wildly inefficient (like regex when it's not needed)
Jon Spring
  • 40,151
  • 4
  • 32
  • 50
Gregor Thomas
  • 119,032
  • 17
  • 152
  • 277
  • `data %>% mutate(x = str_replace_all(x, setNames(lookup$new, lookup$old)))` is certainly less efficient than the matching and join, however, I'm not sure whether it's `wildly inefficient` with decent sized datasets. – tmfmnk Apr 13 '21 at 20:38
  • 1
    Well, to be safe you'd need to use something like `sprintf("^%s$", lookup$new)` as the pattern, and even then I'd have concerns about regex special characters in the old names. I'm biased from an old project I used to work on where regex was a performance bottleneck (quite complex regex, mostly), but I definitely have a pet peeve about using regex for exact whole string matches. But on 100k rows my guess is `str_replace_all` will be at least 10x slower than most of these other solutions. – Gregor Thomas Apr 13 '21 at 22:52
  • If the `recode` and `fct_recode` variants can't deal with `.`, then they're not general enough. – Gregor Thomas Apr 20 '21 at 13:30

10 Answers10

15

A direct data.table solution, without %in%.
Depending on the length of the lookup / data tables, adding keys could improve performance substantially, but this isn't the case on this simple example.

library(data.table)

setDT(data)
setDT(lookup)

## If needed
# setkey(data,x)
# setkey(lookup,old)

data[lookup, x:=new, on=.(x=old)]
data 

   id  x
1:  1  a
2:  2  a
3:  3  B
4:  4  C
5:  5  d
6:  6 AA
7:  7  !
Waldi
  • 31,868
  • 6
  • 18
  • 66
  • This is the `data.table` solution I was looking for! Perfect! I didn't think it would work so nicely without a `coalesce` type step. – Gregor Thomas Apr 16 '21 at 14:38
  • @Gregor, glad I could bring the `data.table` solution you were looking for! Would you mind putting `data.table` in the tags, as it also belongs to your question? – Waldi Apr 16 '21 at 20:52
  • 7
    The best part? This code would have worked 8+ years ago, and it will likely work 10 years from today - the respect the `data.table` development team has for their users in regard to avoiding flippant name changes and frequent deprecations makes it a very stable solution compared to some other popular packages. – Matt Summersgill Apr 22 '21 at 02:47
14

Benchmarking

Expanding the original dataset to 10M rows, 15 runs using microbenchmark gave the follow results on my computer:

Note that forcats::fct_recode and dplyr::recode solutions mentioned by the OP have also been included. Neither works with the updated data because the named vector that resolves to . = ! will throw an error, which is why results are tested on the original dataset.

data = data.frame(
  id = 1:5,
  x = c("A", "A", "B", "C", "D")
)

lookup = data.frame(
  old = c("A", "D"),
  new = c("a", "d")
)

set.seed(1)
data <- data[sample(1:5, 1E7, replace = T),]

dt_lookup <- data.table::copy(lookup)

dplyr_coalesce <- function(){
  library(dplyr)
  lookupV <- setNames(lookup$new, lookup$old)
  data %>% 
    dplyr::mutate(x = coalesce(lookupV[ x ], x))
}

datatable_in <- function(){
  library(data.table)
  lookupV <- setNames(lookup$new, lookup$old)
  setDT(dt_data)
  dt_data[ x %in% names(lookupV), x := lookupV[ x ] ]
}

datatable <- function(){
  library(data.table)
  
  setDT(dt_data)
  setDT(dt_lookup)
  
  ## If needed
  # setkey(data,x)
  # setkey(lookup,old)
  
  dt_data[dt_lookup, x:=new, on =.(x=old)]
}

purrr_modify_if <- function(){
  library(dplyr)
  library(purrr)
  lookupV <- setNames(lookup$new, lookup$old)
  data %>% 
    dplyr::mutate(x = modify_if(x, x %in% lookup$old, ~ lookupV[.x]))
}

stringr_str_replace_all_update <- function(){
  library(dplyr)
  library(stringr)
  lookupV <- setNames(lookup$new, do.call(sprintf, list("^\\Q%s\\E$", lookup$old)))
  
  data %>% 
    dplyr::mutate(x = str_replace_all(x, lookupV))
}

base_named_vector <- function(){
  lookupV <- c(with(lookup, setNames(new, old)), rlang::set_names(setdiff(unique(data$x), lookup$old)))
  lookupV[data$x]
}

base_ifelse <- function(){
  lookupV <- setNames(lookup$new, lookup$old)
  with(data, ifelse(x %in% lookup$old, lookup$new, x))
}

plyr_mapvalues <- function(){
  library(plyr)
  data %>% 
    dplyr::mutate(x = plyr::mapvalues(x, lookup$old, lookup$new, warn_missing = F))
}

base_match <- function(){
  tochange <- match(data$x, lookup$old, nomatch = 0)
  data$x[tochange > 0] <- lookup$new[tochange]
}

base_local_safe_lookup <- function(){
  lv <- structure(lookup$new, names = lookup$old)
  
  safe_lookup <- function(val) {
    new_val <- lv[val]
    unname(ifelse(is.na(new_val), val, new_val))
  }
  
  safe_lookup(data$x)
}

dplyr_recode <- function(){
  library(dplyr)
  lookupV <- setNames(lookup$new, lookup$old)
  
  data %>%
    dplyr::mutate(x = recode(x, !!!lookupV))
}

base_for <- function(){
  for (i in seq_len(nrow(lookup))) {
    data$x[data$x == lookup$old[i]] = lookup$new[i]
  }
}

datatable_for <- function(){
  library(data.table)
  setDT(dt_data)
  
  for (i in seq_len(nrow(lookup))) {
    dt_data[x == lookup$old[i], x := lookup$new[i]]
  }
}

forcats_fct_recode <- function(){
  library(dplyr)
  library(forcats)
  lookupV <- setNames(lookup$new, lookup$old)
  
  data %>% 
    dplyr::mutate(x = as.character(fct_recode(x, !!!lookupV)))
  
}

datatable_set <- function(){
  library(data.table)
  setDT(dt_data)
  
  tochange <- dt_data[, chmatch(x, lookup$old, nomatch = 0)]
  set(dt_data, i = which(tochange > 0), j = "x", value = lookup$new[tochange])
}

library(microbenchmark)
bench <- microbenchmark(dplyr_coalesce(),
                        datatable(),
                        datatable_in(),
                        datatable_for(),
                        base_for(),
                        purrr_modify_if(),
                        stringr_str_replace_all_update(),
                        base_named_vector(),
                        base_ifelse(),
                        plyr_mapvalues(),
                        base_match(),
                        base_local_safe_lookup(),
                        dplyr_recode(),
                        forcats_fct_recode(),
                        datatable_set(),
                        times = 15L,
                        setup = dt_data <- data.table::copy(data))

bench$expr <- forcats::fct_rev(forcats::fct_reorder(bench$expr, bench$time, mean))
ggplot2::autoplot(bench)

enter image description here

Thanks to @Waldi and @nicola for advice implementing data.table solutions in the benchmark.

LMc
  • 8,572
  • 3
  • 21
  • 35
  • @LMc, interested to know how the new data.table solution I just posted compares to other solutions. I'm not sure `data – Waldi Apr 16 '21 at 05:58
  • 2
    Sure the data.table solution should not count on having `copy` of a large object inside it. You can make a copy outside and pass that copy to the function and make a fairer comparison I guess. – nicola Apr 16 '21 at 08:34
  • You should exclude the `library` and `setDT` calls from the benchmark. – ismirsehregal Feb 01 '22 at 22:31
9

Combination of a named vector and coalesce:

# make lookup vector
lookupV <- setNames(lookup$new, lookup$old)

data %>% 
  mutate(x = coalesce(lookupV[ x ], x))
#   id x
# 1  1 a
# 2  2 a
# 3  3 B
# 4  4 C
# 5  5 d

Or data.table:

library(data.table)

setDT(data)
data[ x %in% names(lookupV), x := lookupV[ x ] ]

This post might have a better solution for data.table - "update on merge":

zx8754
  • 46,390
  • 10
  • 104
  • 180
8

A base R option using %in% and match - thanks to @LMc & @nicola

tochange <- match(data$x, lookup$old, nomatch = 0)
data$x[tochange > 0] <- lookup$new[tochange]

One more data.table option using set() and chmatch

library(data.table)
setDT(data)

tochange <- data[, chmatch(x, lookup$old, nomatch = 0)]
set(data, i = which(tochange > 0), j = "x", value = lookup$new[tochange])

Result

data
#  id  x
#1  1  a
#2  2  a
#3  3  B
#4  4  C
#5  5  d
#6  6 AA
#7  7  !
markus
  • 24,556
  • 5
  • 34
  • 51
7

modify_if

You could use purrr::modify_if to only apply the named vector to values that exist in it. Though not a specified requirement, it has the benefit of the .else argument, which allows you to apply a different function to values not in your lookup.

I also wanted to include the use of tibble::deframe here to create the named vector. It is slower than setNames, though.

lookupV <- deframe(lookup)

data %>% 
  mutate(x = modify_if(x, x %in% lookup$old, ~ lookupV[.x]))

str_replace_all

Alternatively, you could use stringr::str_replace_all, which can take a named vector for the replacement argument.

data %>% 
  mutate(x = str_replace_all(x, lookupV))

Update

To accommodate the change to your edited example, the named vector used in str_replace_all needs to be modified. In this way, the entire literal string needs to be match so that "A" does not get substituted in "AA", or "." does not replace everything:

lookupV <- setNames(lookup$new, do.call(sprintf, list("^\\Q%s\\E$", lookup$old)))

data %>% 
  mutate(x = str_replace_all(x, lookupV))

left_join

Using dplyr::left_join this is very similar to OP solution, but uses .keep argument of mutate so it has less steps. This argument is currently in the experimental lifecycle and so it is not included in the benchmark (though it is around the middle of posted solutions).

left_join(data, lookup, by = c("x" = "old")) %>% 
    mutate(x = coalesce(new, x) , .keep = "unused")

Base R

Named Vector

Create a substitution value for every unique value in your dataframe.

lookupV <- c(with(lookup, setNames(new, old)), setNames(nm = setdiff(unique(data$x), lookup$old)))

data$x <- lookupV[data$x]

ifelse

with(data, ifelse(x %in% lookup$old, lookupV[x], x))
LMc
  • 8,572
  • 3
  • 21
  • 35
  • 1
    Given the use of unnecessary regex and how slow `str_replace_all` is relative to other solutions, as you predicted I don't think it's a very viable option. – LMc Apr 14 '21 at 15:44
  • 1
    Also, I've never used `\Q` and `\E` in regex, so thanks for showing off that trick! – Gregor Thomas Apr 14 '21 at 16:19
5

Another option that is clear is to use a for-loop with subsetting to loop through the rows of the lookup table. This will almost always be quicker with data.table because of auto indexing, or if you set the key (i.e., ?data.table::setkey()) ahead of time. Also, it will--of course--get slower as the lookup table gets longer. I would guess an update-join would be preferred if there is a long lookup table.

Base R:

for (i in seq_len(nrow(lookup))) {
  data$x[data$x == lookup$old[i]] <- lookup$new[i]
}

data$x
# [1] "a"  "a"  "B"  "C"  "d"  "AA" "!" 

Or the same logic with data.table:

library(data.table)
setDT(data)

for (i in seq_len(nrow(lookup))) {
  data[x == lookup$old[i], x := lookup$new[i]]
}

data$x
# [1] "a"  "a"  "B"  "C"  "d"  "AA" "!" 

Data:

data = data.frame(
  id = 1:7,
  x = c("A", "A", "B", "C", "D", "AA", ".")
)

lookup = data.frame(
  old = c("A", "D", "."),
  new = c("a", "d", "!")
)
Andrew
  • 4,858
  • 2
  • 10
  • 20
4

Another base solution, with a lookup vector:

## Toy example
data = data.frame(
  id = 1:5,
  x = c("A", "A", "B", "C", "D"),
  stringsAsFactors = F
)

lookup = data.frame(
  old = c("A", "D"),
  new = c("a", "d"),
  stringsAsFactors = F
)

lv <- structure(lookup$new, names = lookup$old)

safe_lookup <- function(val) {
  new_val <- lv[val]
  unname(ifelse(is.na(new_val), val, new_val))
}

data$x <- safe_lookup(data$x)
SmokeyShakers
  • 3,172
  • 1
  • 6
  • 17
4

dplyr+plyr solution that is in order with all ur bulletpoints (if u consider plyr in the the tidyverse):

data <- data %>% 
  dplyr::mutate(
    x = plyr::mapvalues(x, lookup$old, lookup$new) #Can add , F to remove warnings
  )
Baraliuh
  • 568
  • 2
  • 12
  • 1
    I consider the tidyverse to be all packages attached with the command `library(tidyverse)`, so I'm afraid not. Though I'm pretty sure it is on the dependency chain, so maybe it's marginal... – Gregor Thomas Apr 14 '21 at 13:28
  • 2
    Ah, that's too bad, as it is quite an elegant solution (golf-coding, robustness, readability, and performance). – Baraliuh Apr 14 '21 at 22:27
4

I basically share the same problem. Although dplyr::recode is in the "questioning" life cycle I don't expect it to become deprecated. At some point it might be superseded, but even in this case it should still be usable. Therefore I'm using a wrapper around dplyr::recode which allows the use of named vectors and or two vectors (which could be a lookup table).

library(dplyr)
library(rlang)

recode2 <- function(x, new, old = NULL, .default = NULL, .missing = NULL) {
  
  if (!rlang::is_named(new) && !is.null(old)) {
    new <- setNames(new, old)
  }
  
  do.call(dplyr::recode,
          c(.x = list(x),
            .default = list(.default),
            .missing = list(.missing),
            as.list(new)))
  
}

data = data.frame(
  id = 1:7,
  x = c("A", "A", "B", "C", "D", "AA", ".")
)

lookup = data.frame(
  old = c("A", "D", "."),
  new = c("a", "d", "!")
)

# two vectors new / old
data %>% 
  mutate(x = recode2(x, lookup$new, lookup$old))
#>   id  x
#> 1  1  a
#> 2  2  a
#> 3  3  B
#> 4  4  C
#> 5  5  d
#> 6  6 AA
#> 7  7  !

# named vector
data %>% 
  mutate(x = recode2(x, c("A" = "a",
                          "D" = "d",
                          "." = "!")))
#>   id  x
#> 1  1  a
#> 2  2  a
#> 3  3  B
#> 4  4  C
#> 5  5  d
#> 6  6 AA
#> 7  7  !

Created on 2021-04-21 by the reprex package (v0.3.0)

TimTeaFan
  • 11,990
  • 2
  • 14
  • 32
0

I essentially proportion the same hassle. Although dplyr::recode is inside the "thinking" life cycle I do not anticipate it to become deprecated. At some point it might be outmoded, but even in this situation it need to still be usable website. Therefore I'm using a wrapper round dplyr::recode which permits the usage of named vectors and or two vectors (which can be a research desk).

Watessofw
  • 1
  • 1