3

When concatenating strings using dplyr, group_by() and mutate() or summarize () with paste() and collapse, NA values are coerced to a string "NA".

When using str_c() instead of paste(), strings that are concatenated with NA are removed (?str_c: whenever a missing value is combined with another string the result will always be missing). When having such combinations of NA & non-NA values, how can I remove the NA instead of the non-NA in the concatenation?

See my example below:

library(dplyr)
library(stringr)
ID <- c(1,1,2,2,3,4)
string <- c(' asfdas ', 'sdf', NA,'sadf', 'NA', NA)
df <- data.frame(ID, string)
#   ID   string
# 1  1  asfdas 
# 2  1      sdf
# 3  2     <NA> # ID 2 has both NA and non-NA values
# 4  2     sadf #
# 5  3       NA
# 6  4     <NA>

Both,

df%>%
 group_by(ID)%>%
 summarize(string = paste(string, collapse = "; "))%>%
 distinct_all()

and

df_conca <-df%>%
 group_by(ID)%>%
 dplyr::mutate(string = paste(string, collapse = "; "))%>%
 distinct_all()

result in

     ID string               
1     1 " asfdas ; sdf"
2     2 "NA; sadf"           
3     3 "NA"
4     4 "NA" # NA coerced to "NA"

I.e. NA becomes "NA":

while

df %>%
  group_by(ID)%>%
  summarize(string = str_c(string, collapse = "; "))

results in:

     ID string               
1     1 " asfdas ; sdf"
2     2 NA     
3     3 "NA" 
4     4 NA 

I.e. "sadf" is removed according to the str_c rule: NA combined with string, results in NA.

However, I would like to keep the true NA values (in e.g. 'ID' 4) and the strings only (in e.g. 'ID' 2), as such:

     ID string             
1     1 " asfdas ; sdf"
2     2 "sadf"           
3     3 "NA"
4     4 NA 

Ideally, I would like to stay within the dplyr workflow.


This question is an extension of Concatenating strings / rows using dplyr, group_by & collapse or summarize, but maintain NA values

Henrik
  • 61,039
  • 13
  • 131
  • 152
MsGISRocker
  • 486
  • 4
  • 16
  • Maybe `df %>% group_by(ID) %>% summarise(str_c(if(any(!is.na(string))) na.omit(string) else string, collapse = ","))` – Henrik Sep 23 '21 at 16:14

4 Answers4

2

Using pivot_wider and unite

library(dplyr)
library(tidyr)
library(data.table)
df %>% 
   mutate(rn = rowid(ID)) %>%
   pivot_wider(names_from = rn, values_from = string) %>% 
   unite(string, `1`, `2`, na.rm = TRUE, sep = " ; ")%>%
   mutate(string = na_if(string, ""))

-output

# A tibble: 4 x 2
     ID string          
  <dbl> <chr>           
1     1 " asfdas  ; sdf"
2     2 "sadf"          
3     3 "NA"            
4     4  <NA>         

Or may also use coalesce

df %>%
    group_by(ID) %>%
    summarise(string = na_if(coalesce(str_c(string, collapse = " ; "),
     str_c(string[complete.cases(string)], collapse = " ; ")), ""))

-output

# A tibble: 4 x 2
     ID string          
  <dbl> <chr>           
1     1 " asfdas  ; sdf"
2     2 "sadf"          
3     3 "NA"            
4     4  <NA>          
akrun
  • 789,025
  • 32
  • 460
  • 575
  • 1
    The second solution requires 1/10 of the processing time as compared to the first solution. It is also less code intense. – MsGISRocker Oct 10 '21 at 07:37
2

Here's a solution in the dplyr framework. This removes all the 'NA' values using filter() - which initially loses ID 4 - then replaces the missing ID using a join.

df_IDs <- data.frame(ID = unique(df$ID))
df%>%
  group_by(ID)%>%
  filter(!is.na(string)) %>%
  summarize(string = paste(string, collapse = "; ")) %>%
  full_join(df_IDs, by = "ID")

results in

     ID string                  
1     1 " asfdas ; sdf"
2     2 "sadf"         
3     3 "NA"           
4     4  NA  
1

So what is the best answer to be granted the accepted answer endorsement?

I enlarged the sample data and ran a short benchmark.

ID <- sample(1:4, 1000000, replace = T)
string <-  sample(c(' asfdas ', 'sdf', NA,'sadf', 'NA', NA), 1000000, replace = T)
df <- data.frame(ID, string)

An the winner is arkun's second answer. Shortest code and shortest processing time. However, the processing time differs only by milliseconds (besides arkun's first answer, which takes tenfold).

df %>%
    group_by(ID) %>%
    summarise(string = na_if(coalesce(str_c(string, collapse = " ; "),
     str_c(string[complete.cases(string)], collapse = " ; ")), ""))

Anyhow, I guess it should be possible accepting multiple answers within stack exchange, as different answers might work best in different circumstances.

Besides, the behavior of dplyr::mutate(string = paste(string, collapse = "; ")) seems quite unexpected to me and would be worth to be changed with some dplyr update.

MsGISRocker
  • 486
  • 4
  • 16
0

Thanks all for your efforts, I came up with my own answer in the meanwhile:

 replace(is.na(.),'XXX_MY_NAs_XXXX')%>%
 group_by(ID)%>%
 summarize(string = paste(string, collapse = "; "))%>%
 dplyr::mutate_all(funs(str_replace_all(., c('XXX_MY_NAs_XXXX; ' = ''
                                          ,'; XXX_MY_NAs_XXXX' = ''))))%>%
 na_if(., 'XXX_MY_NAs_XXXX')
MsGISRocker
  • 486
  • 4
  • 16