Removing all characters in a variable after a specific character in R

Question

I have a dataset df1 like so:

snp <- c("rs7513574_T", "rs1627238_A", "rs1171278_C")
p.value <- c(2.635489e-01, 9.836280e-01, 6.315047e-01)

df1 <- data.frame(snp, p.value)

I want to remove the underscore `_` and the letter after it (the allele) from the `snp` column of df1, and store the result in a new data frame, df2.

I tried this using the code:

df2 <- df1[,c("snp", "allele"):=tstrsplit(`snp`, "_", fixed = TRUE)]

However, this changes the df1 data frame. Is there another way to do this?

Okay, I've cleaned up your example by adding quotes to the strings, gotten rid of `rbind` which was making things rows when you wanted them to be columns, got rid of `matrix()` which was converting everything to `character`. Could you run the sample data code and verify that it is accurate? — Gregor Thomas, Apr 01 '21 at 14:08
Also, your question text just says you want to remove the `_` and the letter after it, but your code seems to be attempting to put the letter after it into a new column called `"allele"` - if you want to do that you should mention it in the text. — Gregor Thomas, Apr 01 '21 at 14:10
That looks great and is an accurate representation of the dataset, thank you! — codemachino, Apr 01 '21 at 14:12
Quite welcome. Next time, please do test your sample data code before posting it :) It saves time for everyone, especially someone like user438383 who made some assumptions and a couple answer attempts based on bad input. — Gregor Thomas, Apr 01 '21 at 14:13

score 1 · Answer 1 · answered Apr 01 '21 at 14:12


This is my best guess as to what you want:

library(tidyr)
separate(df1, snp, into = c("snp", "allele"), sep = "_")
#         snp allele   p.value
# 1 rs7513574      T 0.2635489
# 2 rs1627238      A 0.9836280
# 3 rs1171278      C 0.6315047
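
If only the cleaned `snp` column is needed, one possible variation (not part of the original answer) is to pass `NA` in `into`, which tidyr documents as a way to omit a piece from the output. A minimal sketch, assuming the sample data from the question; `stringsAsFactors = FALSE` is added here only so the behaviour is the same on older R versions. `separate()` returns a new data frame, so `df1` should be left untouched:

```r
library(tidyr)

# Sample data from the question
snp <- c("rs7513574_T", "rs1627238_A", "rs1171278_C")
p.value <- c(2.635489e-01, 9.836280e-01, 6.315047e-01)
df1 <- data.frame(snp, p.value, stringsAsFactors = FALSE)

# NA in `into` discards the piece after the underscore instead of keeping it
df2 <- separate(df1, snp, into = c("snp", NA), sep = "_")
```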

answered Apr 01 '21 at 14:12

Gregor Thomas


As there is only one delimiter, you could also do this without specifying `sep`: `separate(df1, snp, into = c("snp", "allele"))` – akrun Apr 02 '21 at 01:01

user438383 · Answer 2 · 2021-04-01T14:07:47.257

0

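library(dplyr)  # provides %>% and across()
# Targets the columns V1, V2, V3 of the question's earlier sample data
df2 = df1 %>% 
    dplyr::mutate(across(c(V1, V2, V3), ~stringr::str_remove_all(., "_[:alpha:]")))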
> df2
               V1        V2        V3
snp     rs7513574 rs1627238 rs1171278
p.value 0.2635489  0.983628 0.6315047

edited Apr 01 '21 at 14:07

answered Apr 01 '21 at 13:47

user438383


is it possible to do this and create a new dataframe of it? for example a duplicate of `df1` but with the underscore and letter removed? – codemachino Apr 01 '21 at 13:48
Yes - if you edit your question to provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of your df then I will edit my answer to show how it's done. You only need to make a small example dataset with a couple of lines. – user438383 Apr 01 '21 at 13:49
created a reproducible example for you, hopefully that helps! – codemachino Apr 01 '21 at 13:54
done - hopefully that is the thing you were looking for? – user438383 Apr 01 '21 at 14:07

score 0 · Answer 3 · answered Apr 01 '21 at 14:09


Try:

library(dplyr)  # provides %>% and mutate()
df2 <- df1 %>% mutate(snp = gsub("_.", "", snp))  # "_." = underscore plus one character
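
For comparison, the same idea can be sketched in base R without any packages. This is an illustration rather than part of the answer above: it assumes the sample data from the question and uses the pattern `"_.*$"`, which removes the underscore and everything after it, not just a single character. Because R copies on modification, assigning into `df2` should leave `df1` as it was:

```r
# Sample data from the question
snp <- c("rs7513574_T", "rs1627238_A", "rs1171278_C")
p.value <- c(2.635489e-01, 9.836280e-01, 6.315047e-01)
df1 <- data.frame(snp, p.value, stringsAsFactors = FALSE)

# Work on a copy so that df1 keeps the original identifiers
df2 <- df1
# Strip the first underscore and all characters that follow it
df2$snp <- sub("_.*$", "", df2$snp)
```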

answered Apr 01 '21 at 14:09

Marcos Pérez


score 0 · Answer 4 · answered Apr 01 '21 at 16:32


Consider creating a copy of the dataset and running `tstrsplit` on the copy, so that the original data is not changed:

library(data.table)
df2 <- copy(df1)
setDT(df2)[,c("snp", "allele") := tstrsplit(snp, "_", fixed = TRUE)]
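
The copy matters because `:=` in data.table updates the table by reference. As an alternative sketch to `copy()` followed by `setDT()`, `as.data.table()` can be used to build a separate data.table from `df1`. This assumes the sample data from the question, with `stringsAsFactors = FALSE` added so that `snp` is a character column on any R version:

```r
library(data.table)

# Sample data from the question
snp <- c("rs7513574_T", "rs1627238_A", "rs1171278_C")
p.value <- c(2.635489e-01, 9.836280e-01, 6.315047e-01)
df1 <- data.frame(snp, p.value, stringsAsFactors = FALSE)

# as.data.table() returns a new data.table; df1 stays a plain data.frame
df2 <- as.data.table(df1)
# := modifies df2 by reference: snp keeps the id, allele gets the letter
df2[, c("snp", "allele") := tstrsplit(snp, "_", fixed = TRUE)]
```

Afterwards `df2` holds the columns `snp`, `p.value` and `allele`, while `df1` still contains the original identifiers.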

answered Apr 01 '21 at 16:32

akrun

