
I have a dataset df1 like so:

snp <- c("rs7513574_T", "rs1627238_A", "rs1171278_C")
p.value <- c(2.635489e-01, 9.836280e-01 , 6.315047e-01  )

df1 <- data.frame(snp, p.value)

I want to remove the underscore `_` and the letters after it (which represent the allele) from the `snp` column of df1, and store the result in a new data frame, df2.

I tried this with the following code:

df2 <- df1[,c("snp", "allele"):=tstrsplit(`snp`, "_", fixed = TRUE)]

However, this changes the df1 data frame. Is there another way to do this?

Gregor Thomas
codemachino
  • Your example of `df1` doesn't work - the line `snp – Gregor Thomas Apr 01 '21 at 14:02
  • Apologies for the messiness. The line `snp – codemachino Apr 01 '21 at 14:04
  • Okay, I've cleaned up your example by adding quotes to the strings, gotten rid of `rbind` which was making things rows when you wanted them to be columns, got rid of `matrix()` which was converting everything to `character`. Could you run the sample data code and verify that it is accurate? – Gregor Thomas Apr 01 '21 at 14:08
  • Also, your question text just says you want to remove the `_` and the letter after it, but your code seems to be attempting to put the letter after it into a new column called `"allele"` - if you want to do that you should mention it in the text. – Gregor Thomas Apr 01 '21 at 14:10
  • That looks great and is an accurate representation of the dataset, thank you! – codemachino Apr 01 '21 at 14:12
  • Quite welcome. Next time, please do test your sample data code before posting it :) It saves time for everyone, especially someone like user438383 who made some assumptions and a couple of answer attempts based on bad input. – Gregor Thomas Apr 01 '21 at 14:13

4 Answers


This is my best guess as to what you want:

library(tidyr)
separate(df1, snp, into = c("snp", "allele"), sep = "_")
#         snp allele   p.value
# 1 rs7513574      T 0.2635489
# 2 rs1627238      A 0.9836280
# 3 rs1171278      C 0.6315047
Gregor Thomas
  • As there is only one delimiter, you could also do without specifying the `sep` `separate(df1, snp, into = c("snp", "allele"))` – akrun Apr 02 '21 at 01:01
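Since the question asks for a new data frame, here is a minimal sketch of the `separate()` approach with the result assigned to `df2`. It assumes the sample data from the question; `df2_ids` is a hypothetical name for the version without the allele column. `separate()` returns a new data frame, so `df1` is not touched.

```r
library(tidyr)

# Sample data from the question
df1 <- data.frame(
  snp = c("rs7513574_T", "rs1627238_A", "rs1171278_C"),
  p.value = c(2.635489e-01, 9.836280e-01, 6.315047e-01)
)

# separate() returns a new data frame; df1 keeps its original snp values
df2 <- separate(df1, snp, into = c("snp", "allele"), sep = "_")

# Hypothetical extra step: keep only the bare rsID and the p-value
df2_ids <- df2[, c("snp", "p.value")]
```

If the allele is never needed, the last step can replace `df2` directly instead of creating a second object.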
# Note: written against the question's original sample data (columns V1, V2, V3)
library(dplyr)
df2 <- df1 %>%
    mutate(across(c(V1, V2, V3), ~stringr::str_remove_all(., "_[:alpha:]")))
df2
#                V1        V2        V3
# snp     rs7513574 rs1627238 rs1171278
# p.value 0.2635489  0.983628 0.6315047
user438383
  • is it possible to do this and create a new dataframe of it? for example a duplicate of `df1` but with the underscore and letter removed? – codemachino Apr 01 '21 at 13:48
  • Yes - if you edit your question to provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) of your df then I will edit my answer to show how it's done. You only need to make a small example dataset with a couple of lines,. – user438383 Apr 01 '21 at 13:49
  • created a reproducible example for you, hopefully that helps! – codemachino Apr 01 '21 at 13:54
  • done - hopefully that is the thing you were looking for? – user438383 Apr 01 '21 at 14:07

Try:

library(dplyr)
df2 <- df1 %>% mutate(snp = gsub("_.", "", snp))
Marcos Pérez
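The same idea also works without any packages. Below is a base R sketch, assuming the sample data from the question. It uses the pattern `"_.*$"` rather than `"_."`, which is my assumption that an allele could be longer than one letter; with single-letter alleles both patterns give the same result.

```r
# Sample data from the question
df1 <- data.frame(
  snp = c("rs7513574_T", "rs1627238_A", "rs1171278_C"),
  p.value = c(2.635489e-01, 9.836280e-01, 6.315047e-01)
)

# Plain assignment of a data.frame behaves like a copy once df2 is modified,
# so changing df2 leaves df1 as it was
df2 <- df1

# Remove the underscore and everything after it
df2$snp <- sub("_.*$", "", df2$snp)
```

This only strips the allele. If the allele should be kept in its own column, one of the splitting approaches in the other answers is a better fit.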

`:=` modifies a data.table by reference, which is why the original data changed. Consider creating a copy of the dataset and running `tstrsplit` on the copy, so that the original data is left untouched:

library(data.table)
df2 <- copy(df1)
setDT(df2)[,c("snp", "allele") := tstrsplit(snp, "_", fixed = TRUE)]
akrun
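To illustrate why the copy matters, here is a small sketch of data.table's reference semantics. It assumes the data starts out as a data.table; the names `dt1`, `dt2`, `alias`, and the `flag` column are hypothetical and only there for the demonstration.

```r
library(data.table)

# Sample data from the question, built directly as a data.table
dt1 <- data.table(
  snp = c("rs7513574_T", "rs1627238_A", "rs1171278_C"),
  p.value = c(2.635489e-01, 9.836280e-01, 6.315047e-01)
)

# copy() makes an independent table, so := on dt2 does not reach dt1
dt2 <- copy(dt1)
dt2[, c("snp", "allele") := tstrsplit(snp, "_", fixed = TRUE)]

# Plain assignment does NOT copy a data.table: alias and dt1 are the same object
alias <- dt1
alias[, flag := TRUE]

"allele" %in% names(dt1)  # FALSE: the change to dt2 stayed in dt2
"flag" %in% names(dt1)    # TRUE: the change through alias also hit dt1
```

So `df2 <- df1` followed by `:=` is not enough when `df1` is a data.table; an explicit `copy()` is what keeps the original intact.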