Split strings into column based on pattern

Question

33467389|t|Immune Therapies for Hematologic Malignancies.
33467389|a|The era of immunotherapy for hematologic malignancies began with the first allogeneic hematopoietic stem cell transplant (HSCT) study published by E [...].
33477248|t|Unraveling the Role of Innate Lymphoid Cells in AcuteMyeloid Leukemia.
33477248|a|Over the past 50 years, few therapeutic advances have been made in treating.

This is the recurring pattern in my file.

Each line starts with an ID, which is a number (for example 33467389). The |t| marker means the rest of the line is the title of the paper. Similarly, 33467389|a| means the rest of the line is the abstract of paper 33467389.

lines <- readLines("output_1/Gemtuzumab_Adult/G1.txt")

I'm reading the file like this.

The pattern repeats throughout my text. Is there a way to split it into columns like this?

ID                              Abstract 
33467389      The era of immunotherapy for hematologic malignancies

Read the file with sep = "|", then drop the 2nd column. – zx8754 Feb 08 '21 at 10:15
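The comment above can be sketched roughly as follows. This is only an illustration: the column names ID, Type and Text are made up here, and the two sample lines are copied from the question. It assumes the title and abstract text never contain a "|" themselves; if they do, read.table will see extra fields and fail.

```r
# Two sample lines from the question, in the "ID|type|text" layout.
lines <- c(
  "33477248|t|Unraveling the Role of Innate Lymphoid Cells in AcuteMyeloid Leukemia.",
  "33477248|a|Over the past 50 years, few therapeutic advances have been made in treating."
)

# quote = "" and comment.char = "" stop apostrophes, double quotes and '#'
# inside the text from being treated as quoting or comments.
# colClasses = "character" keeps the ID as text rather than a number.
raw <- read.table(text = lines, sep = "|", quote = "", comment.char = "",
                  header = FALSE, stringsAsFactors = FALSE,
                  colClasses = "character",
                  col.names = c("ID", "Type", "Text"))  # hypothetical names

# Keep only the abstract rows and drop the 2nd (type) column.
abstracts <- raw[raw$Type == "a", c("ID", "Text")]
names(abstracts) <- c("ID", "Abstract")
```

To read straight from disk, the text = lines argument would be replaced by the file path from the question, "output_1/Gemtuzumab_Adult/G1.txt".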

Tim Biegeleisen · Accepted Answer · 2021-02-08T10:18:14.430

1

Using sub here is one base R option, assuming the lines are held in a data frame column df$text:

df$ID <- sub("\\|.*$", "", df$text)
df$Abstract <- sub("^.*\\|", "", df$text)
df[, c("ID", "Abstract")]

        ID Abstract
1 33467389 Immune Therapies for Hematologic Malignancies.
2 33467389 The era of immunotherapy for hematologic malignancies began with the first allogeneic hematopoietic stem cell transplant (HSCT) study published by E [...].
3 33477248 Unraveling the Role of Innate Lymphoid Cells in AcuteMyeloid Leukemia.
4 33477248 Over the past 50 years, few therapeutic advances have been made in treating.
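The snippet above expects a data frame df with a text column, while the question starts from readLines(). A minimal sketch of bridging the two might look like this; the sample lines are copied from the question, and the final filtering step is an addition here, not part of the answer.

```r
# In practice: lines <- readLines("output_1/Gemtuzumab_Adult/G1.txt")
# Two sample lines from the question stand in for the file contents.
lines <- c(
  "33477248|t|Unraveling the Role of Innate Lymphoid Cells in AcuteMyeloid Leukemia.",
  "33477248|a|Over the past 50 years, few therapeutic advances have been made in treating."
)

# Wrap the character vector in a one-column data frame.
df <- data.frame(text = lines, stringsAsFactors = FALSE)

# Same sub() calls as in the answer: everything before the first "|" is the ID,
# everything after the last "|" is the title or abstract text.
df$ID <- sub("\\|.*$", "", df$text)
df$Abstract <- sub("^.*\\|", "", df$text)

# Optional extra step (an assumption about what is wanted):
# keep only the rows tagged |a|, i.e. the abstracts.
abstracts <- df[grepl("^[0-9]+\\|a\\|", df$text), c("ID", "Abstract")]
```

Note that "^.*\\|" is greedy, so it strips up to the last "|" on the line. That is fine for the sample data, but an abstract that itself contains a "|" would be cut short.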

edited Feb 08 '21 at 10:18

answered Feb 08 '21 at 10:15

Tim Biegeleisen


okay let me try it and update you – PesKchan Feb 08 '21 at 10:16
I'm reading the files like this 'lines – PesKchan Feb 08 '21 at 10:18
@PesKchan Then read your data into a data frame first. Most of your data manipulation should probably be happening with data frames/tables. [See this relevant answer](https://stackoverflow.com/questions/29899155/how-do-you-convert-output-from-readlines-to-data-frame-in-r/33964519). – Tim Biegeleisen Feb 08 '21 at 10:19
okay let me see how to read character files into data frame – PesKchan Feb 08 '21 at 10:22

score 1 · Answer 2 · answered Feb 08 '21 at 10:24

Two tidyverse solutions:

Data

library(dplyr)  # provides %>% and mutate() used in the options below
string <- data.frame(ID = "33467389|a|The era of immunotherapy for hematologic malignancies began with the first allogeneic hematopoietic stem cell transplant (HSCT) study published by E [...].", stringsAsFactors = FALSE)

Option 1

df <- string %>% 
          mutate(IDs = strsplit(ID, "\\|.*\\|")) %>%
                   tidyr::unnest_wider(IDs, names_sep = "_")

Option 2

string %>% 
         tidyr::separate(ID, sep = "\\|.*\\|", into = c("IDs", "Title"))
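Both options discard the t/a marker, so titles and abstracts end up in the same column. If only abstracts are wanted, one possible variation is to split on each "|" and keep the marker as its own column. This is a sketch, not part of the answer above; the column names Type and Text are hypothetical, and the sample lines come from the question.

```r
library(dplyr)
library(tidyr)

string <- data.frame(
  ID = c(
    "33477248|t|Unraveling the Role of Innate Lymphoid Cells in AcuteMyeloid Leukemia.",
    "33477248|a|Over the past 50 years, few therapeutic advances have been made in treating."
  ),
  stringsAsFactors = FALSE
)

result <- string %>%
  # Split on "|" into three pieces; extra = "merge" keeps any further "|"
  # inside the last piece instead of dropping the remainder.
  separate(ID, into = c("IDs", "Type", "Text"), sep = "\\|", extra = "merge") %>%
  # Keep abstract rows only, then rename the text column.
  filter(Type == "a") %>%
  select(IDs, Abstract = Text)
```

Because of extra = "merge", an abstract that happens to contain a "|" should stay intact, which the greedy "\\|.*\\|" separator would not guarantee.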
