33467389|t|Immune Therapies for Hematologic Malignancies.
33467389|a|The era of immunotherapy for hematologic malignancies began with the first allogeneic hematopoietic stem cell transplant (HSCT) study published by E [...].
33477248|t|Unraveling the Role of Innate Lymphoid Cells in AcuteMyeloid Leukemia.
33477248|a|Over the past 50 years, few therapeutic advances have been made in treating.

This pattern recurs throughout my file.

Each record starts with a numeric ID, for example 33467389. The |t| tag marks the title of the paper, and the same ID followed by |a| marks its abstract.

lines <- readLines("output_1/Gemtuzumab_Adult/G1.txt")

I'm reading the file as shown above.

The pattern repeats throughout my text. Is there any way to split it into columns like this?

ID          Abstract
33467389    The era of immunotherapy for hematologic malignancies
PesKchan

2 Answers


One base R option is to use sub():

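df <- data.frame(text = lines, stringsAsFactors = FALSE)  # assumes lines comes from readLines()
df$ID <- sub("\\|.*$", "", df$text)        # drop everything from the first "|" onward
df$Abstract <- sub("^.*\\|", "", df$text)  # drop everything up to and including the last "|"
df[, c("ID", "Abstract")]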

        ID Abstract
1 33467389 Immune Therapies for Hematologic Malignancies.
2 33467389 The era of immunotherapy for hematologic malignancies began with the first allogeneic hematopoietic stem cell transplant (HSCT) study published by E [...].
3 33477248 Unraveling the Role of Innate Lymphoid Cells in AcuteMyeloid Leukemia.
4 33477248 Over the past 50 years, few therapeutic advances have been made in treating.
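The output above keeps titles and abstracts in the same column. If only the abstracts are wanted, as the question's target table suggests, the tag between the pipes can be captured as well. The following is a minimal base R sketch, not part of the original answer; the names `pattern`, `hits`, `parsed`, and `abstracts` are illustrative, and the input is typed in by hand here rather than read from the file.

```r
# Sample lines in the same "ID|tag|text" layout as the question
lines <- c(
  "33467389|t|Immune Therapies for Hematologic Malignancies.",
  "33467389|a|The era of immunotherapy for hematologic malignancies began with the first allogeneic hematopoietic stem cell transplant (HSCT) study published by E [...].",
  "33477248|t|Unraveling the Role of Innate Lymphoid Cells in AcuteMyeloid Leukemia.",
  "33477248|a|Over the past 50 years, few therapeutic advances have been made in treating.",
  ""
)

# Capture the numeric ID, the tag (t or a), and the remaining text
pattern <- "^([0-9]+)\\|([ta])\\|(.*)$"
hits <- regmatches(lines, regexec(pattern, lines))

# Lines that do not match (for example blank lines) come back empty; drop them
hits <- hits[lengths(hits) == 4]

parsed <- data.frame(
  ID   = vapply(hits, `[`, character(1), 2),
  tag  = vapply(hits, `[`, character(1), 3),
  text = vapply(hits, `[`, character(1), 4),
  stringsAsFactors = FALSE
)

# Keep only the abstract rows and rename the text column
abstracts <- parsed[parsed$tag == "a", c("ID", "text")]
names(abstracts) <- c("ID", "Abstract")
abstracts
```

Because the pattern is anchored on the first two pipes, a "|" inside an abstract would stay in the text column instead of truncating it.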
Tim Biegeleisen

Two tidyverse solutions:

Data

library(dplyr)  # provides %>% and mutate()

string <- data.frame(ID = "33467389|a|The era of immunotherapy for hematologic malignancies began with the first allogeneic hematopoietic stem cell transplant (HSCT) study published by E [...].")

Option 1

df <- string %>%
  mutate(IDs = strsplit(ID, "\\|.*\\|")) %>%
  tidyr::unnest_wider(IDs, names_sep = "_")

Option 2

string %>%
  tidyr::separate(ID, sep = "\\|.*\\|", into = c("IDs", "Abstract"))  # the sample row is an |a| (abstract) line
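Both options split a single abstract line. For the whole file, one possible extension is to split on the first two pipes and then spread the tags into columns, giving one row per ID. This is a sketch under assumptions rather than part of the original answer: it assumes every ID has exactly one |t| line and one |a| line, and the names `raw`, `tag`, `body`, and `wide` are illustrative.

```r
library(dplyr)
library(tidyr)

# Sample lines in the same "ID|tag|text" layout as the question
lines <- c(
  "33467389|t|Immune Therapies for Hematologic Malignancies.",
  "33467389|a|The era of immunotherapy for hematologic malignancies began with the first allogeneic hematopoietic stem cell transplant (HSCT) study published by E [...].",
  "33477248|t|Unraveling the Role of Innate Lymphoid Cells in AcuteMyeloid Leukemia.",
  "33477248|a|Over the past 50 years, few therapeutic advances have been made in treating."
)

wide <- data.frame(raw = lines, stringsAsFactors = FALSE) %>%
  # extra = "merge" splits at the first two pipes only, so later pipes stay in body
  separate(raw, into = c("ID", "tag", "body"), sep = "\\|", extra = "merge") %>%
  # One column per tag: "t" holds titles, "a" holds abstracts
  pivot_wider(names_from = tag, values_from = body) %>%
  rename(Title = "t", Abstract = "a")

wide
```

If an ID appears with more than one title or abstract, `pivot_wider()` returns list-columns with a warning, so the one-pair-per-ID assumption is worth checking on the real file.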
Taufi