1

I am relatively new to R and don't know exactly how to phrase my question. Basically, I have a dataframe test that looks like:

PMID     PL           subject
1        Canada       neurology
2        USA          cancer
5        Canada       dermatology
2        USA          respiratory
4        Japan        neurology
2        USA          cancer
5        Canada       cardiovascular

which I want to convert into

PMID      PL        subject
1         Canada    neurology
2         USA       cancer, respiratory
5         Canada    dermatology, cardiovascular
4         Japan     neurology

In essence, each PMID can correlate to multiple subjects, so I want to retain that information. I want only unique PMID rows. I also do want do delete repeat occurrences, however (for instance there are 3 rows of "2" but 2 of them are "cancer." Also, I have other variables too, and each PMID has the same values for each of the other variables (except for subject).

Please advise.

Thanks!

sweetmusicality
  • 929
  • 9
  • 26

2 Answers2

5

Try this by using dplyr

dat%>%group_by(PMID)%>%dplyr::summarise(subject=toString(unique(subject)))
# A tibble: 4 x 2
   PMID                     subject
  <int>                       <chr>
1     1                   neurology
2     2         cancer, respiratory
3     4                   neurology
4     5 dermatology, cardiovascular

2nd approach

dat1=dat[!duplicated((dat)),]
aggregate(dat1$subject, list(dat1$PMID), paste, collapse=",")

EDIT1 : base on your updated data.frame , you should using mutate

dat%>%group_by(PMID)%>%dplyr::mutate(subject=toString(unique(subject)))%>% distinct(PMID, .keep_all = TRUE)


# Groups:   PMID [4]
   PMID     PL                     subject
  <int>  <chr>                       <chr>
1     1 Canada                   neurology
2     2    USA         cancer, respiratory
3     5 Canada dermatology, cardiovascular
4     4  Japan                   neurology
BENY
  • 296,997
  • 19
  • 147
  • 204
1

Here is another option with data.table

library(data.table)
unique(setDT(df1))[, .(subject = toString(subject)), by = PMID]
#   PMID                     subject
#1:    1                   neurology
#2:    2         cancer, respiratory
#3:    5 dermatology, cardiovascular
#4:    4                   neurology
akrun
  • 789,025
  • 32
  • 460
  • 575