Collapsing unique rows but retaining a variable in R

Question

I am relatively new to R and don't know exactly how to phrase my question. Basically, I have a dataframe test that looks like:

PMID     PL           subject
1        Canada       neurology
2        USA          cancer
5        Canada       dermatology
2        USA          respiratory
4        Japan        neurology
2        USA          cancer
5        Canada       cardiovascular

which I want to convert into

PMID      PL        subject
1         Canada    neurology
2         USA       cancer, respiratory
5         Canada    dermatology, cardiovascular
4         Japan     neurology

In essence, each PMID can correspond to multiple subjects, so I want to retain that information. I want only unique PMID rows. However, I also want to delete repeat occurrences (for instance, there are 3 rows with PMID 2, but 2 of them are "cancer"). I have other variables too, and each PMID has the same value for each of those other variables (everything except subject).

Please advise.

Thanks!

BENY · Accepted Answer · 2017-08-08T19:18:20.730

5

Try this using dplyr:

library(dplyr)
dat %>% group_by(PMID) %>% summarise(subject = toString(unique(subject)))
# A tibble: 4 x 2
   PMID                     subject
  <int>                       <chr>
1     1                   neurology
2     2         cancer, respiratory
3     4                   neurology
4     5 dermatology, cardiovascular

Second approach, in base R:

# drop fully duplicated rows, then paste the subjects together per PMID
dat1 <- dat[!duplicated(dat), ]
aggregate(dat1$subject, list(PMID = dat1$PMID), paste, collapse = ", ")
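
If the other columns should survive the base R route as well, one possible variant is to pass them as extra grouping columns. This is only a sketch: it assumes the data frame is called `dat`, holds exactly the sample rows from the question, and that `PL` is constant within each `PMID`.

```r
# Sample data copied from the question (the name `dat` is an assumption)
dat <- data.frame(
  PMID    = c(1, 2, 5, 2, 4, 2, 5),
  PL      = c("Canada", "USA", "Canada", "USA", "Japan", "USA", "Canada"),
  subject = c("neurology", "cancer", "dermatology", "respiratory",
              "neurology", "cancer", "cardiovascular"),
  stringsAsFactors = FALSE
)

# Drop fully duplicated rows so each subject appears once per PMID
dat1 <- dat[!duplicated(dat), ]

# Group by PMID and PL, collapsing the subjects into one string per group
collapsed <- aggregate(dat1["subject"],
                       by  = dat1[c("PMID", "PL")],
                       FUN = toString)
collapsed
```

Note that `aggregate` sorts the result by the grouping columns, so the rows will not come back in the original order.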

EDIT 1: Based on your updated data.frame, you should use mutate instead, so that the other columns are kept:

dat %>%
  group_by(PMID) %>%
  mutate(subject = toString(unique(subject))) %>%
  distinct(PMID, .keep_all = TRUE)


# Groups:   PMID [4]
   PMID     PL                     subject
  <int>  <chr>                       <chr>
1     1 Canada                   neurology
2     2    USA         cancer, respiratory
3     5 Canada dermatology, cardiovascular
4     4  Japan                   neurology
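
For a version that can be pasted and run as is, here is a minimal sketch. It assumes the data frame `dat` contains exactly the sample rows from the question, and it swaps in a slightly different technique: grouping by the constant column `PL` together with `PMID` and using `summarise`, rather than `mutate` plus `distinct`. The `.groups` argument is only available in newer dplyr releases (1.0.0 and later), which is an assumption about your installation.

```r
library(dplyr)

# Sample data copied from the question (the name `dat` is an assumption)
dat <- data.frame(
  PMID    = c(1, 2, 5, 2, 4, 2, 5),
  PL      = c("Canada", "USA", "Canada", "USA", "Japan", "USA", "Canada"),
  subject = c("neurology", "cancer", "dermatology", "respiratory",
              "neurology", "cancer", "cardiovascular"),
  stringsAsFactors = FALSE
)

# Group by the key plus the column that is constant within each PMID,
# then collapse the distinct subjects into one comma-separated string
res <- dat %>%
  group_by(PMID, PL) %>%
  summarise(subject = toString(unique(subject)), .groups = "drop")

res
```

With many constant columns, listing them all in `group_by` gets tedious, in which case the `mutate` plus `distinct` form above is probably the more convenient choice.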

edited Aug 08 '17 at 19:18

answered Aug 07 '17 at 21:51

Hi there, I updated the dataset in my question. I actually have more than 2 variables, and your code retains only those two and drops the other variables (which are constant within the same PMID) – sweetmusicality Aug 08 '17 at 19:08
@sweetmusicality check my updated answer. – BENY Aug 08 '17 at 19:18
thanks so much :) it works! – sweetmusicality Aug 08 '17 at 19:29
@sweetmusicality glad it helped, have a nice day – BENY Aug 08 '17 at 19:31

score 1 · Answer 2 · answered Aug 08 '17 at 03:07

Here is another option with data.table

library(data.table)
unique(setDT(df1))[, .(subject = toString(subject)), by = PMID]
#   PMID                     subject
#1:    1                   neurology
#2:    2         cancer, respiratory
#3:    5 dermatology, cardiovascular
#4:    4                   neurology
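
The one-liner above returns only PMID and subject. To also keep PL, a possible sketch is to add it to `by`. This assumes `df1` holds exactly the sample rows from the question and that PL is constant within each PMID.

```r
library(data.table)

# Sample data copied from the question (the name `df1` follows this answer)
df1 <- data.table(
  PMID    = c(1, 2, 5, 2, 4, 2, 5),
  PL      = c("Canada", "USA", "Canada", "USA", "Japan", "USA", "Canada"),
  subject = c("neurology", "cancer", "dermatology", "respiratory",
              "neurology", "cancer", "cardiovascular")
)

# Remove fully duplicated rows, then collapse subjects per PMID/PL pair
res <- unique(df1)[, .(subject = toString(subject)), by = .(PMID, PL)]
res
```

Grouping with `by` keeps the groups in order of first appearance, so the PMIDs come back as 1, 2, 5, 4, matching the layout asked for in the question.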
