I have a data frame consisting of an ID column, a clones column and an Isolate column.

Each ID appears multiple times in the ID column and is associated with different clones in the clones column (named clone1, clone2, clone3, etc.), which come from distinct isolates. Each ID may also have the same clone multiple times.

e.g.

ID  clones  Isolate
ID1 clone1    1
ID1 clone1    2 
ID1 clone1    3 
ID2 clone1    4
ID2 clone1    5
ID2 clone2    6
ID2 clone2    7
ID3 clone1    8
ID3 clone1    9
ID3 clone2    10
ID3 clone3    11
ID3 clone3    12

For each unique ID, I want to select one representative of each clone at random.

I expect to get an output like this:

ID  clones   Isolate
ID1 clone1      2
ID2 clone1      5
ID2 clone2      6
ID3 clone1      8
ID3 clone2     10
ID3 clone3     12

That is, one representative row for each clone within each ID, chosen at random, so the values in the Isolate column are random picks.

1 Answer

It seems like you can use the results of a similar question asked just now: How to use R to identify twins, and then randomly select and remove one?

If you group by ID and clones with dplyr's group_by() and then call sample_n(1), you should get exactly one randomly chosen row for each ID and clone pair. Borrowing from @Andrew Gustar's answer:

library(dplyr)

df %>% 
  group_by(ID, clones) %>% 
  sample_n(1)
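
To make this easy to try, here is a minimal reproducible sketch built on the example data from the question. It assumes dplyr 1.0.0 or later, where slice_sample() supersedes sample_n(). The set.seed() call and the name `picked` are my own additions for illustration, not part of the question.

```r
library(dplyr)

# Example data copied from the question
df <- data.frame(
  ID      = rep(c("ID1", "ID2", "ID3"), times = c(3, 4, 5)),
  clones  = c("clone1", "clone1", "clone1",
              "clone1", "clone1", "clone2", "clone2",
              "clone1", "clone1", "clone2", "clone3", "clone3"),
  Isolate = 1:12
)

# Assumption: the seed is fixed only so that the random draw is repeatable
set.seed(42)

picked <- df %>%
  group_by(ID, clones) %>%    # one group per ID/clone pair
  slice_sample(n = 1) %>%     # draw one row at random within each group
  ungroup()                   # drop the grouping before further use

print(picked)
```

Each run without a fixed seed may return different Isolate values, but there will always be one row per ID and clone pair (six rows for this data).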
Mike
  • If you think this is a duplicate of an existing question, it's better to flag it as such instead of adding a duplicate answer – camille Jul 24 '19 at 21:10