I have a dataset with two columns, "Gene ID" and "GO term". Each gene ID can be linked to multiple GO terms and each GO term can be found in multiple gene IDs of the "Gene ID" column. i.e., there are multiple gene IDs and multiple GO terms in both columns
Gene ID GO term
1 ENSRNA049995838
2 VIT_00s0131g00010 GO:0005544
3 VIT_00s0131g00010 GO:0005509
4 VIT_00s0131g00010 GO:0005737
5 VIT_00s0131g00010 GO:0005509
6 VIT_00s0181g00030
7 VIT_00s0181g00180 GO:0016021
8 VIT_00s0181g00180 GO:0009941
9 VIT_00s0181g00180 GO:0009522
10 VIT_00s0181g00180 GO:0016168
11 VIT_00s0181g00180 GO:0009523
12 VIT_00s0181g00180 GO:0018298
13 VIT_00s0181g00180 GO:0009768
14 VIT_00s0181g00180 GO:0009416
15 VIT_00s0181g00180 GO:0009535
16 VIT_00s0181g00180 GO:0010287
17 VIT_00s0181g00180 GO:0016020
18 VIT_00s0181g00180 GO:0009507
19 VIT_00s0181g00180 GO:0009536
20 VIT_00s0181g00180 GO:0015979
21 VIT_00s0181g00180 GO:0009765
22 VIT_00s0181g00180 GO:0009535
23 VIT_00s0181g00180 GO:0009579
24 VIT_00s0207g00010 GO:0005634
25 VIT_00s0207g00010 GO:0016747
26 VIT_00s0207g00010 GO:0009567
27 VIT_00s0207g00010 GO:0005737
28 VIT_00s0207g00010 GO:0016747
29 VIT_00s0371g00100 GO:0008270
30 VIT_00s0371g00100 GO:0016491
I want to transform the data to wide format, with each GO term as one variable, and all gene ID linked to this GO term will be listed in the second column of the same cell, and the third column will be count of the number of gene ID linked to this GO term, like below:
GO term Gene ID Number of Gene ID
GO:0005544 VIT_00s0131g00010 1
GO:0016747 VIT_00s0207g00010, VIT_00s0207g00010 2
GO:0005509 VIT_00s0131g00010, VIT_00s0131g00010 2
I tried tidyr's function spread(), and pivot.wider() and tried to convert the table from long to wide. However they did not produce results as I wanted like above.
Can somebody suggest what I should do? Thanks!