
I have a tsv file with data from some event participants. Here is a small snippet from it:

...
sub-09          37   F    19780726   20160328    20160329
sub-10          38   F    19780208   20160406    20160407
sub-11          39   M    19770511   20160704    20160705
...
sub-42          37   F    19780726   20160328    20160329
...

Note that sub-09 and sub-42 are duplicates.

In BASH, how can I find duplicate lines while ignoring the first (or, in general, any other) column? I've seen similar threads, e.g., this one, but I couldn't find an answer that fits. Ideally I would get both occurrences of every duplicate, as in:

Expected output:

sub-09          37   F    19780726   20160328    20160329
sub-42          37   F    19780726   20160328    20160329
dangom

3 Answers


Use uniq -d to show duplicates. Use its -f option to skip fields. As uniq needs the input sorted, first sort on everything after the first column:

sort -k2 file | uniq -f1 -d

(A plain lexicographic sort is used rather than -n here: with a numeric key, lines that tie on the number are ordered by a whole-line comparison that includes the first column, which can separate otherwise-identical remainders before uniq sees them.)

Use -D instead of -d if you want all the duplicates.
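Putting it together on the sample data above, a minimal sketch (the file name participants.tsv is illustrative, and -D for printing all duplicates is a GNU uniq extension):

```shell
# Recreate the sample TSV (illustrative file name)
printf 'sub-09\t37\tF\t19780726\t20160328\t20160329\n' >  participants.tsv
printf 'sub-10\t38\tF\t19780208\t20160406\t20160407\n' >> participants.tsv
printf 'sub-11\t39\tM\t19770511\t20160704\t20160705\n' >> participants.tsv
printf 'sub-42\t37\tF\t19780726\t20160328\t20160329\n' >> participants.tsv

# Sort on field 2 onward so identical remainders become adjacent,
# then let uniq skip the first field and print every duplicate line
dups=$(sort -k2 participants.tsv | uniq -f1 -D)
printf '%s\n' "$dups"
```

Because uniq -f1 skips the first whitespace-separated field when comparing, the sub-09 and sub-42 lines are reported even though their first columns differ.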

choroba

Here is an awk-based solution that avoids sorting the file (sorting can be expensive for a large file):

awk '{
   p = $1                     # save the first column
   $1 = ""                    # blank it out so $0 becomes the comparison key
   freq[$0]++                 # count occurrences of each key
   col1[$0, freq[$0]] = p     # remember the first-column value of each occurrence
}
END {
   for (i in freq)
      for (j = 1; freq[i] > 1 && j <= freq[i]; j++)   # only keys seen more than once
         print col1[i, j] i
}' file

sub-09 37 F 19780726 20160328 20160329
sub-42 37 F 19780726 20160328 20160329
anubhava
This reads the file twice: the first pass blanks out the first column and counts each resulting key; the second pass prints the original lines whose key occurs at least twice:

awk 'FNR==NR{$1="";a[$0]++;next}{s=$0;$1="";if(a[$0]>=2) print s}' file file
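If the column to ignore is not the first one, this two-pass approach generalizes by passing the field number in as a variable. A sketch under the same assumptions (the file name participants.tsv and the skip variable are illustrative):

```shell
# Recreate the sample TSV (illustrative file name)
printf 'sub-09\t37\tF\t19780726\t20160328\t20160329\n' >  participants.tsv
printf 'sub-10\t38\tF\t19780208\t20160406\t20160407\n' >> participants.tsv
printf 'sub-11\t39\tM\t19770511\t20160704\t20160705\n' >> participants.tsv
printf 'sub-42\t37\tF\t19780726\t20160328\t20160329\n' >> participants.tsv

# skip=N chooses which column to ignore when comparing lines
dups=$(awk -v skip=1 '
FNR == NR { $skip = ""; count[$0]++; next }              # pass 1: tally keys
{ line = $0; $skip = ""; if (count[$0] > 1) print line } # pass 2: print repeats
' participants.tsv participants.tsv)

printf '%s\n' "$dups"
```

Unlike the blank-and-reprint one-liner, this prints the matching lines exactly as they appear in the file, with their original whitespace intact.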
zxy