
I have a series of text files for which I'd like to know the lines in common rather than the lines which are different between them. Command line Unix or Windows is fine.

File foo:

linux-vdso.so.1 =>  (0x00007fffccffe000)
libvlc.so.2 => /usr/lib/libvlc.so.2 (0x00007f0dc4b0b000)
libvlccore.so.0 => /usr/lib/libvlccore.so.0 (0x00007f0dc483f000)
libc.so.6 => /lib/libc.so.6 (0x00007f0dc44cd000)

File bar:

libkdeui.so.5 => /usr/lib/libkdeui.so.5 (0x00007f716ae22000)
libkio.so.5 => /usr/lib/libkio.so.5 (0x00007f716a96d000)
linux-vdso.so.1 =>  (0x00007fffccffe000)

So, given these two files above, the output of the desired utility would be akin to file1:line_number, file2:line_number == matching text (just a suggestion; I really don't care what the syntax is):

foo:1, bar:3 == linux-vdso.so.1 =>  (0x00007fffccffe000)
matt wilkie

  • Another similar question with good answers: http://unix.stackexchange.com/questions/1079/output-the-common-lines-similarities-of-two-text-files-the-opposite-of-diff – MortezaE Sep 25 '15 at 08:58

8 Answers


On *nix, you can use comm. The answer to the question is:

comm -1 -2 file1.sorted file2.sorted
# where file1.sorted and file2.sorted are sorted copies of file1 and file2
# (e.g. created with `sort file1 > file1.sorted`)

Here's the full usage of comm:

comm [-1] [-2] [-3] file1 file2
  -1  Suppress the output column of lines unique to file1.
  -2  Suppress the output column of lines unique to file2.
  -3  Suppress the output column of lines common to file1 and file2.

Also note that it is important to sort the files before using comm, as mentioned in the man pages.
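A minimal end-to-end sketch (file names and contents here are invented for illustration):

```shell
# Create two throwaway, unsorted files.
printf 'b\na\nc\n' > file1
printf 'c\nd\nb\n' > file2

# comm requires sorted input, so sort into temporary copies first.
sort file1 > file1.sorted
sort file2 > file2.sorted

# -1 -2 suppresses the "unique to file1" and "unique to file2" columns,
# leaving only the lines common to both files.
comm -1 -2 file1.sorted file2.sorted
# prints:
# b
# c
```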

Dan Lew

  • In the "there's more than one way to skin a cat" department, `diff --unchanged-line-format='%L' --old-line-format='' --new-line-format=''` should produce identical output if, for some reason, comm is not available. – user3396385 Oct 11 '16 at 15:23
  • For people wondering: you can create `file1.sorted` by executing `sort file1 > file1.sorted` – los_floppos Feb 22 '22 at 13:31

I found this answer on a question listed as a duplicate. I find grep to be more administrator-friendly than comm, so if you just want the set of matching lines (useful for comparing CSV files, for instance) simply use

grep -F -x -f file1 file2

Or the equivalent fgrep shorthand (deprecated in favor of grep -F in modern GNU grep):

fgrep -xf file1 file2

Plus, you can use file2* to glob and look for lines in common with multiple files, rather than just two.

Some other handy variations include

  • -n flag to show the line number of each matched line
  • -c to only count the number of lines that match
  • -v to display only the lines in file2 that differ (or use diff).

Using comm is faster, but that speed comes at the cost of having to sort your files first, which makes it less convenient as a quick 'reverse diff'.
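A quick sketch with throwaway files (names and contents invented here), showing plain matching and the -n variant:

```shell
printf 'alpha\nbeta\ngamma\n' > file1
printf 'gamma\ndelta\nalpha\n' > file2

# Whole-line (-x), fixed-string (-F) matches, using file1's lines as
# patterns (-f). No sorting needed; output follows file2's order.
grep -F -x -f file1 file2
# prints:
# gamma
# alpha

# -n prefixes each match with its line number in file2.
grep -n -F -x -f file1 file2
# prints:
# 1:gamma
# 3:alpha
```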

Ryder

  • I felt the need to come back and clarify the use of the `-v` flag after I slipped up with it myself. Say you have two csv files file1 and file2, and they have both overlapping and non-overlapping rows. If you want all and only the non-overlapping rows, using `fgrep -v file1 file2` will only return the non-overlapping rows in file2, *and none of the additional non-overlapping rows in file1*. This may be obvious to some, but better to state the obvious than risk misinterpretation. In this particular case, sorting the files and using `comm` is still the better choice. – Ryder May 12 '15 at 08:44
  • Another complication when using `grep`: any blank line in the first file will match every line in the second file. Make sure `file1` has no blank lines, or it will look like the files are identical. – Christopher Schultz Jul 22 '15 at 14:11

It was asked here before: Unix command to find lines common in two files

You could also try with Perl (credit goes here):

perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/'  file1 file2

(While file1 is being read, @ARGV still holds one pending filename, so each line appends a "1" to its history string in %seen; while file2 is read, @ARGV is empty and a "0" is appended. A history ending in "10" therefore means the line appeared in file1 and is now appearing in file2 for the first time, so it is printed.)
ChristopheD

I just learned the comm command from the answers, but I wanted to add something extra: if the files are not sorted, and you don't want to touch the original files, you can pipe in the output of the sort command via process substitution. This leaves the original files intact. It works in Bash and other shells that support process substitution.

comm -1 -2 <(sort file1) <(sort file2)

This can be extended to compare command output, instead of files:

comm -1 -2 <(ls /dir1 | sort) <(ls /dir2 | sort)
Greg Mueller

The easiest way to do it is:

awk 'NR==FNR{a[$0]++; next} a[$0]' file1 file2

The files do not need to be sorted. (Use $0 rather than $1 so that entire lines are compared, not just the first whitespace-separated field.)
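A quick check of the idiom with invented sample files (using $0 so whole lines, not just the first field, are compared):

```shell
printf 'linux-vdso.so.1 ok\nlibvlc only\n' > file1
printf 'other line\nlinux-vdso.so.1 ok\n' > file2

# First pass (NR==FNR) records every line of file1 in the seen[] array;
# second pass prints a file2 line whenever it was recorded.
awk 'NR==FNR {seen[$0]++; next} seen[$0]' file1 file2
# prints:
# linux-vdso.so.1 ok
```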

Gopu

  • This is unlike most of the answers here in that it allows you to reconstruct source templates. I have two files built from the same wrapper, with different text inserted at a few points. This answer enabled me to recover the wrapper. – Lucas Gonze Aug 03 '17 at 21:54
  • Explanation can be found in this question https://stackoverflow.com/q/32481877 or in the Idiomatic AWK blog referenced from one of its comments. – Tomáš Záluský Apr 08 '22 at 07:13

Just for information, I made a little tool for Windows that does the same thing as "grep -F -x -f file1 file2" (as I haven't found anything equivalent to this command on Windows).

Here it is: http://www.nerdzcore.com/?page=commonlines

Usage is "CommonLines inputFile1 inputFile2 outputFile"

Source code is also available (GPL).


In Windows, you can use a PowerShell script with Compare-Object:

Compare-Object -IncludeEqual -ExcludeDifferent -PassThru (Get-Content A.txt) (Get-Content B.txt) > MATCHING.txt # find matching lines

Compare-Object flag behavior:

  • -IncludeEqual without -ExcludeDifferent: everything
  • -ExcludeDifferent without -IncludeEqual: nothing
Shrike

I think the diff utility itself, using its unified (-U) option, can be used to achieve this effect. Because the first column of diff's output marks whether a line is an addition, a deletion, or unchanged, we can look for the lines that haven't changed.

diff -U1000 file_1 file_2 | grep '^ '

The number 1000 is chosen arbitrarily, big enough to be larger than any single hunk of diff output.

Here's the full, foolproof set of commands:

f1="file_1"
f2="file_2"

lc1=$(wc -l < "$f1")
lc2=$(wc -l < "$f2")
lcmax=$(( lc1 > lc2 ? lc1 : lc2 ))

diff -U$lcmax "$f1" "$f2" | grep '^ ' | less

# Alternatively, use this grep to ignore the lines starting
# with +, -, and @ signs.
#   grep -vE '^[+@-]'

If you want to include the lines that are just moved around, you can sort the input before diffing, like so:

f1="file_1"
f2="file_2"

lc1=$(wc -l < "$f1")
lc2=$(wc -l < "$f2")
lcmax=$(( lc1 > lc2 ? lc1 : lc2 ))

diff -U$lcmax <(sort "$f1") <(sort "$f2") | grep '^ ' | less
Gurjeet Singh