31

I want to compare two binary files. One of them is already stored on the server with a pre-calculated CRC32 in the database from when I stored it originally.

I know that if the CRC is different, then the files are definitely different. However, if the CRC is the same, I don't know that the files are. So, I'm looking for a nice efficient way of comparing the two streams: one from the posted file and one from the file system.

I'm not an expert on streams, but I'm well aware that I could easily shoot myself in the foot here as far as memory usage is concerned.

Palec
  • 11,499
  • 7
  • 57
  • 127
Simon Farrow
  • 1,861
  • 2
  • 20
  • 26

6 Answers6

43
static bool FileEquals(string fileName1, string fileName2)
{
    // Check the file size and CRC equality here.. if they are equal...    
    using (var file1 = new FileStream(fileName1, FileMode.Open))
        using (var file2 = new FileStream(fileName2, FileMode.Open))
            return FileStreamEquals(file1, file2);
}

static bool FileStreamEquals(Stream stream1, Stream stream2)
{
    const int bufferSize = 2048;
    byte[] buffer1 = new byte[bufferSize]; //buffer size
    byte[] buffer2 = new byte[bufferSize];
    while (true) {
        int count1 = stream1.Read(buffer1, 0, bufferSize);
        int count2 = stream2.Read(buffer2, 0, bufferSize);

        if (count1 != count2)
            return false;

        if (count1 == 0)
            return true;

        // You might replace the following with an efficient "memcmp"
        if (!buffer1.Take(count1).SequenceEqual(buffer2.Take(count2)))
            return false;
    }
}
mmx
  • 402,675
  • 87
  • 836
  • 780
  • 6
    Require conunt1 == count2 could be inaccurate, as Stream.Read is free to return a block that has a length less than requested byte count. see http://msdn.microsoft.com/en-us/library/vstudio/system.io.stream.read(v=vs.100).aspx – Karata Nov 18 '14 at 08:49
  • Thanks for the solution Mehrdad. Do you need the Take calls? I tried only `if (!buffer1.SequenceEqual(buffer2))` and it seems to work. – Ozgur Ozturk Jan 08 '15 at 15:22
  • 1
    @Ozgur it works but it is less efficient and not very principled IMO. – mmx Jan 08 '15 at 21:27
  • I agree with to comment by @Karata. You should make sure that the buffer is filled as much as possible by either repeatedly calling Stream.Read(), or use StreamReader.ReadBlock(). – Palec Feb 18 '16 at 13:55
  • @Palec I agree. For a FileStream that's generally not a problem though. I will rename the function to reflect that. – mmx Feb 18 '16 at 22:03
  • 1
    [The docs](https://msdn.microsoft.com/en-us/library/system.io.filestream.read%28v=vs.110%29.aspx) say it may be a problem even for FileStream. Do you mean it is *usually* not a problem? Or are the docs misleading? – Palec Feb 18 '16 at 22:53
  • @Palec Well, I stand corrected then--if the docs say that for FileStream, that's the contract, even if it doesn't fail for local files. Will update the code. – mmx Feb 19 '16 at 21:29
  • 2
    @Karata is absolutely right, and this faulty code should not stand uncorrected (somebody may use it on a piece of software I use later). At the very least `FileStreamEquals` should take two properly typed `FileStream` arguments; a *weak* case can *probably* be made that *usually* a `Read` request for n bytes from a file indeed reads n bytes if nothing went wrong. But would you bet your life (or your company) on every contingency? What about network mapped drives? Named pipes? – Peter - Reinstate Monica Apr 27 '16 at 08:12
  • Comment in February: *"Will update the code"* It's now September. When do you plan to fix the error in the above? – T.J. Crowder Sep 10 '16 at 14:18
  • As other said the (count1 != count2) is wrong. it works for FileStream in many devices but its so deceptive and dangerous. I recommended down vote – Mohammad Nikravan Feb 22 '20 at 00:41
22

I sped up the "memcmp" by using a Int64 compare in a loop over the read stream chunks. This reduced time to about 1/4.

    private static bool StreamsContentsAreEqual(Stream stream1, Stream stream2)
    {
        const int bufferSize = 2048 * 2;
        var buffer1 = new byte[bufferSize];
        var buffer2 = new byte[bufferSize];

        while (true)
        {
            int count1 = stream1.Read(buffer1, 0, bufferSize);
            int count2 = stream2.Read(buffer2, 0, bufferSize);

            if (count1 != count2)
            {
                return false;
            }

            if (count1 == 0)
            {
                return true;
            }

            int iterations = (int)Math.Ceiling((double)count1 / sizeof(Int64));
            for (int i = 0; i < iterations; i++)
            {
                if (BitConverter.ToInt64(buffer1, i * sizeof(Int64)) != BitConverter.ToInt64(buffer2, i * sizeof(Int64)))
                {
                    return false;
                }
            }
        }
    }
Lars
  • 647
  • 8
  • 19
  • Is this advantageous only for 64-bit CPUs or will this help on 32-bit CPUs as well? – Pretzel Jan 24 '13 at 15:31
  • It should not matter if you have a 32-bit or 64-bit operating system running. But I never tried it on a pure 32-bit CPU. You have to try it and maybe just change Int64 to int32. But aren't most of more or less modern CPUs capable of 64-bit operations (x86 since 2004)? Go ahead and try it! – Lars Jan 25 '13 at 08:56
  • See comments on [this answer](http://stackoverflow.com/a/968980/157247). Relying on `count1` equalling `count2` is not reliable. – T.J. Crowder Sep 10 '16 at 14:14
  • Does this handle the last 0-7 bytes correctly? Did you test it on two files where `fileSize % sizeof(Int64) > 0` and only the last byte is different? – Dan Bechard May 02 '17 at 13:40
9

This is how I would do it if you didn't want to rely on crc:

    /// <summary>
    /// Binary comparison of two files
    /// </summary>
    /// <param name="fileName1">the file to compare</param>
    /// <param name="fileName2">the other file to compare</param>
    /// <returns>a value indicateing weather the file are identical</returns>
    public static bool CompareFiles(string fileName1, string fileName2)
    {
        FileInfo info1 = new FileInfo(fileName1);
        FileInfo info2 = new FileInfo(fileName2);
        bool same = info1.Length == info2.Length;
        if (same)
        {
            using (FileStream fs1 = info1.OpenRead())
            using (FileStream fs2 = info2.OpenRead())
            using (BufferedStream bs1 = new BufferedStream(fs1))
            using (BufferedStream bs2 = new BufferedStream(fs2))
            {
                for (long i = 0; i < info1.Length; i++)
                {
                    if (bs1.ReadByte() != bs2.ReadByte())
                    {
                        same = false;
                        break;
                    }
                }
            }
        }

        return same;
    }
wpp
  • 6,747
  • 3
  • 32
  • 63
JonPen
  • 87
  • 1
  • 1
4

The accepted answer had an error that was pointed out, but never corrected: stream read calls are not guaranteed to return all bytes requested.

BinaryReader ReadBytes calls are guaranteed to return as many bytes as requested unless the end of the stream is reached first.

The following code takes advantage of BinaryReader to do the comparison:

    static private bool FileEquals(string file1, string file2)
    {
        using (FileStream s1 = new FileStream(file1, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (FileStream s2 = new FileStream(file2, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (BinaryReader b1 = new BinaryReader(s1))
        using (BinaryReader b2 = new BinaryReader(s2))
        {
            while (true)
            {
                byte[] data1 = b1.ReadBytes(64 * 1024);
                byte[] data2 = b2.ReadBytes(64 * 1024);
                if (data1.Length != data2.Length)
                    return false;
                if (data1.Length == 0)
                    return true;
                if (!data1.SequenceEqual(data2))
                    return false;
            }
        }
    }
Larry
  • 277
  • 3
  • 6
3

if you change that crc to a sha1 signature the chances of it being different but with the same signature are astronomicly small

albertjan
  • 7,561
  • 6
  • 43
  • 73
  • You should never rely on that in most serious apps. It's like just checking the hash in a hashtable lookup without comparing the actual keys! – mmx Jun 09 '09 at 09:14
  • 1
    unfortunately you can guarantee that the one time it messes up will be absolutely critical, probably that one big pitch. – Simon Farrow Jun 09 '09 at 09:17
  • @Simon - hehe very true. @Mehrdad - No probably not but it would greatly reduce the times you'd have to check to be super uber sure. – albertjan Jun 09 '09 at 09:22
  • Take the CRC and say file size and the changes are ever smaller. – kenny Apr 14 '10 at 12:34
  • 2
    @MehrdadAfshari a rather serious app like git relies on exactly this. To quote Linus Torvalds we will "quite likely never ever see it in [collision of two files by comparing sha's] the full history of the universe". Cf. http://stackoverflow.com/questions/9392365/how-would-git-handle-a-sha-1-collision-on-a-blob. – Maate May 08 '13 at 17:59
  • @Maate Git does not rely on SHA1s to verify single file equality (which is a much more likely scenario) though, but yeah, sometimes it makes sense. – mmx May 08 '13 at 18:25
2

You can check the length and dates of the two files even before checking the CRC to possibly avoid the CRC check.

But if you have to compare the entire file contents, one neat trick I've seen is reading the bytes in strides equal to the bitness of the CPU. For example, on a 32 bit PC, read 4 bytes at a time and compare them as int32's. On a 64 bit PC you can read 8 bytes at a time. This is roughly 4 or 8 times as fast as doing it byte by byte. You also would probably wanna use an unsafe code block so that you could use pointers instead of doing a bunch of bit shifting and OR'ing to get the bytes into the native int sizes.

You can use IntPtr.Size to determine the ideal size for the current processor architecture.

Josh
  • 66,410
  • 14
  • 139
  • 154