In this article, we will learn how to compare two files in C#.

There are several reasons why we might want to compare two files in our code, such as integrity checks, detecting duplicate files, version control, synchronization, and more. But what are the different ways we can compare two files using C#, and why is it not enough to compare using the file name and size? 

To download the source code for this article, you can visit our GitHub repository.

Let’s answer those questions.

Set Up Two Files

In this section, we will prepare two text files that we will use in the following sections to investigate different file comparison approaches. We name our first file hello-world.txt:

hello, world 1

The relative path of our first file is files/batch1/hello-world.txt. Next, we give our second file the same name as the first file, hello-world.txt. However, unlike the first file, its relative path is files/batch2/hello-world.txt. Giving the two files the same name will help us in later sections to understand the flaws of some of the approaches. Finally, let’s inspect the content of the second file:

hello, world 2
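
If we prefer to create the sample files from code rather than by hand, a minimal sketch could look like the following. It assumes the relative paths described above and reuses the firstFilePath and secondFilePath variable names that appear in the later calls; writing the files with File.WriteAllText() is our assumption here, not necessarily how the files in the repository were created.

```csharp
using System;
using System.IO;

// Relative paths of the two sample files described in this section.
var firstFilePath = Path.Combine("files", "batch1", "hello-world.txt");
var secondFilePath = Path.Combine("files", "batch2", "hello-world.txt");

// Create the parent directories if they do not exist yet.
Directory.CreateDirectory(Path.Combine("files", "batch1"));
Directory.CreateDirectory(Path.Combine("files", "batch2"));

// Same file name, same length, but the last character differs.
File.WriteAllText(firstFilePath, "hello, world 1");
File.WriteAllText(secondFilePath, "hello, world 2");

// Both files contain 14 single-byte characters, so their sizes match.
Console.WriteLine(new FileInfo(firstFilePath).Length == new FileInfo(secondFilePath).Length); // True
```

The last line confirms the property we rely on later: the two files are the same size even though their content differs.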

Compare Two Files With File Name and Size

Checking file name and size is the most straightforward approach to comparing two files. In C#, we can use the FileInfo class to compare the file name and size:

public static bool CompareByNameAndSize(string firstFilePath, string secondFilePath, StringComparison fileNameComparisonType)
{
    var fileInfo1 = new FileInfo(firstFilePath);
    var fileInfo2 = new FileInfo(secondFilePath);

    if (fileInfo1.Length != fileInfo2.Length)
    {
        return false;
    }

    return fileInfo1.Name.Equals(fileInfo2.Name, fileNameComparisonType);
}

This approach is useful for quickly checking the similarity of files if we know that file names are unique. It is important to note that if two files have different sizes, they are different. However, having the same size does not mean they are the same; therefore, this approach is unreliable. For example, let’s consider two files we prepared in the setup section. They have the same name (hello-world.txt) and size but different content.

To learn more about the FileInfo class, be sure to check out our article File and FileInfo Class Comparation in C#.

Let’s call the CompareByNameAndSize() method and pass these two files as arguments:

Console.WriteLine($"CompareByNameAndSize: {FileComparer.CompareByNameAndSize(firstFilePath, secondFilePath, StringComparison.OrdinalIgnoreCase)}");

Let’s inspect the output:

CompareByNameAndSize: True

The CompareByNameAndSize() method returns true because the file name and size are the same, which is not the desired result, as the files’ content is different. So, let’s look at different approaches to address this issue.

Compare Two Files Based on Their Bytes

In this section, we will compare two files by iterating over their byte content and checking if all the bytes are identical. We start by checking if the file sizes are the same. If they are not, we return false. However, if the file sizes are the same, we read the bytes of both files and compare them byte by byte. If all the bytes are the same, we return true. Otherwise, we return false.

Let’s implement this approach:

public static bool CompareByBytes(string firstFilePath, string secondFilePath)
{
    const int bufferSize = 1024;
    using var stream1 = new FileStream(firstFilePath, FileMode.Open, FileAccess.Read);
    using var stream2 = new FileStream(secondFilePath, FileMode.Open, FileAccess.Read);

    if (stream1.Length != stream2.Length)
    {
        return false;
    }

    Span<byte> buffer1 = new byte[bufferSize];
    Span<byte> buffer2 = new byte[bufferSize];

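    while (true)
    {
        // Read() may return fewer bytes than requested, so we use ReadAtLeast()
        // to fill the whole buffer unless the end of the stream is reached.
        var bytesRead1 = stream1.ReadAtLeast(buffer1, bufferSize, throwOnEndOfStream: false);
        var bytesRead2 = stream2.ReadAtLeast(buffer2, bufferSize, throwOnEndOfStream: false);

        if (bytesRead1 != bytesRead2)
        {
            return false;
        }

        if (bytesRead1 == 0)
        {
            return true;
        }

        // Compare only the bytes that were actually read in this iteration.
        if (!buffer1[..bytesRead1].SequenceEqual(buffer2[..bytesRead2]))
        {
            return false;
        }
    }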
}

In our CompareByBytes() method, we first check if the file sizes are the same. If they are not, we return false. Then, we read chunks of bytes from both files and compare them. If the chunks are equal, we continue with the next chunk. Otherwise, we return false. Making the comparison in chunks allows us to compare large files without loading the entire file into memory. Moreover, as soon as we find a chunk that differs, we can return false without reading the rest of the file.

The SequenceEqual() method is an extension method from the MemoryExtensions class in the System namespace that tells us whether the two chunks of bytes are equal. Let’s call the CompareByBytes() method and pass the files from the previous section as arguments:

Console.WriteLine($"CompareByBytes: {FileComparer.CompareByBytes(firstFilePath, secondFilePath)}");

Let’s examine the output:

CompareByBytes: False

The CompareByBytes() method returns false because the files’ content is different. This approach is more reliable than comparing file names and sizes. However, it has to read the content of both files, which makes it slower than the name and size check, as we will see in the benchmark section.

Compare Two Files Based on Their Checksum

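A file checksum is a fixed-size value computed from a file’s content. Even a tiny change in the file content results in a different checksum value, so in practice two files with the same checksum almost certainly have the same content. Because the checksum is shorter than the content it summarizes, two different files can, in rare cases, produce the same value (a collision). In this section, we will look at how we can implement checksum-based file comparison in C#.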
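
To illustrate how sensitive a checksum is to small changes, here is a minimal sketch that hashes the contents of our two sample files directly from memory. The variable names are chosen for illustration only, and we use the MD5.HashData(byte[]) overload rather than the stream-based one used later in this section.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// The two sample contents differ by a single character.
var content1 = Encoding.UTF8.GetBytes("hello, world 1");
var content2 = Encoding.UTF8.GetBytes("hello, world 2");

// MD5 always produces a 16-byte hash, regardless of the input size.
var checksum1 = MD5.HashData(content1);
var checksum2 = MD5.HashData(content2);

// Print both hashes as hex strings so we can compare them visually.
Console.WriteLine(Convert.ToHexString(checksum1));
Console.WriteLine(Convert.ToHexString(checksum2));

// A one-character change in the input yields a different hash.
Console.WriteLine(checksum1.AsSpan().SequenceEqual(checksum2.AsSpan())); // False
```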

Let’s start by defining the CompareByChecksum() method:

public static bool CompareByChecksum(string firstFilePath, string secondFilePath)
{
    using var stream1 = new FileStream(firstFilePath, FileMode.Open, FileAccess.Read);
    using var stream2 = new FileStream(secondFilePath, FileMode.Open, FileAccess.Read);

    if (stream1.Length != stream2.Length)
    {
        return false;
    }

    Span<byte> hash1 = stackalloc byte[16];
    Span<byte> hash2 = stackalloc byte[16];

    MD5.HashData(stream1, hash1);
    MD5.HashData(stream2, hash2);

    return hash1.SequenceEqual(hash2);
}

Similar to the other approaches, in the CompareByChecksum() method, we first check if the file sizes are the same. If they are not, we return false. If the file sizes are the same, we create two 16-byte buffers to store the MD5 hashes. We then use the HashData() method of the MD5 class to compute each file’s hash. Finally, we compare the hashes to determine if the files are the same. We can use this approach to tell files apart even if they have the same name and size.

Let’s verify this by comparing files from the file setup section:

Console.WriteLine($"CompareByChecksum: {FileComparer.CompareByChecksum(firstFilePath, secondFilePath)}");

Let’s inspect the output:

CompareByChecksum: False

A hash function always produces the same value for the same content. Moreover, hash functions are language agnostic, meaning we can create a hash value in C# and reuse it in other programming languages and vice versa. Therefore, unlike the previous approaches, a checksum-based comparison does not need both files to be available at comparison time. Instead, we can compute the hashes beforehand, store them, and reuse them whenever we need to compare files.
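
The idea of computing a hash beforehand and reusing it can be sketched as follows. The ComputeChecksum() helper, the temporary file path, and the storedChecksum variable are hypothetical names introduced for this example; in a real application, the stored value would typically live in a database or a manifest file.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper: returns the MD5 checksum of a file as a hex string.
static string ComputeChecksum(string filePath)
{
    using var stream = File.OpenRead(filePath);
    return Convert.ToHexString(MD5.HashData(stream));
}

// Demo file; the location is an assumption made for this sketch.
var filePath = Path.Combine(Path.GetTempPath(), "checksum-demo.txt");
File.WriteAllText(filePath, "hello, world 1");

// Compute the checksum once and keep it for later.
var storedChecksum = ComputeChecksum(filePath);

// Later, we only need to read the current file and compare hex strings.
var unchanged = ComputeChecksum(filePath) == storedChecksum;
Console.WriteLine(unchanged); // True

// After the content changes, the stored checksum no longer matches.
File.WriteAllText(filePath, "hello, world 2");
var stillUnchanged = ComputeChecksum(filePath) == storedChecksum;
Console.WriteLine(stillUnchanged); // False
```

Storing the hash as a hex string keeps it easy to persist and to compare with checksums produced by other tools or languages.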

Compare Two Files Based on Their Difference

Knowing whether files are identical is not always enough; sometimes, we may want to know the difference between them. This is useful when we are doing something similar to version control. In this section, we will look at how to implement this using DiffPlex, a NuGet package for textual diff.

First, let’s install the DiffPlex package:

dotnet add package DiffPlex

Next, let’s implement the ComputeDiff() method:

public static List<DiffPiece> ComputeDiff(string firstFilePath, string secondFilePath)
{
    var text1 = File.ReadAllText(firstFilePath);
    var text2 = File.ReadAllText(secondFilePath);

    return InlineDiffBuilder.Diff(text1, text2).Lines;
}

The Diff() static method of the InlineDiffBuilder class helps us compute the difference and the type of operation for each line.

Finally, let’s leverage the ComputeDiff() method to compute the diff between the two files:

var result = FileComparer.ComputeDiff(firstFilePath, secondFilePath);
foreach (var diff in result)
{
    Console.WriteLine($"{diff.Type}: {diff.Text}");
}

The Type property indicates the type of change for each line (for example, Inserted, Deleted, or Modified), and the Text property contains the actual content of the line. The value of the Type property is Unchanged if the line is the same in both files.

Let’s inspect the output:

Deleted: hello, world 1
Inserted: hello, world 2

Benchmark Comparing Two Files

Now, let’s benchmark the different approaches to comparing files using the BenchmarkDotNet library. In this article, we only inspect the benchmark outputs.

Let’s look at the benchmarks:

| Method                     | Mean         | Error      | StdDev     | Gen0   | Allocated |
|--------------------------- |-------------:|-----------:|-----------:|-------:|----------:|
| CompareByNameAndSize       | 45,187.32 ns | 152.109 ns | 127.018 ns | 0.6714 |    2272 B |
| CompareByBytes             | 81,666.18 ns | 296.652 ns | 262.974 ns | 3.9063 |   12577 B |
| CompareByChecksum          | 83,336.89 ns | 220.490 ns | 184.119 ns | 0.7324 |    2305 B |

Overall, the CompareByNameAndSize() method is the fastest of the three approaches because it only reads file metadata, but as we saw, it does not inspect the content. The CompareByBytes() and CompareByChecksum() methods are slower and take roughly the same time in this benchmark, although CompareByBytes() allocates more memory. Moreover, their execution time grows with the file size, because both have to read the file content. It is important to keep in mind that the benchmark only provides a generic comparison, and depending on the use case, the performance may vary.

Conclusion

We learned about different approaches to comparing files in C#. We started by comparing files based on name and size and then compared files based on their bytes. Subsequently, we looked at how we could compare files based on their checksum, which is more reliable than comparing names and sizes and lets us reuse precomputed hashes. Finally, we learned how to compute the difference between two files using DiffPlex and benchmarked the performance of the comparison approaches.
