In this article, we will learn how to compare two files in C#.
There are several reasons why we might want to compare two files in our code, such as integrity checks, detecting duplicate files, version control, synchronization, and more. But what are the different ways we can compare two files using C#, and why is it not enough to compare using the file name and size?
Let’s answer those questions.
Setup Two Files
In this section, we will prepare two text files that we will use in the subsequent sections to investigate different file comparison approaches. We name our first file hello-world.txt:
hello, world 1
The relative path of our first file is files/batch1/hello-world.txt. Next, we give our second file the same name as the first file, hello-world.txt. However, unlike the first file, its relative path is files/batch2/hello-world.txt. Giving the two files the same name will help us later to understand the flaws of some of the approaches. Finally, let’s inspect the content of the second file:
hello, world 2
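The setup above can also be scripted. The following is a minimal sketch, assuming we run it from the project directory and that the files do not need a trailing newline; the WriteSampleFile() helper is a hypothetical name introduced only for this illustration:

```csharp
using System;
using System.IO;

// Hypothetical setup helper (not part of the article's FileComparer class):
// creates the parent folder if needed and writes the given text to the file.
static string WriteSampleFile(string relativePath, string content)
{
    var directory = Path.GetDirectoryName(relativePath);
    if (!string.IsNullOrEmpty(directory))
    {
        Directory.CreateDirectory(directory);
    }

    File.WriteAllText(relativePath, content);
    return relativePath;
}

// Same file name, different folders, different content.
var firstFilePath = WriteSampleFile(
    Path.Combine("files", "batch1", "hello-world.txt"), "hello, world 1");
var secondFilePath = WriteSampleFile(
    Path.Combine("files", "batch2", "hello-world.txt"), "hello, world 2");

// Print the sizes so we can confirm that both files have the same length.
Console.WriteLine($"{firstFilePath}: {new FileInfo(firstFilePath).Length} bytes");
Console.WriteLine($"{secondFilePath}: {new FileInfo(secondFilePath).Length} bytes");
```

Because the two contents differ only in the final character, both files end up with the same size, which is exactly the situation we want for testing the approaches.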
Compare Two Files With File Name and Size
Checking file name and size is the most straightforward approach to comparing two files. In C#, we can use the FileInfo class to compare the file name and size:
public static bool CompareByNameAndSize(
    string firstFilePath,
    string secondFilePath,
    StringComparison fileNameComparisonType)
{
    var fileInfo1 = new FileInfo(firstFilePath);
    var fileInfo2 = new FileInfo(secondFilePath);

    if (fileInfo1.Length != fileInfo2.Length)
    {
        return false;
    }

    return fileInfo1.Name.Equals(fileInfo2.Name, fileNameComparisonType);
}
This approach is useful for a quick first check when we know that file names are unique. If two files have different sizes, they are certainly different. However, having the same size does not mean they are the same; therefore, this approach is unreliable. For example, let’s consider the two files we prepared in the setup section. They have the same name (hello-world.txt) and size but different content.
To learn more about the FileInfo class, be sure to check out our article File and FileInfo Class Comparation in C#.
Let’s call the CompareByNameAndSize() method and pass these two files as arguments:
Console.WriteLine($"CompareByNameAndSize: {FileComparer.CompareByNameAndSize(firstFilePath, secondFilePath, StringComparison.OrdinalIgnoreCase)}");
Let’s inspect the output:
CompareByNameAndSize: True
The CompareByNameAndSize() method returns true because the file name and size are the same, which is not the desired result, as the files’ content is different. So, let’s look at different approaches to address this issue.
Compare Two Files Based on Their Bytes
In this section, we will compare two files by iterating over their byte content and checking if all the bytes are equal. We start by checking if the file sizes are the same. If they are not, we return false. However, if the file sizes are the same, we read the bytes of both files and compare them byte by byte. If all the bytes are the same, we return true. Otherwise, we return false.
Let’s implement this approach:
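public static bool CompareByBytes(string firstFilePath, string secondFilePath)
{
    const int bufferSize = 1024;

    using var stream1 = new FileStream(firstFilePath, FileMode.Open, FileAccess.Read);
    using var stream2 = new FileStream(secondFilePath, FileMode.Open, FileAccess.Read);

    // Files of different lengths can never be equal.
    if (stream1.Length != stream2.Length)
    {
        return false;
    }

    Span<byte> buffer1 = new byte[bufferSize];
    Span<byte> buffer2 = new byte[bufferSize];

    while (true)
    {
        // ReadAtLeast (.NET 7+) fills the buffer unless the stream ends first,
        // so a short read is not mistaken for a difference between the files.
        var bytesRead1 = stream1.ReadAtLeast(buffer1, bufferSize, throwOnEndOfStream: false);
        var bytesRead2 = stream2.ReadAtLeast(buffer2, bufferSize, throwOnEndOfStream: false);

        if (bytesRead1 != bytesRead2)
        {
            return false;
        }

        // Both streams are exhausted and no difference was found.
        if (bytesRead1 == 0)
        {
            return true;
        }

        // Compare only the bytes that were actually read in this chunk.
        if (!buffer1[..bytesRead1].SequenceEqual(buffer2[..bytesRead2]))
        {
            return false;
        }
    }
}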
In our CompareByBytes() method, we first check if the file sizes are the same. If they are not, we return false. Then, we read chunks of bytes from both files and compare them. If the chunks are equal, we continue with the next chunk. Otherwise, we return false. Making the comparison in chunks allows us to compare large files without loading the entire file into memory. Moreover, as soon as one chunk differs, we can return false without reading the rest of the file.
The SequenceEqual() method is an extension method from the System namespace that determines whether the two chunks of bytes are equal. Let’s call the CompareByBytes() method and pass the files from the setup section as arguments:
Console.WriteLine($"CompareByBytes: {FileComparer.CompareByBytes(firstFilePath, secondFilePath)}");
Let’s examine the output:
CompareByBytes: False
The CompareByBytes() method returns false because the files’ content is different. This approach is more reliable than comparing file names and sizes. However, it is slower than the name-and-size check, as we will see in the benchmark section.
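For contrast with the chunked approach, here is a minimal sketch of a simpler variant that loads both files fully into memory. It is only reasonable for small files, and the CompareSmallFiles() name is a hypothetical helper introduced for this illustration, not part of the article’s FileComparer class:

```csharp
using System;
using System.IO;

// Hypothetical helper for small files only: reads both files completely
// into memory and compares the resulting byte sequences.
static bool CompareSmallFiles(string firstFilePath, string secondFilePath)
{
    // Cheap early exit: files of different lengths cannot be equal.
    if (new FileInfo(firstFilePath).Length != new FileInfo(secondFilePath).Length)
    {
        return false;
    }

    ReadOnlySpan<byte> first = File.ReadAllBytes(firstFilePath);
    ReadOnlySpan<byte> second = File.ReadAllBytes(secondFilePath);

    return first.SequenceEqual(second);
}

// Demo on throwaway files in a temporary folder.
var folder = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
Directory.CreateDirectory(folder);

var fileA = Path.Combine(folder, "a.txt");
var fileB = Path.Combine(folder, "b.txt");
var fileC = Path.Combine(folder, "c.txt");
File.WriteAllText(fileA, "hello, world 1");
File.WriteAllText(fileB, "hello, world 2");
File.WriteAllText(fileC, "hello, world 1");

var differentContent = CompareSmallFiles(fileA, fileB); // same size, different bytes
var sameContent = CompareSmallFiles(fileA, fileC);      // same size, same bytes
Console.WriteLine($"a vs b: {differentContent}, a vs c: {sameContent}");

Directory.Delete(folder, recursive: true);
```

The trade-off is memory: this variant allocates as many bytes as both files contain, whereas the chunked method keeps its memory use fixed regardless of file size.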
Compare Two Files Based on Their Checksum
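A file checksum is a fixed-size value computed from a file’s content. For practical purposes, the checksum identifies the content: even a tiny change in the file will almost always result in a different checksum value. Collisions are still possible, because many different contents map to the same fixed-size value, and MD5 in particular should not be relied on when files may have been crafted maliciously. In this section, we will look at how we can implement checksum-based file comparison in C#.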
Let’s start by defining the CompareByChecksum() method:
public static bool CompareByChecksum(string firstFilePath, string secondFilePath)
{
    using var stream1 = new FileStream(firstFilePath, FileMode.Open, FileAccess.Read);
    using var stream2 = new FileStream(secondFilePath, FileMode.Open, FileAccess.Read);

    if (stream1.Length != stream2.Length)
    {
        return false;
    }

    Span<byte> hash1 = stackalloc byte[16];
    Span<byte> hash2 = stackalloc byte[16];

    MD5.HashData(stream1, hash1);
    MD5.HashData(stream2, hash2);

    return hash1.SequenceEqual(hash2);
}
Similar to the other approaches, in the CompareByChecksum() method, we first check if the file sizes are the same. If they are not, we return false. If the file sizes are the same, we allocate two 16-byte buffers on the stack to store the MD5 hashes. We then use the HashData() method of the MD5 class to compute each file’s hash. Finally, we compare the hashes to determine if the files are the same. This approach detects a difference even when the files have the same name and size.
Let’s verify this by comparing files from the file setup section:
Console.WriteLine($"CompareByChecksum: {FileComparer.CompareByChecksum(firstFilePath, secondFilePath)}");
Let’s inspect the output:
CompareByChecksum: False
A hash function always produces the same value for the same content. Moreover, hash functions are language agnostic, meaning we can create a hash value in C# and reuse it in other programming languages and vice versa. Therefore, unlike previous approaches, a comparison based on checksum is less reliant on real-time computing. Instead, we can compute the hashes beforehand and reuse them whenever we need to compare files.
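To illustrate reusing a precomputed hash, here is a minimal sketch that stores a checksum as a hex string and later compares the file against it. The ComputeChecksum() helper and the temporary file name are hypothetical, introduced only for this example, and Convert.ToHexString() assumes .NET 5 or later:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper: returns the file's MD5 checksum as an uppercase
// hex string, which is easy to store or exchange with other tools.
static string ComputeChecksum(string filePath)
{
    using var stream = File.OpenRead(filePath);
    using var md5 = MD5.Create();
    return Convert.ToHexString(md5.ComputeHash(stream));
}

var path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
File.WriteAllText(path, "hello, world 1");

// Computed once; in practice we might save this next to the file or in a database.
var storedChecksum = ComputeChecksum(path);

// Later: the file is unchanged, so the fresh checksum matches the stored one.
var unchanged = ComputeChecksum(path) == storedChecksum;

// After the content changes, the fresh checksum no longer matches.
File.WriteAllText(path, "hello, world 2");
var changed = ComputeChecksum(path) != storedChecksum;

Console.WriteLine($"Unchanged file matches: {unchanged}, modified file differs: {changed}");

File.Delete(path);
```

With a stored checksum, each later comparison reads only one file instead of two, which is where this approach can pay off when the same file is checked repeatedly.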
Compare Two Files Based on Their Difference
Knowing whether files are identical is not always enough; sometimes, we want to know how they differ. This is useful when we are doing something similar to version control. In this section, we will look at how to implement this using DiffPlex, a NuGet package for textual diff.
First, let’s install the DiffPlex package:
dotnet add package DiffPlex
Next, let’s implement the ComputeDiff() method:
public static List<DiffPiece> ComputeDiff(string firstFilePath, string secondFilePath)
{
    var text1 = File.ReadAllText(firstFilePath);
    var text2 = File.ReadAllText(secondFilePath);

    return InlineDiffBuilder.Diff(text1, text2).Lines;
}
The Diff() static method of the InlineDiffBuilder class helps us compute the difference and the type of operation for each line.
Finally, let’s leverage the ComputeDiff() method to compute the diff between the two files:
var result = FileComparer.ComputeDiff(firstFilePath, secondFilePath);

foreach (var diff in result)
{
    Console.WriteLine($"{diff.Type}: {diff.Text}");
}
The Type property indicates the type of change for each line (for example, Inserted, Deleted, or Modified), and the Text property contains the actual content of the line. The value of the Type property is Unchanged if the line has no difference.
Let’s inspect the output:
Deleted: hello, world 1
Inserted: hello, world 2
Benchmark Comparing Two Files
Now, let’s benchmark the performance of the different approaches to comparing files using the BenchmarkDotNet library. In this article, we only inspect the benchmark outputs.
Let’s look at the benchmarks:
| Method               |         Mean |      Error |     StdDev |   Gen0 | Allocated |
|--------------------- |-------------:|-----------:|-----------:|-------:|----------:|
| CompareByNameAndSize | 45,187.32 ns | 152.109 ns | 127.018 ns | 0.6714 |    2272 B |
| CompareByBytes       | 81,666.18 ns | 296.652 ns | 262.974 ns | 3.9063 |   12577 B |
| CompareByChecksum    | 83,336.89 ns | 220.490 ns | 184.119 ns | 0.7324 |    2305 B |
Overall, the CompareByNameAndSize() method is the fastest of the three, although, as we saw, it does not inspect the content. The CompareByBytes() and CompareByChecksum() methods are slower and take roughly the same time here, with CompareByBytes() allocating the most memory. Moreover, their running time grows with the file size, since both read the files’ content. It is important to keep in mind that the benchmark only provides a generic comparison, and depending on the use case, the performance may vary.
Conclusion
We learned about different approaches to comparing files in C#. We started by comparing files based on name and size and then compared files byte by byte based on their content. Subsequently, we looked at how we could compare files based on their checksum, which is more reliable than comparing name and size and lets us reuse precomputed hashes. Finally, we learned how to compute the difference between two files using DiffPlex and benchmarked the performance of the comparison approaches.