In this article, we are going to share different methods of removing duplicates from an array in C#. In addition, we are going to compare these methods to check their performance.
Arrays contain multiple values of the same type, and it's a common occurrence for some of those values to be duplicates. In some cases, though, we want the elements to be unique.
We can achieve this manually or by using some of the .NET inbuilt methods.
Let’s start.
Remove Duplicates Using the LINQ Distinct Method
To begin with, let’s create an array containing some duplicate values. We will use this array for all the methods defined:
string[] arrayWithDuplicateValues = new string[] { "Lovely", "Lovely", "Boy", "Boy" };
For the duplicate values in our array, we can use the LINQ Distinct() method to remove them:
// before: {"Lovely", "Lovely", "Boy", "Boy"}
var distinctArray = arrayWithDuplicateValues.Distinct().ToArray();
// after: {"Lovely", "Boy"}
We invoke the Distinct() method on the array, which returns an IEnumerable<string> containing only the unique values. ToArray() then converts that sequence back to an array.
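As a quick check (a minimal sketch, assuming a console application), we can print the result:

using System;
using System.Linq;

string[] arrayWithDuplicateValues = new string[] { "Lovely", "Lovely", "Boy", "Boy" };
var distinctArray = arrayWithDuplicateValues.Distinct().ToArray();

Console.WriteLine(string.Join(", ", distinctArray)); // Lovely, Boy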
Remove Duplicates Using HashSet
A HashSet is an unordered collection of unique elements of a specific type. We can remove duplicates from an array by converting it to a HashSet and then converting it back to an array:
var distinctArray = arrayWithDuplicateValues.ToHashSet().ToArray();
We call ToHashSet() on the array to remove the duplicate values. The ToHashSet() method returns a HashSet<string>, and ToArray() converts the result back to an array.
Using the HashSet Constructor
Similarly, one of the constructors for HashSet accepts any collection of values of a particular type:
var hashSet = new HashSet<string>(arrayWithDuplicateValues);
var distinctArray = hashSet.ToArray();
The HashSet discards the duplicate values while it is being constructed, so no extra work is required on our part.
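Since the constructor is generic, we can wrap the technique in a small extension method that works for any element type (a minimal sketch; the class and method names are ours):

using System.Collections.Generic;
using System.Linq;

public static class ArrayExtensions
{
    // Works for any element type with sensible equality semantics
    public static T[] RemoveDuplicates<T>(this T[] source)
    {
        return new HashSet<T>(source).ToArray();
    }
}

With this in place, the call site becomes arrayWithDuplicateValues.RemoveDuplicates().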
Remove Duplicates Using the LINQ GroupBy and Select Methods
Another way to remove duplicate elements in an array is by combining the GroupBy() and Select() methods:
var distinctArray = arrayWithDuplicateValues
    .GroupBy(d => d)
    .Select(d => d.First())
    .ToArray();
We use GroupBy() to organize the elements into groups, and then we select the first item from each of the groups and convert them to an array.
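To see what GroupBy() actually produces, we can inspect the groups before selecting from them (a short illustrative sketch):

using System;
using System.Linq;

string[] arrayWithDuplicateValues = new string[] { "Lovely", "Lovely", "Boy", "Boy" };

foreach (var group in arrayWithDuplicateValues.GroupBy(d => d))
{
    // Each group's key is the element itself; Count() tells us how often it occurs
    Console.WriteLine($"{group.Key}: {group.Count()}");
}
// Output:
// Lovely: 2
// Boy: 2

Selecting First() from each group is what collapses every group back down to a single element.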
Loop Through the Array and Remove Elements Manually
We can loop through the array and remove the items that occur more than once. We can either reuse the initial array, shifting its elements whenever we hit a duplicate, or use a temporary collection to hold the unique elements:
var size = arrayWithDuplicateValues.Length;

for (int i = 0; i < size; i++)
{
    for (int j = i + 1; j < size; j++)
    {
        if (arrayWithDuplicateValues[i] == arrayWithDuplicateValues[j])
        {
            for (int k = j; k < size - 1; k++)
            {
                arrayWithDuplicateValues[k] = arrayWithDuplicateValues[k + 1];
            }

            j--;
            size--;
        }
    }
}

var distinctArray = arrayWithDuplicateValues[0..size];
The if statement in the inner loop compares each element with every element that comes after it.
When we find a duplicate, we iterate from its index to the end of the used portion of the array and shift each element one position to the left, overwriting the duplicate. After the shift, the last element becomes obsolete. Hence, we reduce size by 1 and cut the array down to size once the iteration is over.
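To make the shifting concrete, here is how our sample array evolves (annotated by hand):

// {"Lovely", "Lovely", "Boy", "Boy"}   size = 4
// i = 0, j = 1: duplicate "Lovely", shift the rest left:
// {"Lovely", "Boy", "Boy", "Boy"}      size = 3 (the last slot is now stale)
// i = 1, j = 2: duplicate "Boy" is already in the last used slot,
//               so nothing moves; we only shrink:  size = 2
// arrayWithDuplicateValues[0..size]    -> {"Lovely", "Boy"}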
This is far from optimal, but we're going to see what the benchmarks show at the end.
Using Swapping Instead of Shifting Elements to Remove Them
We can use a swapping algorithm instead of the shifting one to increase the performance of our code:
public string[] IterationAndSwappingElements(string[] arrayWithDuplicateValues)
{
    var size = arrayWithDuplicateValues.Length;

    for (int i = 0; i < size; i++)
    {
        for (int j = i + 1; j < size; j++)
        {
            if (arrayWithDuplicateValues[i] == arrayWithDuplicateValues[j])
            {
                size--;
                arrayWithDuplicateValues[j] = arrayWithDuplicateValues[size];
                j--;
            }
        }
    }

    return arrayWithDuplicateValues[0..size];
}
This code is similar to the previous example but, as we will see in the benchmark, faster: instead of shifting every subsequent element, we overwrite the duplicate with the last valid element in a single assignment. Thanks a lot, James Curran, for suggesting this (you can read more about it in the comment section).
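One trade-off worth noting: because we move the last element into the duplicate's slot, this approach does not preserve the original order of the elements. A quick sketch, assuming the method above is in scope:

var input = new[] { "A", "A", "B", "C" };
var result = IterationAndSwappingElements(input);

// Shifting would have produced {"A", "B", "C"};
// swapping moves the last element into the freed slot instead
Console.WriteLine(string.Join(", ", result)); // A, C, B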
Utilizing the Dictionary Key to Remove Duplicates
We can also use a dictionary inside a loop to collect the unique elements of our array:
var dic = new Dictionary<string, int>();

foreach (var s in arrayWithDuplicateValues)
{
    dic.TryAdd(s, 1);
}

var distinctArray = dic.Select(x => x.Key.ToString()).ToArray();
The dictionary keys represent all the unique elements in the array. We’re utilizing one of the important properties of the Dictionary collection – all the keys in the dictionary must be unique.
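TryAdd() is what keeps the loop simple: unlike Add(), it doesn't throw when the key already exists; it just returns false and leaves the dictionary unchanged:

using System;
using System.Collections.Generic;

var dic = new Dictionary<string, int>();

Console.WriteLine(dic.TryAdd("Lovely", 1)); // True - first occurrence is stored
Console.WriteLine(dic.TryAdd("Lovely", 1)); // False - key already present, nothing happens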
Improving This Code
If we modify just the last line of that code to use the Keys property, we can improve performance:
public string[] IterationWithDictionaryOpt(string[] arrayWithDuplicateValues)
{
    var dic = new Dictionary<string, int>();

    foreach (var s in arrayWithDuplicateValues)
    {
        dic.TryAdd(s, 1);
    }

    return dic.Keys.ToArray();
}
We place this implementation in a separate method so that we can compare both versions in our benchmark test.
Removing Elements by Using Recursion
Similarly, we can also construct a recursive method to remove duplicates in an array:
public string[] RecursiveMethod(string[] arrayWithDuplicateValues, List<string>? mem = default, int index = 0)
{
    // Create the list of unique values on the first call
    mem ??= new List<string>();

    // Base case: every element has been visited
    if (index >= arrayWithDuplicateValues.Length)
    {
        return mem.ToArray();
    }

    // Only add values that are not in the list yet
    if (mem.IndexOf(arrayWithDuplicateValues[index]) < 0)
    {
        mem.Add(arrayWithDuplicateValues[index]);
    }

    return RecursiveMethod(arrayWithDuplicateValues, mem, index + 1);
}
While the function calls itself, we use a list to hold all the unique values found so far, adding only the values that are missing from it. The recursion stops when index is equal to the length of the array, at which point we convert the list to an array and return it.
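Calling the method is straightforward since the optional parameters take care of themselves (a small usage sketch):

var distinctArray = RecursiveMethod(new[] { "Lovely", "Lovely", "Boy", "Boy" });

Console.WriteLine(string.Join(", ", distinctArray)); // Lovely, Boy

Keep in mind that each element adds a frame to the call stack, so very large arrays risk a StackOverflowException.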
Performance Metrics
Now, let's compare the performance of the different methods discussed so far. The numbers below come from a BenchmarkDotNet run.
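As a rough sketch of the harness (the class name, sample data, and setup are ours; the real configuration may differ), each technique gets its own [Benchmark] method:

using System;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
[RankColumn]
public class RemoveDuplicatesBenchmark
{
    private string[] _array = Array.Empty<string>();

    [GlobalSetup]
    public void Setup() =>
        _array = Enumerable.Range(0, 100).Select(i => $"Item{i % 50}").ToArray();

    [Benchmark]
    public string[] DistinctLINQMethod() => _array.Distinct().ToArray();

    [Benchmark]
    public string[] ConversionToHashSet() => _array.ToHashSet().ToArray();

    // ...one [Benchmark] method per technique; the mutating methods
    // (shifting, swapping) should each work on a fresh copy of the array
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<RemoveDuplicatesBenchmark>();
}

Running the harness produces results along these lines: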
| Method | Mean | Error | StdDev | Rank | Gen 0 | Allocated |
|--------------------------- |------------:|----------:|----------:|-----:|-------:|----------:|
| IterationAndSwapping | 828.1 ns | 2.83 ns | 2.65 ns | 1 | 0.0095 | 40 B |
| IterationWithDictionaryOpt | 2,968.5 ns | 3.71 ns | 2.89 ns | 2 | 0.0648 | 280 B |
| IterationWithDictionary | 3,051.0 ns | 5.13 ns | 4.55 ns | 3 | 0.0992 | 424 B |
| ConversionToHashSet | 3,143.7 ns | 17.67 ns | 13.80 ns | 4 | 1.2131 | 5,080 B |
| HashingMethod | 3,251.4 ns | 11.96 ns | 11.18 ns | 5 | 1.2131 | 5,080 B |
| DistinctLINQMethod | 3,492.4 ns | 66.41 ns | 62.12 ns | 6 | 1.2283 | 5,144 B |
| RecursiveMethod | 5,740.9 ns | 22.60 ns | 21.14 ns | 7 | 1.9302 | 8,088 B |
| GroupByAndSelectLINQMethod | 7,188.4 ns | 60.13 ns | 50.21 ns | 8 | 1.1826 | 4,976 B |
| IterationAndShifting | 18,046.1 ns | 172.29 ns | 152.73 ns | 9 | - | 40 B |
The results show that IterationAndSwapping() has the best performance of all the methods based on the average execution time. The iteration-based methods also require far less memory than the rest. Based on the average execution time, IterationAndShifting() is by far the worst method, while RecursiveMethod() is the worst in terms of memory allocation and garbage collection.
Conclusion
Now, not only do we know how to remove duplicate values from an array, but we also know how each of these methods performs.