In this article, we are going to cover different methods for removing duplicates from an array in C#. In addition, we are going to compare these methods to check their performance.
Arrays contain multiple values of the same type. It's common to have some duplicate elements in an array, but in some cases, we want the elements to be unique.
We can achieve this manually or by using some of the .NET inbuilt methods.
Let’s start.
Remove Duplicates Using the LINQ Distinct Method
To begin with, let’s create an array containing some duplicate values. We will use this array for all the methods defined:
string[] arrayWithDuplicateValues = new string[] { "Lovely", "Lovely", "Boy", "Boy" };
To remove the duplicate values from our array, we can use the LINQ Distinct() method:
// before: {"Lovely", "Lovely", "Boy", "Boy" } var distinctArray = arrayWithDuplicateValues.Distinct().ToArray(); // after: {"Lovely", "Boy" }
We invoke the Distinct() method on the array, which returns an enumerable of the unique values. ToArray() then converts that result into an array.
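As a quick sanity check, a minimal sketch of printing the result (assuming a console application) might look like this:
Console.WriteLine(string.Join(", ", distinctArray));
// output: Lovely, Boy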
Remove Duplicates Using HashSet
HashSet is an unordered collection of unique elements of a specific type. We can remove duplicates from an array by converting it to a HashSet and then converting it back to an array:
var distinctArray = arrayWithDuplicateValues.ToHashSet().ToArray();
We use ToHashSet() on the array to remove the duplicate values. The ToHashSet() method returns a HashSet collection, and ToArray() converts the result back to an array.
Using the HashSet Constructor
Similarly, one of the HashSet constructors accepts any collection of values of a particular type:
var hashSet = new HashSet<string>(arrayWithDuplicateValues);
var distinctArray = hashSet.ToArray();
By default, the HashSet constructor eliminates the duplicate values while it builds the set.
Remove Duplicates Using the LINQ GroupBy and Select Methods
Another way to remove duplicate elements from an array is by combining the GroupBy() and Select() methods:
var distinctArray = arrayWithDuplicateValues
    .GroupBy(d => d)
    .Select(d => d.First())
    .ToArray();
We use GroupBy() to organize the elements into groups, and then we select the first item from each group and convert the results to an array.
Loop Through the Array and Remove Elements Manually
We can loop through the array and remove the items that occur more than once. We can either reuse the initial array, shifting its elements whenever we hit a duplicate, or use a temporary collection to hold the unique elements:
var size = arrayWithDuplicateValues.Length;

for (int i = 0; i < size; i++)
{
    for (int j = i + 1; j < size; j++)
    {
        if (arrayWithDuplicateValues[i] == arrayWithDuplicateValues[j])
        {
            for (int k = j; k < size - 1; k++)
            {
                arrayWithDuplicateValues[k] = arrayWithDuplicateValues[k + 1];
            }
            j--;
            size--;
        }
    }
}

var distinctArray = arrayWithDuplicateValues[0..size];
The if statement in the second loop compares all elements in the array with each other.
Inside the if block, we iterate through the array starting from the duplicate element's index and shift the remaining elements one position to the left. After each shift, the last element becomes obsolete, so we reduce the working size of the array by one and, once the iteration is over, slice the array down to that size.
This is far from optimal, but we're going to see what the benchmarks show at the end.
Using Swapping Instead of Shifting Elements to Remove Them
We can use a swapping algorithm instead of the shifting one we previously used to increase the performance of our code:
public string[] IterationAndSwappingElements(string[] arrayWithDuplicateValues)
{
    var size = arrayWithDuplicateValues.Length;

    for (int i = 0; i < size; i++)
    {
        for (int j = i + 1; j < size; j++)
        {
            if (arrayWithDuplicateValues[i] == arrayWithDuplicateValues[j])
            {
                size--;
                arrayWithDuplicateValues[j] = arrayWithDuplicateValues[size];
                j--;
            }
        }
    }

    return arrayWithDuplicateValues[0..size];
}
This code is similar to the previous one but, as you will see in the benchmark, faster. Thanks a lot, James Curran, for suggesting this (you can read more about it in the comment section).
Utilizing the Dictionary Key to Remove Duplicates
We can also use a dictionary inside a loop to hold the unique elements of our array:
var dic = new Dictionary<string, int>();

foreach (var s in arrayWithDuplicateValues)
{
    dic.TryAdd(s, 1);
}

var distinctArray = dic.Select(x => x.Key.ToString()).ToArray();
The dictionary keys represent all the unique elements in the array. We’re utilizing one of the important properties of the Dictionary collection – all the keys in the dictionary must be unique.
Improving This Code
If we just modify the last code line to use the Keys property, we can improve performance:
public string[] IterationWithDictionaryOpt(string[] arrayWithDuplicateValues)
{
    var dic = new Dictionary<string, int>();

    foreach (var s in arrayWithDuplicateValues)
    {
        dic.TryAdd(s, 1);
    }

    return dic.Keys.ToArray();
}
We add the implementation in a separate method so that we can compare the two versions in our benchmark test.
Removing Elements by Using Recursion
Similarly, we can also construct a recursive method to remove duplicates in an array:
public string[] RecursiveMethod(string[] arrayWithDuplicateValues, List<string>? mem = default, int index = 0)
{
    if (mem == null)
    {
        mem = new List<string>();
    }

    if (index >= arrayWithDuplicateValues.Length)
    {
        return arrayWithDuplicateValues;
    }

    if (mem.IndexOf(arrayWithDuplicateValues[index]) < 0)
    {
        mem.Add(arrayWithDuplicateValues[index]);
    }

    RecursiveMethod(arrayWithDuplicateValues, mem, index + 1);

    return mem.ToArray();
}
While the function is calling itself, we use a list to hold all the unique values in the array, adding only the values that are not yet in the list. The function stops calling itself when index is equal to the length of the array.
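For illustration, here is a minimal usage sketch; on the initial call, we simply leave the optional parameters at their defaults:
var distinctArray = RecursiveMethod(arrayWithDuplicateValues);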
Performance Metrics
Now, we are going to compare the performance of the different methods discussed so far:
| Method                     | Mean        | Error     | StdDev    | Rank | Gen 0  | Allocated |
|--------------------------- |------------:|----------:|----------:|-----:|-------:|----------:|
| IterationAndSwapping       | 828.1 ns    | 2.83 ns   | 2.65 ns   | 1    | 0.0095 | 40 B      |
| IterationWithDictionaryOpt | 2,968.5 ns  | 3.71 ns   | 2.89 ns   | 2    | 0.0648 | 280 B     |
| IterationWithDictionary    | 3,051.0 ns  | 5.13 ns   | 4.55 ns   | 3    | 0.0992 | 424 B     |
| ConversionToHashSet        | 3,143.7 ns  | 17.67 ns  | 13.80 ns  | 4    | 1.2131 | 5,080 B   |
| HashingMethod              | 3,251.4 ns  | 11.96 ns  | 11.18 ns  | 5    | 1.2131 | 5,080 B   |
| DistinctLINQMethod         | 3,492.4 ns  | 66.41 ns  | 62.12 ns  | 6    | 1.2283 | 5,144 B   |
| RecursiveMethod            | 5,740.9 ns  | 22.60 ns  | 21.14 ns  | 7    | 1.9302 | 8,088 B   |
| GroupByAndSelectLINQMethod | 7,188.4 ns  | 60.13 ns  | 50.21 ns  | 8    | 1.1826 | 4,976 B   |
| IterationAndShifting       | 18,046.1 ns | 172.29 ns | 152.73 ns | 9    | -      | 40 B      |
The results show that IterationAndSwapping() has better performance than all the other methods based on average execution time. Also, the iteration-based methods require less memory than the others. Based on average execution time, IterationAndShifting() is by far the worst method, while RecursiveMethod() is the worst in terms of memory allocation and garbage collection.
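For reference, here is a minimal sketch of how such a comparison could be wired up with BenchmarkDotNet. The class name and the sample data are assumptions, and only two of the techniques appear here; the full suite would contain one [Benchmark] method per approach from this article:
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
[RankColumn]
public class RemoveDuplicatesBenchmark
{
    // hypothetical sample data; the benchmark in this article ran against roughly 170 string items
    private readonly string[] _data =
        Enumerable.Range(0, 170).Select(i => $"Item{i % 85}").ToArray();

    [Benchmark]
    public string[] DistinctLINQMethod() => _data.Distinct().ToArray();

    [Benchmark]
    public string[] ConversionToHashSet() => _data.ToHashSet().ToArray();
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<RemoveDuplicatesBenchmark>();
}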
Conclusion
Now not only do we know how to remove duplicate values from an array, but we also know the performance of each method.
I'm quite surprised that the HashSet variant takes this much memory… That is weird.
However, this benchmark is rather silly, because…
From the performance complexity point of view…
IterationAndSwapping has O(n^2) complexity (an N * N search),
while ConversionToHashSet has O(n) complexity (N lookups at O(1) each).
And from the memory point of view…
The algorithms are not created equal!
A few only read the input array,
but the ones with the smaller memory footprint mutate the input array.
I'd say that is cheating 🙂
From the benchmarking point of view…
I think using string elements here is not nice. Comparing two strings can be a non-trivial operation, so using ints or some other primitive type would be a much more straightforward choice for this benchmark.
If a minimal memory footprint is required, I guess you could optimize the HashSet variant further by implementing the hashing yourself. It is weird that HashSet takes so much memory…
I created a PR to make this benchmark much more precise than the 170 string items from before.
And the results definitely speak for the hashing… 10,000 ints: 140us vs 19,000us…
https://github.com/CodeMazeBlog/CodeMazeGuides/pull/1128
https://pastebin.com/QvxX1AHS
| Method               | N     | PercentDuplicates | Mean          |
|--------------------- |------:|------------------:|--------------:|
| ConversionToHashSet  | 10000 | 0                 | 131.517 us    |
| ConversionToHashSet  | 10000 | 25                | 141.748 us    |
| ConversionToHashSet  | 10000 | 50                | 151.430 us    |
| ConversionToHashSet  | 10000 | 100               | 153.863 us    |
| ConversionToHashSet  | 10000 | 75                | 163.419 us    |
| IterationAndSwapping | 10000 | 100               | 16,306.401 us |
| IterationAndSwapping | 10000 | 75                | 17,706.000 us |
| IterationAndSwapping | 10000 | 50                | 19,340.314 us |
| IterationAndSwapping | 10000 | 25                | 20,492.625 us |
| IterationAndSwapping | 10000 | 0                 | 21,640.652 us |
I was rather surprised to see IterationWithDictionary take the top spot. I would think it's essentially ConversionToHashSet with some extra overhead. But, anyway, it could be improved. Replacing the return line with just
return dic.Keys.ToArray();
reduces its time by about 10% and memory by about 33%. BUT, the biggest effect I was able to achieve was a simple change to IterationAndShifting, which moved from a distant last place to way out in front: it literally cut the time by over 97%!!
Instead of shifting everything down, just replace the duplicate item with the last element of the array.
My changes are at: jamescurran/CodeMazeGuides: The main repository for all the Code Maze guides (github.com)
I’ve sent a pull request.
Forgot to include this
(unfortunately, I can’t find an option to make WordPress use a monospaced font.)
Thank you very much, James, for your suggestions. I've verified the PR, and it looks great. We will accept it and also modify this article a bit. Also, if you want, we always have open positions for authors (a paid, remote position), so if you are interested in sharing knowledge with our readers, everyone would benefit from that 🙂
I was thinking about it before I posted my comment (I first saw the request for authors when I was trying to figure out how to log onto this site).
But I'm not that good at generating original content (as you can tell from my blog, which has six new articles in the last four years). I'm better at looking at other work and finding ways to improve it.
🙂 Okay. Thank you for the reply. Of course, if you change your mind, we are here.
James, I have gone through the above solutions and none of them are working for me. For example, there is nothing like arrayWithDuplicateValues[0..size]; in C#, and there is also nothing called TryAdd. Not sure if I am wrong somewhere.