Array

Remove Duplicates From a C# Array

In this article, we are going to share the different methods on how to remove duplicates from an array in C#. In addition, we are going to compare these methods to check their performance.

Arrays contain multiple values of the same type. It’s a common occurrence to have some duplicate elements in an array. Although, in some cases, we want the elements to be unique.

We can achieve this manually or by using some of the .NET inbuilt methods.

To download the source code for this article, you can visit our GitHub repository.

Let’s start.

Remove Duplicates Using the LINQ Distinct Method

To begin with, let’s create an array containing some duplicate values. We will use this array for all the methods defined:

string[] arrayWithDuplicateValues = new string[] { "Lovely", "Lovely", "Boy", "Boy" };

For the duplicate values in our array, we can use the LINQ Distinct() method to remove them:

// before: {"Lovely", "Lovely", "Boy", "Boy" }
var distinctArray = arrayWithDuplicateValues.Distinct().ToArray(); // after:  {"Lovely", "Boy" }

This is done by invoking the Distinct() method on the array. The response is an enumerable object. ToArray() converts the response to an array with unique values.

Remove Duplicates Using HashSet

HashSet is an unordered collection of a specific type. We can remove duplicates from an array by casting it to a HashSet and then converting it back to an array: 

var distinctArray = arrayWithDuplicateValues.ToHashSet().ToArray();

We use ToHashSet() on the array to remove the duplicate values. The ToHashSet() method returns an enumerable collection. ToArray() will alter the response to an array.

Using the HashSet Constructor

Similarly, one of the constructors for HashSet accepts any collection of values of a particular type:

var hashSet = new HashSet<T>(arrayWithDuplicateValues); 
var distinctArray = hashSet.ToArray();

By default, the compiler will eliminate the duplicate values while constructing the HashSet.

Remove Duplicates Using LINQ GroupBy and Select Method

Another way to remove duplicate elements in an array is by combining the GroupBy() and Select() methods:

var distinctArray = arrayWithDuplicateValues.GroupBy(d => d).Select(d => d.First()).ToArray();

We use GroupBy() to organize the elements into groups, and then we select the first item from each of the groups and convert them to an array. 

Loop Through the Array and Remove Elements Manually

We can loop through the array and remove the items that occur more than once. We can use the initial array but shift the elements in the array once we hit a duplicate or use a temporary collection to hold the unique elements:

var size = arrayWithDuplicateValues.Length;

for (int i = 0; i < size; i++)
{
    for (int j = i + 1; j < size; j++)
    {
        if (arrayWithDuplicateValues[i] == arrayWithDuplicateValues[j])
        {
            for (int k = j; k < size - 1; k++)
            {
                arrayWithDuplicateValues[k] = arrayWithDuplicateValues[k + 1];
            }
            j--;
            size--;
        }
    }

}

var distinctArray = arrayWithDuplicateValues[0..size]; 

The if statement in the second loop compares all elements in the array with each other.

Inside the if block, we iterate through the entire array from the duplicate element index and shift the elements to one position left. The order of elements in our array changes during shifting. Once we are done with our iteration, the last element becomes obsolete. Hence, we will reduce the size of the array by 1 and cut it when the iteration is over.

This is far from optimal, but we’re going to see what the benchmark show at the end.

Using Swapping Instead of  Shifting Elements to Remove Them

We can use a swapping algorithm instead of the shifting one (the one we previously used) to increase the performance of our code:

public string[] IterationAndSwappingElements(string[] arrayWithDuplicateValues)
{
    var size = arrayWithDuplicateValues.Length;

    for (int i = 0; i < size; i++)
    {
        for (int j = i + 1; j < size; j++)
        {
            if (arrayWithDuplicateValues[i] == arrayWithDuplicateValues[j])
            {
                size--;
                arrayWithDuplicateValues[j] = arrayWithDuplicateValues[size];
                j--;
            }
        }
    }

    return arrayWithDuplicateValues[0..size];
}

This is a similar code to the previous one, but as you will see in the benchmark a faster one. Thanks a lot, James Curran for suggesting this (you can read more about it in the comment section). 

Utilizing the Dictionary Key to Remove Duplicates

We can also use a dictionary within the for loop to hold the unique elements in our array:

var dic = new Dictionary<string, int>();

foreach (var s in arrayWithDuplicateValues)
{
    dic.TryAdd(s, 1);
}

var distinctArray = dic.Select(x => x.Key.ToString()).ToArray();

The dictionary keys represent all the unique elements in the array. We’re utilizing one of the important properties of the Dictionary collection – all the keys in the dictionary must be unique.

Improving This Code

If we just modify the last code line to use the Keys property, we can improve performance:

public string[] IterationWithDictionaryOpt(string[] arrayWithDuplicateValues)
{
    var dic = new Dictionary<string, int>();

    foreach (var s in arrayWithDuplicateValues)
    {
        dic.TryAdd(s, 1);
    }

    return dic.Keys.ToArray();
}

We add the implementation in a different method so we could compare them in our benchmark test.

Removing Elements by Using Recursion

Similarly, we can also construct a recursive method to remove duplicates in an array:

public string[] RecursiveMethod(string[] arrayWithDuplicateValues, List<string>? mem = default, int index = 0)
{
    if (mem == null)
    {
        mem = new List<string>();
    }

    if (index >= arrayWithDuplicateValues.Length)
    {
        return arrayWithDuplicateValues;
    }

    if (mem.IndexOf(arrayWithDuplicateValues[index]) < 0)
    {
        mem.Add(arrayWithDuplicateValues[index]);
    }

    RecursiveMethod(arrayWithDuplicateValues, mem, index + 1);

    return mem.ToArray();
}

While the function is calling itself, we are using a list to hold all unique values in the array. During this process, we only add values missing to the list. The function stops calling itself when index is equal to the length of the array.

Performance Metrics

Now, we are going to compare the performance of the different methods discussed so far:

|                     Method |        Mean |     Error |    StdDev | Rank |  Gen 0 | Allocated |
|--------------------------- |------------:|----------:|----------:|-----:|-------:|----------:|
|       IterationAndSwapping |    828.1 ns |   2.83 ns |   2.65 ns |    1 | 0.0095 |      40 B |
| IterationWithDictionaryOpt |  2,968.5 ns |   3.71 ns |   2.89 ns |    2 | 0.0648 |     280 B |
|    IterationWithDictionary |  3,051.0 ns |   5.13 ns |   4.55 ns |    3 | 0.0992 |     424 B |
|        ConversionToHashSet |  3,143.7 ns |  17.67 ns |  13.80 ns |    4 | 1.2131 |   5,080 B |
|              HashingMethod |  3,251.4 ns |  11.96 ns |  11.18 ns |    5 | 1.2131 |   5,080 B |
|         DistinctLINQMethod |  3,492.4 ns |  66.41 ns |  62.12 ns |    6 | 1.2283 |   5,144 B |
|            RecursiveMethod |  5,740.9 ns |  22.60 ns |  21.14 ns |    7 | 1.9302 |   8,088 B |
| GroupByAndSelectLINQMethod |  7,188.4 ns |  60.13 ns |  50.21 ns |    8 | 1.1826 |   4,976 B |
|       IterationAndShifting | 18,046.1 ns | 172.29 ns | 152.73 ns |    9 |      - |      40 B |

The result shows that IterationAndSwapping() has a better performance compared to all the other methods based on the average execution time. Also, all the iteration methods require less memory compared to the other methods. Based on the average execution time, IterationAndShifting() is the worst method by far. Similarly, RecursiveMethod() is the worst method based on memory allocation and garbage collection. 

Conclusion

Now not only do we know how to remove duplicate values from an array, but we also know the performance of each method.

Code Maze

View Comments

  • I was rather surprised to see IterationWithDictionary take the top spot. I would think it's essentially ConversionToHashSet with some extra overhead. But, anyway, it could be improved. Replacing the return line with just return dic.Keys.ToArray(); reduces it's time by about 10%, and memory by about 33%

    BUT, the biggest effect I was able to achieve was a simple change to IterationAndShifting which moved from a distant last place to way out in front -- it literally cut the time by over 97%!!

    Instead of shifting everything down, just replace the duplicate item with the last element of the array.

            if (arrayWithDuplicateValues[i] == arrayWithDuplicateValues[j])
            {
              size--;
              arrayWithDuplicateValues[j] = arrayWithDuplicateValues[size];
              j--;
            }
    

    My changes are at : jamescurran/CodeMazeGuides: The main repository for all the Code Maze guides (github.com)

    I've sent a pull request.

    • Forgot to include this

      |           Method |    Mean |   Error |  StdDev | Rank | Gen 0 | Gen 1 | Allocated |
      |--------------------------- |------------:|----------:|----------:|-----:|-------:|-------:|----------:|
      |    IterationAndSwapping |  539.9 ns |  9.92 ns |  9.28 ns |  1 | 0.0057 |   - |   40 B |
      | IterationWithDictionaryOpt | 2,264.3 ns | 15.23 ns | 12.72 ns |  2 | 0.0420 |   - |   280 B |
      |  IterationWithDictionary | 2,400.5 ns | 24.11 ns | 20.14 ns |  3 | 0.0648 |   - |   424 B |
      |    ConversionToHashSet | 2,679.4 ns | 52.43 ns | 81.63 ns |  4 | 0.8087 |   - |  5,080 B |
      |       HashingMethod | 2,682.5 ns | 46.32 ns | 41.06 ns |  4 | 0.8087 |   - |  5,080 B |
      |     DistinctLINQMethod | 2,757.4 ns | 54.44 ns | 106.18 ns |  4 | 0.8163 |   - |  5,144 B |
      | GroupByAndSelectLINQMethod | 5,623.3 ns | 100.15 ns | 102.85 ns |  5 | 0.7858 | 0.0076 |  4,976 B |
      |      RecursiveMethod | 6,356.3 ns | 112.21 ns | 149.80 ns |  6 | 1.2817 |   - |  8,088 B |
      |    IterationAndShifting | 19,631.9 ns | 316.15 ns | 280.26 ns |  7 |   - |   - |   40 B |
      

      (unfortunately, I can't find an option to make WordPress use a monospaced font.)

      • Thank you very much, James, for your suggestions. I've verified the PR, and it looks great. We will accept it and also modify this article a bit. Also, if you want, we always have open positions for authors (a paid position, remote), so if you are interested to share knowledge with our readers, everyone would benefit from that :)

        • I was thinking about it (I first saw the request for authors when I was trying to figure out how to log onto this site), before I posted my comment).
          But I'm not that good at generating original content (As you can tell from my blog which has six new articles in the last four years). I'm better at looking at other work and finding ways to improve it.

          • :) Okay. Thank you for the reply. Of course, if you change your mind, we are here.

Share
Published by
Code Maze

Recent Posts

HttpClient vs RestSharp – Which One to Use in .NET

HttpClient and RestSharp are HTTP Client libraries that we can use to consume APIs. Working…

Updated Date Jul 7, 2022

Testing Repository Pattern Using Entity Framework

Unit Testing is extremely important for creating robust software. It's very simple in principle but…

Updated Date Jul 6, 2022

Shell Sort in C#

Have you ever needed to sort a list of items, but didn't want to use…

Updated Date Jul 5, 2022

How to Resolve Instances With ASP.NET Core DI

In ASP.NET Core dependency injection, we usually register injectable dependencies at the start of our…

Jul 4, 2022

Ranges and Indices in C#

In this article, we are going to learn more about ranges and indices in C#,…

Updated Date Jul 2, 2022

Code Maze Weekly #128

Issue #128 of the Code Maze weekly. Check out what's new this week and enjoy…

Updated Date Jul 1, 2022