Using Trie Class for Efficient Text Pattern Searching in C#

In this article, we will explore the Trie algorithm and how to use the Trie class in C# for efficient text pattern searching. The naive approach of searching through the text character by character can be inefficient, especially for longer patterns and larger texts. The Trie algorithm allows for an efficient approach.

To download the source code for this article, you can visit our GitHub repository.

Let’s begin.

What Is a Trie?

A Trie (pronounced “try”) is a tree-like data structure that is often used to store and retrieve strings quickly.

Support Code Maze on Patreon to get rid of ads and get the best discounts on our products!

At its core, a trie is made up of nodes that represent individual characters in a string. The root node represents the empty string, and each child node represents a different character that can follow the string represented by its parent.

One advantage of a trie is its ability to determine rapidly whether a given string is present in the set of strings it represents. A string is represented by a unique path through the trie, making it easy to determine its presence or absence.

Another advantage of a trie is its ability to find all strings that have a given prefix quickly. All strings sharing a common prefix have the same path through the trie up to the prefix’s endpoint.

Let’s take “cat”, and “car” as an example. When storing in a trie, the root node represents the empty string. The first character in both the strings is “c”, so we create a child node under the root node to represent this character. Next, we create a new node that represents the character “a”. Finally, we create two child nodes from the “a” node to represent the characters “t” and “r”.

Now, this makes it easy for us to determine if a string e.g. “can” is present in the given strings. Also, finding all strings with a “ca” prefix is a matter of counting the child nodes for the “a” node.

Implement a Trie Class in C#

Let’s implement the Trie data structure. A trie is a collection of nodes, so let’s start by creating a TrieNode class:

public class TrieNode
{
    public bool IsWord { get; set; }
    public Dictionary<char, TrieNode> Children { get; } = new Dictionary<char, TrieNode>();
}

Here, the TrieNode class stores a boolean flag indicating whether the node represents the end of a word in the IsWord property, as well as a dictionary of child nodes.

Now, let’s create a Trie class that stores a reference to the root node and provides methods for inserting and searching words:

public class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void AddWord(string word)
    {
        var node = _root;

        foreach (char c in word)
        {
            if (!node.Children.ContainsKey(c))
                node.Children[c] = new TrieNode();

            node = node.Children[c];
        }

        node.IsWord = true;
    }

    public bool Search(string word)
    {
        var node = _root;

        foreach (char c in word)
        {
            if (!node.Children.ContainsKey(c))
                return false;

            node = node.Children[c];
        }

        return node.IsWord;
    }
}

Here, we have two methods, AddWord(), for inserting a new word, and Search(), for searching a word within the trie.

To insert a word, we start at the root node and iterate through each character in the word. For each character, we check if there is a child node representing that character. If not, we create a new child node and add it to the dictionary of children. Then, we move to the child node and repeat the process for the next character. Once we reach the end of the word, we set the IsWord property on the final node to true.

To search for a word, we start at the root node and iterate through each character in the word. For each character, we check if there is a child node representing that character. If not, we return false to indicate that the word is not in the trie. If we reach the end of the word and the IsWord property on the final node is true, we return true to indicate that the word is in the trie.

Now, that we are done with the creation of classes, let’s look at how to use this Trie class for pattern searching.

Use the Trie Class for Pattern Searching

To search for a pattern, we need to create the Trie data structure and insert words into it:

var trie = new Trie();
var text = "The quick brown fox jumps over the lazy dog";
            
foreach (var word in text.Split(' '))
{
    trie.AddWord(word);
}

var expected = true;
var actual = trie.Search("quick");

Assert.Equal(expected, actual);

Here, we first create a Trie instance and split the text into words using the Split() method. Then, we add each word to the Trie using the AddWord() method.

Next, we define a pattern that we want to search for in the text. Finally, we search the Trie for the pattern using the Search() method, and verify it exists.

Trie Class vs .NET Collections

.NET collections like List, Dictionary, and HashSet are very useful for storing and retrieving data, but when it comes to text pattern searching, they may not be the best choice.

The main reason why we should favor Trie over .NET collections is that Trie is specifically designed for text pattern searching. It provides a way to efficiently store and retrieve strings based on their prefixes, which makes it ideal for text search applications.

.NET collections, on the other hand, are general-purpose data structures that are not specifically designed for text pattern searching. While we can use them for text search applications, they may not be as efficient as Trie.

Let’s implement pattern searching using the .NET collections and understand how Trie has an advantage over them.

Pattern Searching With HashSet<T>

A HashSet is a collection that allows for constant-time lookup, insertion, and deletion of elements:

public static bool Search(string text, string pattern)
{
    if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(pattern))
    {
        return false;
    }

    var patternSet = new HashSet<string> { pattern };
    var patternLength = pattern.Length;

    for (int i = 0; i <= text.Length - patternLength; i++)
    {
        var subString = text.Substring(i, patternLength);

        if (patternSet.Contains(subString))
        {
            return true;
        }
    }

    return false;
}

Here, we define a Search() method that takes a text string and a pattern string and returns a bool value indicating whether the pattern occurs in the text.

If we find a match between a substring and any of the patterns in the HashSet<string>, we return true. If we have iterated through all substrings and have not found any matches, we return false.

HashSet<T> is a good choice if we only need to determine whether a pattern appears in a text and don’t need to know its location or frequency, as it provides constant-time lookup.

List<T> Pattern Searching

The List class in C# is a dynamic collection that allows us to add or remove elements at runtime. It is useful for storing and manipulating collections of data.

We can implement a similar search method to a HashSet:

public static bool Search(string text, string pattern)
{
    if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(pattern) || text.Length < pattern.Length)
    { 
        return false; 
    }

    var substrings = new List<string>();
    for (int i = 0; i <= text.Length - pattern.Length; i++)
    {
        substrings.Add(text.Substring(i, pattern.Length));
    }

    return substrings.Contains(pattern);
            
}

First, we create an empty list of substrings. Then, we iterate through each substring of the text string that is the same length as the pattern and add it to the list of substrings.

Finally, we check whether the list of substrings contains the pattern string. If it does, we return true, otherwise, we return false.

A List<T> is a preferred choice in scenarios where the order of the elements is important, or where we want to preserve the duplicate elements. For example, if we search for a pattern in a text and want to return all occurrences of the pattern.

Pattern Searching Using SortedList<TKey, TValue>

The SortedList class in C# is a collection that stores key/value pairs in ascending order by key. It is useful for maintaining sorted data and allows for fast searching of values by key.

Let’s see how to implement search functionality with a SortedList:

public static bool Search(string text, string pattern)
{
    if (string.IsNullOrEmpty(text) || string.IsNullOrEmpty(pattern) || text.Length < pattern.Length)
    {
        return false;
    }

    var substrings = new SortedList<string, bool>();
    for (int i = 0; i <= text.Length - pattern.Length; i++)
    {
        substrings[text.Substring(i, pattern.Length)] = true;
    }

    return substrings.ContainsKey(pattern);
}

This example is very similar to our implementation using the List<T> class. The only difference is that we use the SortedList<string, bool> instead of List<string>.

A SortedList<TKey, TValue> is preferred in scenarios where we need to store and access elements in sorted order based on their key values. For example, a collection of items with a unique key value where we need to access these items based on their key values in sorted order.

Now that we have all our candidates for pattern searching implemented, let’s look at how they perform compared to each other.

Performance Comparison

Let’s create a PatternSearchBenchmark class to compare the performance of all the patterns searching implementations using BenchmarkDotNet:

public class PatternSearchBenchmark
{
    private const string _text = "abcdefghijklmnopqrstuvwxyz";
    private const string _pattern = "cd";
    private const int _iterations = 1000000;
}

Using this class, we will perform the pattern-searching operation multiple times with all the different implementations we saw earlier.

Next, in the same class, we will execute each of the previous search methods. You can find the entire implementation here.

The PatternSearchBenchmark class runs all the pattern-searching methods 1000000 times and compares their performance:

var summary = BenchmarkRunner.Run<PatternSearchBenchmark>();
Console.WriteLine(summary);

Let’s assess the performance results:

|           Method |         Mean |      Error |     StdDev |
|----------------- |-------------:|-----------:|-----------:|
|       TrieSearch |     5.298 ms |  0.0428 ms |  0.0401 ms |
|    HashSetSearch |    76.801 ms |  1.5299 ms |  2.1447 ms |
|       ListSearch |   338.358 ms |  6.6243 ms |  8.3776 ms |
| SortedListSearch | 4,222.082 ms | 12.6473 ms | 11.8303 ms |

It’s clear that searching using Trie was by far the fastest method in terms of execution time, with a mean time of only 5.298 ms.

In .NET collections, searching using HashSet<T> performed best and can be a good choice where duplicates need to be avoided. List<T> performed worse than HashSet in terms of mean execution speed.

On the other hand, searching using SortedList<T> was the slowest method, with a mean execution time of over 4 seconds.

Conclusion

In this article, we explored the Trie algorithm for efficient text pattern searching. Also, we looked at how to implement the Trie class in C# for this purpose. We learned about other .NET collections that we can use to search patterns in a given text and the advantage of using Trie over them.

Liked it? Take a second to support Code Maze on Patreon and get the ad free reading experience!

Leave a reply Cancel reply

Courses – Code Maze

Ad 1

Ad 2

Ad 3

Ad 4