In this article, we are going to learn how to find all the positions of a substring in another string.
So we are trying to find, for example, all the occurrences of the word ‘the’ in a sentence:
“Some consider the occurrences to be the unconscious mind’s attempts to communicate to the conscious mind.”
Searching Substrings to Find All the Positions of a Substring
With the empty project, we can now develop different algorithms and test their speed.
In the course of this article, we will develop six algorithms. First, we will develop an algorithm using the IndexOf()
method. Next, we will involve searching when finding substrings, so we can utilize regular expressions, specifically the Match()
and Matches()
methods from the RegEx
class.
To get a deeper understanding of substring searching, we’ll develop the whole algorithm from scratch by using the String
class. After developing our brute-force algorithm, we’ll discuss a few well-known algorithms for searching text, using one of them. Finally, with .NET 9 on the horizon, we’ll explore a new class that will help us with substring searching.
So, let us start with the first one.
Easy Algorithm Using the IndexOf() Method
Until now, we have not focused solely on searching within a string. We have previously covered the usage of essential methods within a String class. The article shows how to use the method IndexOf()
, which returns the index of the first occurrence of a substring within another string:
public static List<int> FindAll(string text, string searchText) { List<int> positions = []; var index = text.IndexOf(searchText); while (index != -1) { positions.Add(index); index = text.IndexOf(searchText, index + 1); } return positions; }
The FindAll()
method is straightforward. It loops through the text, starting with the last found occurrence and putting each found position into the list.
Issues With This Easy Implementation
If we test this method on our text sentence, it will find all three occurrences of the word the
, so what could be the problem?
In this particular example, there is no problem, but what if we search for the the
word in the following sentence: “The quick brown fox jumps over the lazy dog?”
Our method would find only one occurrence of the word the
even though there are two. So, one issue is the case-sensitive vs. case-insensitive search.
The second issue is not that obvious, and it concerns optimization. Well, the issue is within this line:
index = text.IndexOf(searchText, index + 1);
When we find the substring, we try to find the next occurrence from the next character. But if we have seen the the
word, we can skip the following 3 characters, not only one. So we can do something like that:
index = text.IndexOf(searchText, index + 3);
Or, more generally:
index = text.IndexOf(searchText, index + searchText.Length);
Optimization Is the Root of All Evil
This is the famous quote of a legendary computer scientist, Donald Knuth. And indeed, with our “optimization”, we have introduced a bug into our code!
We find all occurrences of the word the
in our test sentence, but what about the following case?
What if we try to find the substring AA
in the AAAA
text? How many are there?
Our first method would find a substring AA
on indexes 0, 1, and 2, which is the correct answer. However, the second version would see only two occurrences on indexes 0 and 2. This is because, on index 0, we jump straight to index 2 and skip one of the correct answers!
We are not quoting Donald Knuth in vain, as near the end of the article, we will discover one of the algorithms for solving our problem. The algorithm’s name is the acronym KMP, which stands for the Knuth–Morris–Pratt. So Donald Knuth has solved this particular problem we are trying to solve.
Interface for ‘Search All Occurrences of a String’
We have seen that for each algorithm, we will have to provide 3 elements:
internal interface ISearcher { bool CaseSensitive { get; } bool SkipWholeFoundText { get; } List<int> FindAll(string text, string searchText); }
The algorithm needs to implement three things. It must consider whether the search is case-sensitive or insensitive using the property CaseSensitive
.
It must also consider whether we want to optimize our algorithm – which can, in exceptional cases, produce incorrect results. This behavior can be configured by the property SkipWholeFoundText
.
Finally, it must provide the FindAll()
method itself.
Base Class for All the Algorithms
Every algorithm must expose two fields so we can write a base class for our algorithms:
public abstract class SearchBase { protected string SearchText { get; private set; } = string.Empty; protected int SearchTextLength { get; private set; } = 0; protected int SkipSize { get; private set; } = 1; public bool CaseSensitive { get; private set; } public bool SkipWholeFoundText { get; private set; } protected SearchBase(string searchText, bool skipWholeFoundText, bool caseSensitive) { SearchText = searchText; SkipWholeFoundText = skipWholeFoundText; CaseSensitive = caseSensitive; SearchTextLength = SearchText.Length; SkipSize = SkipWholeFoundText ? SearchText.Length : 1; } protected static bool AreEqualCharactersSensitive(char a, char b) => a == b; protected static bool AreEqualCharactersInsensitive(char a, char b) => char.ToUpperInvariant(a) == char.ToUpperInvariant(b); }
Firstly, we define an abstract class SeachBase
which contains a protected constructor to initialize our search implementation. Additionally, we expose the properties CaseSensitive
, SkipWholeFoundText
and additional helper properties SearchText
, SearchTextLength
and SkipSize
. Furthermore, we use the SkipWholeFoundText
property to calculate how far we can jump and store it in the property SkipSize
.
SearchUsingIndexOf Algorithm Revisited
With this new interface and two fields, we have to rewrite our first algorithm:
public class SearchUsingIndexOf : SearchBase, ISearcher { public SearchUsingIndexOf(string searchText, bool skipWholeFoundText, bool caseSensitive) : base(searchText, skipWholeFoundText, caseSensitive) { } public List FindAll(string text) { var spanText = text.AsSpan(); var spanSearch = SearchText.AsSpan(); List positions = []; var offset = 0; var index = spanText.IndexOf(spanSearch, StringComparison); while (index != -1) { positions.Add(index + offset); offset += index + SkipSize; spanText = spanText[(index + SkipSize)..]; index = spanText.IndexOf(spanSearch, StringComparison); } return positions; } private StringComparison StringComparison => CaseSensitive ? StringComparison.Ordinal : StringComparison.OrdinalIgnoreCase; }
Even though the code seems different than the first one, the algorithm’s essence remains the same. But mostly, the changes in the code are because we have optimized our algorithm by converting the stings into ReadOnlySpan<T>
before working on them.
The IndexOf()
method already solves the problem of case-sensitive or case-insensitive search; all we have to do is provide the correct arguments.
The argument for controlling the case sensitivity is in the StringComparison
enumeration, so we provide the StringComparison
property based on CaseSensitive
.
Note: In order to keep the code as short as possible, we do not test the input parameters for null and empty string in any of the presented algorithms!
Using Regular Expression to Find All the Positions of a Substring
When dealing with strings, we can always rely on regular expressions.
Using Match() Method of the RegEx Class
For searching a substring inside a string, the regular expression library has a Match()
method that will return the index of the first found occurrence, just like the IndexOf()
method for string
.
This algorithm can benefit by creating the regular expression in advance so we are redefining the Initialzation()
method.
So, instead of a String
class, we are dealing with a RegEx
class, and instead of the IndexOf()
method, we are dealing with the Match()
method. All the rest is pretty much the same, so the algorithm is the same:
public class SearchUsingRegexWithMatch : SearchBase, ISearcher { private readonly Regex _regex; public SearchUsingRegexWithMatch(string searchText, bool skipWholeFoundText, bool caseSensitive) : base(searchText, skipWholeFoundText, caseSensitive) { _regex = new Regex(SearchText, StringComparison); } public List FindAll(string text) { List positions = []; Match match = _regex.Match(text); while (match.Success) { positions.Add(match.Index); match = _regex.Match(text, match.Index + SkipSize); } return positions; } private RegexOptions StringComparison => CaseSensitive ? RegexOptions.None : RegexOptions.IgnoreCase; }
Regular expressions solve the case sensitivity with the help of the RegexOptions
enumeration. In the Regex algorithm, we cannot set the property SkipWholeFoundText
to false and therefore we throw a NotSupportedException
.
Using Matches() Method of the RegEx Class
Apart from the String
class, which has no method for searching all substrings at once, the RegEx
class does. Regular expressions have a Matches()
method that will return all the occurrences of a substring within a string.
So maybe this is the method we are searching for, and perhaps this is a “one-liner” that will solve our problems?
So, all we have to do is call the Matches()
method:
public class SearchUsingRegexWithMatches : SearchBase, ISearcher { private readonly Regex _regex; public SearchUsingRegexWithMatches(string searchText, bool skipWholeFoundText, bool caseSensitive) : base(searchText, skipWholeFoundText, caseSensitive) { _regex = new Regex(SearchText, CaseSensitive ? RegexOptions.None : RegexOptions.IgnoreCase); } public List FindAll(string text) { if (!SkipWholeFoundText) throw new NotSupportedException("For SearchUsingRegexWithMatches SkipWholeFoundText should be set to true"); return _regex .Matches(text) .Select(match => match.Index) .ToList(); } }
Well, indeed, this is almost a “one-line” method, except it is written in four lines, making it easier to read.
Where is the problem, then? Luckily, we have thought of this before and can test that this method lacks support for special cases where substrings are within substrings. Remember the problem with AAAA
and AA
? This solution would find only two occurrences!
Positive Lookahead
Truthfully, there is an option to find all the occurrences by changing the search string itself. We must tell the regular expression searcher we want to do a ‘positive lookahead’ search. With that change, the string we are searching for would not be AA
but (?=AA)
.
There is yet another consideration with regular expressions. A few characters in regular expressions have a special meaning. So, if we had searched for the string ‘the.’, this would not mean that we are searching for the word ‘the’ followed by a period. It would mean that we are searching for the word ‘the’ followed by any character.
We could remedy all this by preparing the search string, but this is beyond the scope of this article.
As expected and as we will see when benchmarking the algorithm, using regular expression can’t be as efficient as the algorithm that uses only strings. This is reasonable and expected. Regular expressions are so powerful, but here we are using them only for searching strings, which doesn’t make sense.
Algorithms to Find All the Positions of a Substring
We have solutions using .NET classes, but this algorithm is so straightforward that we can write it ourselves.
Brute-Force Algorithm
All we have to do is to traverse the text from beginning to end. If we find the first character of the string we are searching for, we then check the rest of the searched string:
public class SearchUsingBruteForceAlgorithm : SearchBase, ISearcher { public SearchUsingBruteForceAlgorithm(string searchText, bool skipWholeFoundText, bool caseSensitive) : base(searchText, skipWholeFoundText, caseSensitive) { } public List FindAll(string text) { List positions = []; var textSpan = text.AsSpan(); var searchTextSpan = SearchText.AsSpan(); var textLength = textSpan.Length; Func<char, char, bool> AreEqualCharacters = CaseSensitive ? AreEqualCharactersSensitive : AreEqualCharactersInsensitive; int searchPosition = 0; for (var textPosition = 0; textPosition <= textLength - SearchTextLength; textPosition++) { for (searchPosition = 0; searchPosition < SearchTextLength; searchPosition++) { if (!AreEqualCharacters(textSpan[textPosition + searchPosition], searchTextSpan[searchPosition])) break; } if (searchPosition == SearchTextLength) { positions.Add(textPosition); textPosition += SkipSize - 1; } } return positions; } }
In the outer for loop, we do not have to go to the end of the text but to the index on textLength - SearchTextLength
. This is because if we are searching for a string with a length of 12 characters and only 11 characters left, we can stop. The string can’t be in these 11 characters!
The inner loop searches through the searched string, and as soon as the different character is found, we can break it as we have not seen the string.
But if we have reached the end of the searched string, we have found it and can add the index to the results. After that, we can jump to the next character or skip the length of the whole string. Here, we have to use ‘SkipSize—1
‘ because we are in a for loop, and the loop will advance the textPosition
!
And this is pretty much the whole algorithm.
Other Well-Known Algorithms to Find All the Positions of a Substring
As we can imagine, this kind of search is essential as one of the steps for other algorithms. So great minds have thought about that and come up with different algorithms and optimizations, and here we will name a few:
- Knuth-Morris-Pratt (KMP) Algorithm: Utilizes a precomputed “partial match” table (also known as the “failure function”) to skip sections of the text.
- Boyer-Moore Algorithm: Uses two heuristics (a bad character and a good suffix) to skip text sections.
- Rabin-Karp Algorithm: Utilizes a hashing technique to find the substring. It computes a hash of the pattern and compares it with hashes of text substrings.
- Aho-Corasick Algorithm: Constructs a finite state machine from a set of patterns, simultaneously enabling the search for multiple patterns.
Knuth-Morris-Pratt Algorithm to Find All the Positions of a Substring
For comparison with previously developed algorithms, we have coded the Knuth-Morris-Pratt (KMP) Algorithm in C#, and you can find the implementation in the code samples.
The algorithm’s essence is the same as our “brute-force” algorithm, but there is a significant difference. In our algorithm, if we find a partial match, we move to the next character, but the KMP algorithm precalculates how far he can move.
The vital part of the KMP algorithm is creating a precalculated table of movements if substrings are found. So, the KMP algorithm does not need information on whether it should skip one character, a whole word, or something in between. It calculates this in advance and skips an optimum number of characters, so it always returns the correct result.
But this also means that the KMP algorithm loses time at the beginning when calculating this table of possible jumps.
.NET 9.0 to Find All the Positions of a Substring
.NET 8.0 introduced the SearchValues<char>
class, optimized for searching a list of characters in a string.
Using the SearchValues<char>
class, we can search for more than one character in a string simultaneously.
SearchUsingSearchValues Algorithm
.NET 9.0 has gone a step further, and now we can search for more than one string in another string simultaneously.
In .NET 9.0, we can search for the first occurrence of words like [‘Julius
,’ ‘Caesar
,’ ‘Roman
,’ ‘General
‘] in the book Julius Caesar.
We can use that and create an algorithm for searching all substrings using the SearchValues<string>
class:
public class SearchUsingSearchValues : SearchBase, ISearcher { private readonly SearchValues _searchValues; public SearchUsingSearchValues(string searchText, bool skipWholeFoundText, bool caseSensitive) : base(searchText, skipWholeFoundText, caseSensitive) { _searchValues = SearchValues.Create([SearchText], StringComparison); } public List FindAll(string text) { var spanText = text.AsSpan(); List positions = []; var offset = 0; var index = spanText.IndexOfAny(_searchValues); while (index != -1) { positions.Add(index + offset); offset += index + SkipSize; spanText = spanText[(index + SkipSize)..]; index = spanText.IndexOfAny(_searchValues); } return positions; } private StringComparison StringComparison => CaseSensitive ? StringComparison.Ordinal : StringComparison.OrdinalIgnoreCase; }
The algorithm is practically the same as the SearchUsingIndexOf
logical algorithm. We benefit from the fact that we can calculate the search values in the constructor. To use the SearchValues
class, we have to use Span
, so the algorithms are practically the same.
The Problem of Comparing Two Characters to Find All the Positions of a Substring
Our algorithms have not touched the problem of comparing two characters. We are just comparing the characters on equality or converting each character to an upper character and comparing them. But this can be wrong.
The reason is localization/globalization, which are different parts of the same coin. In today’s world, we can’t compare single characters when comparing strings. We are not using just 8 bits for characters as we have used back in the day. Even 2 bytes, the size of a char type in C#, is not enough.
This is not the article where we would dive deep into this kind of problem, but there are great web pages where we can start digging Character encoding in .NET, and How to compare strings in C#. We can learn about code points, surrogate pairs, Rune API, etc.
However, the most crucial information is that comparing two characters in a string is unsafe, as many languages’ rules are much more difficult. So it is best to leave such compartments operating system and, in our case, to .NET runtime.
Implementing that in our algorithm is difficult, but using the IndexOf()
method or SearchValues
method, we can use the enum StringComparison
and proper language-defined character comparisons.
Benchmarks to Find All the Positions of a Substring
As mentioned in the previous chapter, we will use simple character comparisons so algorithms behave the same and we can benchmark them.
Benchmark Setup
But to be fair, we will test these methods on short sentences and the book Julius Caesar, whose text is on the web: Download Shakespeare’s Plays, Sonnets, and Poems.
So, to test different string lengths, we will conduct four different searches. We will:
- (index 0) search ‘the’ inside ‘Some consider the occurrences to be the unconscious mind’s attempts to communicate to the conscious mind.’,
- (index 1) search ‘the’ inside ‘The quick brown fox jumps over the lazy dog?’,
- (index 2) search ‘ABABABABABAB’ inside ‘ABABABABABABABABABABABABABABABABABABABABABABABABABABABABABAB’,
- (index 3) and search ‘Themselves’ inside the whole book Julius Caesar.
Benchmark Results
The results of the benchmark can be challenging to understand, but let’s take a look:
| Method | Index | Mean | Error | StdDev | Ratio | RatioSD | |--------------------------------- |------ |--------------:|-------------:|-------------:|------:|--------:| | BenchmarkSearchUsingIndexOf | 0 | 48.91 ns | 0.992 ns | 1.061 ns | 1.00 | 0.00 | | BenchmarkSearchUsingSearchValues | 0 | 35.41 ns | 0.755 ns | 1.059 ns | 0.73 | 0.03 | | BenchmarkSearchUsingRegexMatch | 0 | 301.53 ns | 6.059 ns | 9.785 ns | 6.07 | 0.24 | | BenchmarkSearchUsingBruteForce | 0 | 371.58 ns | 7.303 ns | 13.537 ns | 7.53 | 0.29 | | BenchmarkSearchUsingKMP | 0 | 314.40 ns | 4.952 ns | 4.632 ns | 6.45 | 0.22 | | | | | | | | | | BenchmarkSearchUsingIndexOf | 1 | 31.15 ns | 0.697 ns | 2.045 ns | 1.00 | 0.00 | | BenchmarkSearchUsingSearchValues | 1 | 22.83 ns | 0.497 ns | 0.958 ns | 0.71 | 0.04 | | BenchmarkSearchUsingRegexMatch | 1 | 126.33 ns | 2.523 ns | 3.853 ns | 3.86 | 0.32 | | BenchmarkSearchUsingBruteForce | 1 | 159.43 ns | 3.176 ns | 5.479 ns | 4.96 | 0.47 | | BenchmarkSearchUsingKMP | 1 | 123.03 ns | 1.123 ns | 0.877 ns | 3.64 | 0.11 | | | | | | | | | | BenchmarkSearchUsingIndexOf | 2 | 300.61 ns | 6.054 ns | 8.488 ns | 1.00 | 0.00 | | BenchmarkSearchUsingSearchValues | 2 | 290.81 ns | 4.867 ns | 4.314 ns | 0.96 | 0.03 | | BenchmarkSearchUsingRegexMatch | 2 | 2,519.86 ns | 50.242 ns | 124.187 ns | 8.36 | 0.59 | | BenchmarkSearchUsingBruteForce | 2 | 987.37 ns | 19.202 ns | 22.113 ns | 3.28 | 0.10 | | BenchmarkSearchUsingKMP | 2 | 245.00 ns | 3.779 ns | 3.535 ns | 0.81 | 0.03 | | | | | | | | | | BenchmarkSearchUsingIndexOf | 3 | 5,132.30 ns | 85.983 ns | 114.785 ns | 1.00 | 0.00 | | BenchmarkSearchUsingSearchValues | 3 | 5,864.47 ns | 112.210 ns | 99.472 ns | 1.13 | 0.03 | | BenchmarkSearchUsingRegexMatch | 3 | 5,112.77 ns | 53.642 ns | 47.552 ns | 0.99 | 0.02 | | BenchmarkSearchUsingBruteForce | 3 | 325,646.96 ns | 6,415.395 ns | 7,637.072 ns | 63.38 | 2.01 | | BenchmarkSearchUsingKMP | 3 | 273,515.17 ns | 5,335.450 ns | 5,930.339 ns | 53.20 | 1.55 |
The KMP algorithm is the best in a specific case where we search for a pattern inside similar patterns. This is because it calculates exactly how many characters it can advance.
But the search for the word Themselves
is deliberate here, as there are many the
words in the English text. So, the brute force algorithm goes backward a lot, as Themselves
and the
share the same three characters!
The KMP algorithm is best when we search different text for the exact keywords, as then we can precalculate the jumps only once.
On average, using a brute-force or KMP algorithm is very slow. This is probably because of inefficient memory access in a high-level .NET environment.
Regular expression methods are expected to be slower than the IndexOf()
solution, as regular expressions are more powerful. However, we still use them to search for a string.
The SearchValues
algorithm and the IndexOf()
algorithm are always quickest, but we have extracted the initialization phase of the SearchValues
algorithm, which would spoil the results. So, the SearchValues
algorithm would be most suitable if we search for the exact string in many texts. Also, we must remember that with SearchValues
, we can search for more than one string simultaneously, but we are using it for only one string!
The ‘ABABABABABAB’ string is deliberately selected just to show the power of the KMP algorithm in exceptional cases.
The most exciting part of the article is the entirely unexpected results of the benchmarks. One would expect that the KMP algorithm would be the best as it is optimized and was developed by great minds. However, this algorithm is unsuitable for a .NET environment as we do not have direct memory access, unlike in lower computer languages like C/C++ or Rust.
Benchmark Results for Regex Classes
We discovered that the SearchUsingRegexWithMatches
couldn’t find a few valid results as it skipped all the words found. So, to fairly measure this algorithm, we have compared it with the SearchUsingRegexWithMatch
, where both skip the characters of the searched string:
| Method | Index | Mean | Error | StdDev | Ratio | RatioSD | |--------------------------------- |------ |-----------:|---------:|----------:|------:|--------:| | BenchmarkSearchUsingRegexMatch | 0 | 318.3 ns | 6.38 ns | 11.67 ns | 1.00 | 0.00 | | BenchmarkSearchUsingRegexMatches | 0 | 370.4 ns | 7.38 ns | 14.22 ns | 1.17 | 0.06 | | | | | | | | | | BenchmarkSearchUsingRegexMatch | 1 | 128.5 ns | 2.59 ns | 4.93 ns | 1.00 | 0.00 | | BenchmarkSearchUsingRegexMatches | 1 | 186.7 ns | 3.77 ns | 9.73 ns | 1.46 | 0.09 | | | | | | | | | | BenchmarkSearchUsingRegexMatch | 2 | 2,585.4 ns | 50.34 ns | 79.85 ns | 1.00 | 0.00 | | BenchmarkSearchUsingRegexMatches | 2 | 621.2 ns | 12.35 ns | 30.05 ns | 0.24 | 0.01 | | | | | | | | | | BenchmarkSearchUsingRegexMatch | 3 | 5,177.1 ns | 91.40 ns | 85.50 ns | 1.00 | 0.00 | | BenchmarkSearchUsingRegexMatches | 3 | 5,187.3 ns | 97.83 ns | 100.46 ns | 1.00 | 0.03 |
We can see that the SearchUsingRegexWithMatches
algorithm is faster, so if we know that we can skip the whole words or prepare the search expression (Positive Lookahead) we can use it instead of SearchUsingRegexWithMatch
.
Conclusion
In conclusion, we would advise using the IndexOf() method and maybe StringComparison.CurrentCulture / StringComparison.CurrentCultureIgnoreCase. We have not benchmarked this as other algorithms do not support it. It is quite a lot slower, but it may be the right tool in some scenarios.
If we search for more than one string, we use the SearchUsingSearchValues algorithm.
When searching for an exact string, use the IndexOf Algorithm. If searching for regular expressions, use the RegEx algorithm. Searching for more than one string at a time, use the SearchValues algorithm.