This article will explore the quickest method to find and extract a number from a string in C#. We will discuss and implement different techniques for accomplishing this task. Subsequently, we will evaluate the performance of these techniques using the BenchmarkDotNet library.
Let’s dive in.
Find and Extract a Number From a String Using Regular Expressions
Regular expressions are a powerful tool for searching and parsing text. .NET 7 introduced a regular expression source generator, which gives us the benefits of compiled regular expressions without the cost of compiling them at runtime. Since our focus in this article is performance, we will take advantage of the source generator to improve the performance of our regex method:
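```csharp
// Requires: using System.Globalization; using System.Text.RegularExpressions;
// The containing class must be declared partial for the source generator.
[GeneratedRegex(@"-?\d+(\.\d+)?")]
private static partial Regex NumberRegex();

public static string ExtractNumberUsingRegEx(string inputString)
{
    var extractedNumbers = new List<double>();

    foreach (Match match in NumberRegex().Matches(inputString))
    {
        // Parse with the invariant culture so "." is always the decimal separator.
        if (double.TryParse(match.Value, NumberStyles.Float, CultureInfo.InvariantCulture, out var parsedNumber))
        {
            extractedNumbers.Add(parsedNumber);
        }
    }

    return string.Join(",", extractedNumbers);
}
```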
Here, the [GeneratedRegex] attribute tells the source generator to emit the implementation of the partial NumberRegex() method at compile time. The pattern, -?\d+(\.\d+)?, matches integers and decimal numbers with an optional leading minus sign. We then use the Regex.Matches() method to locate every occurrence of the pattern in the input string, which gives us a MatchCollection. We attempt to parse each match and return the valid numbers joined by commas.
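To make the behavior of the pattern concrete, here is a minimal sketch that applies it to two sample inputs. It assumes a plain Regex instance instead of the source-generated one so that it runs on its own, and ExtractWithRegex is a hypothetical helper name introduced only for this illustration:

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

// Same pattern as above, but as a plain Regex instance (an assumption made
// so the snippet is self-contained and needs no partial class).
var numberRegex = new Regex(@"-?\d+(\.\d+)?");

Console.WriteLine(ExtractWithRegex("Order 66 shipped at -12.5 degrees for 19.99 USD"));
// prints: 66,-12.5,19.99

// Caveat: a hyphen directly in front of digits is read as a minus sign.
Console.WriteLine(ExtractWithRegex("2024-05-17"));
// prints: 2024,-5,-17

// Hypothetical helper: parses every match and joins the results with commas.
string ExtractWithRegex(string input) =>
    string.Join(",", numberRegex.Matches(input)
        .Select(m => double.Parse(m.Value, CultureInfo.InvariantCulture)
                           .ToString(CultureInfo.InvariantCulture)));
```

The second call shows that the optional minus sign can misread separators such as the hyphens in a date, which is worth keeping in mind when choosing a pattern for real input.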
Using LINQ to Find and Extract a Number From a String
Another approach to extracting numbers from a string combines the char.IsBetween() method with LINQ (Language Integrated Query). Instead of writing an explicit loop that checks each character of the input string, we can express the same filter more concisely with LINQ and char.IsBetween(). This gives us a compact way to strip non-numeric characters from the input string:
```csharp
public static string ExtractNumbersUsingLinq(string inputString)
{
    return string.Join(",",
        new string(inputString
            .Where(c => char.IsBetween(c, '0', '9') || c == '.' || c == '-' || char.IsWhiteSpace(c))
            .ToArray())
        .Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries));
}
```
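We use the Where() extension method with a predicate built on char.IsBetween() to keep only digits, decimal points, minus signs, and whitespace. After filtering, we convert the remaining characters back into a string, split it on whitespace, and join the resulting tokens with commas. This approach removes the explicit loop over each character, resulting in cleaner and more concise code. Note, however, that it only filters characters and never validates the resulting tokens with double.TryParse().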
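The two steps of the pipeline are easier to follow when they are separated. The sketch below mirrors the filtering logic shown above; FilterAndSplit is a hypothetical helper name, and the sample inputs are assumptions chosen to show both the expected case and two edge cases:

```csharp
using System;
using System.Linq;

Console.WriteLine(FilterAndSplit("Temp: -4.5C, wind 12 km/h"));
// prints: -4.5,12

// Characters are only filtered, never validated, so digits separated by
// letters merge and stray punctuation survives as its own token.
Console.WriteLine(FilterAndSplit("a1b2"));            // prints: 12
Console.WriteLine(FilterAndSplit("Total: 5. Done.")); // prints: 5.,.

// Hypothetical helper mirroring the LINQ pipeline step by step.
static string FilterAndSplit(string input)
{
    // Step 1: keep digits, '.', '-' and whitespace; drop everything else.
    var kept = new string(input
        .Where(c => char.IsBetween(c, '0', '9') || c == '.' || c == '-' || char.IsWhiteSpace(c))
        .ToArray());

    // Step 2: split on whitespace and discard empty tokens.
    var tokens = kept.Split((char[]?)null, StringSplitOptions.RemoveEmptyEntries);

    return string.Join(",", tokens);
}
```

If the input may contain such edge cases, one option is to pass each token through double.TryParse() before joining, at the cost of some extra work per token.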
Please note that although char.IsDigit() is also an option, we deliberately avoid it. For characters in the Latin-1 range, it simply checks whether the character falls between 0 and 9. For every other character, however, it returns true for any Unicode character classified as a decimal digit, which can cause unexpected problems in applications that only expect the ASCII digits 0 to 9.
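The difference shows up as soon as a non-ASCII digit appears in the input. This minimal sketch compares the two checks; IsAsciiDigit is a hypothetical helper name that simply wraps the char.IsBetween() range check used in this article:

```csharp
using System;

// U+0663 ARABIC-INDIC DIGIT THREE is a Unicode decimal digit, but not ASCII.
const char arabicIndicThree = '\u0663';

Console.WriteLine(char.IsDigit('3'));              // prints: True
Console.WriteLine(char.IsDigit(arabicIndicThree)); // prints: True
Console.WriteLine(IsAsciiDigit('3'));              // prints: True
Console.WriteLine(IsAsciiDigit(arabicIndicThree)); // prints: False

// Hypothetical helper: true only for the ASCII digits '0' through '9'.
static bool IsAsciiDigit(char c) => char.IsBetween(c, '0', '9');
```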
Finding and Extracting a Number Using StringBuilder
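To achieve better performance, especially when handling large strings or a significant number of string operations, we can utilize StringBuilder. Because a StringBuilder appends to a reusable internal buffer instead of allocating a new string for every concatenation, this approach reduces memory overhead and processing time.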
First, let’s define an AddNumberToList() method that attempts to parse a number and, if it’s valid, adds it to a list:
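```csharp
// Requires: using System.Globalization; using System.Runtime.CompilerServices;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static void AddNumberToList(ReadOnlySpan<char> numberSpan, List<double> numbers)
{
    // Only spans that form a valid number are added; anything else is skipped.
    if (double.TryParse(numberSpan, NumberStyles.Any, CultureInfo.InvariantCulture, out var number))
    {
        numbers.Add(number);
    }
}
```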
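The validity check matters because a run of "numeric" characters is not necessarily a number. As a rough illustration, the following sketch calls the same double.TryParse() overload on a few candidate strings; TryParseInvariant is a hypothetical helper name, and the inputs are assumptions picked for the example:

```csharp
using System;
using System.Globalization;

Console.WriteLine(TryParseInvariant("-12.5", out var value));    // prints: True
Console.WriteLine(value.ToString(CultureInfo.InvariantCulture)); // prints: -12.5

// Runs made only of digits, '.' and '-' can still fail to parse.
Console.WriteLine(TryParseInvariant("-", out _));     // prints: False
Console.WriteLine(TryParseInvariant("1.2.3", out _)); // prints: False
Console.WriteLine(TryParseInvariant("--7", out _));   // prints: False

// Hypothetical helper using the same TryParse overload as AddNumberToList().
static bool TryParseInvariant(ReadOnlySpan<char> text, out double number) =>
    double.TryParse(text, NumberStyles.Any, CultureInfo.InvariantCulture, out number);
```

Candidates such as a lone hyphen are therefore dropped rather than added to the list.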
Then let’s create a method that searches a given string for numbers and extracts them using StringBuilder:
```csharp
public static string ExtractNumberUsingStringBuilder(string inputString)
{
    var numbers = new List<double>();
    var currentNumber = new StringBuilder();
    var isInsideNumber = false;

    foreach (var c in inputString)
    {
        if (char.IsBetween(c, '0', '9') || c == '.' || c == '-')
        {
            // Still inside a candidate number: keep accumulating characters.
            currentNumber.Append(c);
            isInsideNumber = true;
        }
        else if (isInsideNumber)
        {
            // The candidate just ended: try to parse it, then reset the buffer.
            AddNumberToList(currentNumber.ToString(), numbers);
            currentNumber.Clear();
            isInsideNumber = false;
        }
    }

    // Handle a candidate that runs to the end of the input.
    if (currentNumber.Length > 0)
    {
        AddNumberToList(currentNumber.ToString(), numbers);
    }

    return string.Join(",", numbers);
}
```
Inside the loop, we use char.IsBetween() to check whether each character is a digit, and we also accept the decimal point and the minus sign. If a character qualifies, we append it to the StringBuilder. When we reach a character that cannot be part of a number, we pass the text accumulated in the StringBuilder to AddNumberToList(), which adds it to the numbers list if it parses successfully, and then we clear the StringBuilder. After the loop, we handle any number that runs to the end of the string. Finally, the extracted numbers are joined with commas and returned.
Using Span and SearchValues to Find and Extract a Number From a String
A modern approach to string parsing uses Span<T>, or more precisely ReadOnlySpan<char>, for improved performance and memory efficiency. A span provides direct access to the underlying memory of a string, so we can slice it without additional allocations.
First, we’ll define a SearchValues<char> of valid numerical characters including digits, minus sign, and decimal point:
```csharp
private static readonly SearchValues<char> NumericSearchValues =
    SearchValues.Create(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '-', '.']);
```
SearchValues<T>, which was added in .NET 8, is specifically optimized for repeated searches, so we use it here to search our input string:
```csharp
public static string ExtractNumberUsingSpan(string inputString)
{
    var numbers = new List<double>();
    var inputStringSpan = inputString.AsSpan();

    while (true)
    {
        // Find the start of the next run of numeric characters.
        var startIndex = inputStringSpan.IndexOfAny(NumericSearchValues);
        if (startIndex == -1)
            break;

        inputStringSpan = inputStringSpan[startIndex..];

        // Find where that run ends.
        var endIndex = inputStringSpan.IndexOfAnyExcept(NumericSearchValues);
        if (endIndex == -1)
        {
            // The run extends to the end of the input.
            AddNumberToList(inputStringSpan, numbers);
            break;
        }

        AddNumberToList(inputStringSpan[..endIndex], numbers);

        // Continue searching after the run we just consumed.
        inputStringSpan = inputStringSpan[endIndex..];
    }

    return string.Join(",", numbers);
}
```
The method iterates over the input as a ReadOnlySpan<char>, identifying and extracting numbers. It uses IndexOfAny() to find the next character that is a valid numeric character and IndexOfAnyExcept() to find the next character that is not. Throughout the operation, we repeatedly slice the span, both to extract each number and to shrink the search space for the next iteration. Every extracted number is added to a list, which we finally join with commas and return.
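To see what the two search calls return on a concrete input, consider the minimal sketch below. It performs a single iteration of the loop described above; FirstNumericRun is a hypothetical helper name, and the character set is assumed to match NumericSearchValues:

```csharp
using System;
using System.Buffers;

Console.WriteLine(FirstNumericRun("abc-12.5xyz"));               // prints: -12.5
Console.WriteLine(FirstNumericRun("no digits") == string.Empty); // prints: True

// Hypothetical helper: returns the first run of numeric characters, or "" if none.
static string FirstNumericRun(string input)
{
    // Assumption: the same character set as NumericSearchValues.
    // Created per call only to keep the sketch short; cache it in real code.
    var numericChars = SearchValues.Create("0123456789-.");

    ReadOnlySpan<char> span = input;

    // Index of the first character that IS in the set: 3 for "abc-12.5xyz".
    var start = span.IndexOfAny(numericChars);
    if (start == -1) return string.Empty;

    // Slicing creates a new view over the same memory; nothing is copied.
    span = span[start..];

    // Index of the first character that is NOT in the set: 5 for "-12.5xyz".
    var end = span.IndexOfAnyExcept(numericChars);

    return (end == -1 ? span : span[..end]).ToString();
}
```

The only allocation for the result happens in the final ToString() call; the slices themselves are just views over the original string.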
Performance Comparison
Let’s use BenchmarkDotNet to benchmark and compare the different methods discussed. Benchmarking empowers developers to choose the most suitable method for a given scenario, balancing readability and performance:
| Method                                | Mean     | Error    | StdDev   | Median   | Ratio | Gen0   | Allocated |
|-------------------------------------- |---------:|---------:|---------:|---------:|------:|-------:|----------:|
| ExtractNumberUsingLinqMethod          | 456.1 ns |  9.16 ns | 23.64 ns | 448.0 ns |  1.00 | 0.3633 |     760 B |
| ExtractNumberUsingSpanMethod          | 508.6 ns |  9.98 ns |  9.80 ns | 505.5 ns |  1.09 | 0.0839 |     176 B |
| ExtractNumberUsingStringBuilderMethod | 594.4 ns | 11.70 ns | 13.93 ns | 588.0 ns |  1.29 | 0.1755 |     368 B |
| ExtractNumberUsingRegExMethod         | 719.8 ns | 14.21 ns | 24.51 ns | 708.2 ns |  1.58 | 0.4358 |     912 B |
From the results, we observe that ExtractNumberUsingRegExMethod is the slowest, taking approximately 1.6 times as long as our baseline, ExtractNumberUsingLinqMethod. ExtractNumberUsingStringBuilderMethod takes approximately 1.3 times as long as the baseline. Lastly, ExtractNumberUsingSpanMethod is only slightly slower (about 1.1 times), but it allocates the least memory: 176 B compared with 760 B for the baseline.
Ease of Use
In addition to performance comparison, another crucial factor is the ease of use for developers. We’ve outlined several key factors from the usability perspective for each method covered above.
Using Regex requires some familiarity with regular expressions, which can be intimidating for beginners. With the source generator, however, we only need to declare the pattern on a partial method. The generated code handles the pattern matching and extraction, hiding the complexity of the Regex implementation.
LINQ is intuitive and easy to understand, making it suitable for developers of all levels. It provides a straightforward approach to filtering numeric characters from a string.
Using StringBuilder and char.IsBetween() requires a basic understanding of string manipulation but is easy to implement. It offers a balance between performance and simplicity, making it suitable for a wide range of scenarios.
Finally, using Span in connection with SearchValues<T> for string parsing offers improved performance and memory efficiency without sacrificing ease of use. It provides direct access to the underlying memory of a string, allowing for efficient pattern matching and extraction.