Introduction to Regular Expressions in C#

In this article, we will learn the basics of regular expressions in C# and how we can use them in our C# applications.

To download the source code for this article, you can visit our GitHub repository.

Let’s start.

What Is a Regular Expression in C#?

Regular expressions (Regex) are character sequences that we can use to match patterns in strings.

Using regular expressions in C#, we can determine whether a pattern appears in a sample string, count how many times it does, and retrieve every occurrence. We can also perform string replacement operations based on those matches.

A common use of regular expressions is user input validation.

Matching String Patterns With Regular Expressions in C#

To start using regular expressions in C#, we need to import the System.Text.RegularExpressions namespace in our program with a using directive.

We’re going to create an instance of the Regex class passing our expression as the only constructor parameter. Once we have our Regex instance we can use its methods to work with strings:

var regex = new Regex("a");
var isMatch = regex.IsMatch("Learn C# language");

Assert.True(isMatch);

Here, we use a single "a" character as our regular expression. It will match any "a" character found in a string. Then, we use the IsMatch() method of the Regex class, passing our sample string as a parameter.

The IsMatch() method returns a boolean value indicating whether the sample string matches the regex pattern. In our example, it will return true since our sample string contains three "a" characters.

But, what if we want to know exactly how many times our pattern appears in the sample string? In that case, we’ll be calling the Matches() method of the Regex class. It will return a collection of Match objects representing each occurrence:

var regex = new Regex("a");
var matches = regex.Matches("Learn C# language");

Assert.Equal(3, matches.Count);

As we can see, our pattern has been found 3 times.
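
As a minimal sketch of what the Matches() result contains, the snippet below lists every match together with its position. It assumes a console program with top-level statements (C# 9 or later) instead of the xUnit tests used in this article. Value and Index are properties of the Match class, and the static Regex.IsMatch() overload is shown as an alternative for one-off checks:

```csharp
using System;
using System.Text.RegularExpressions;

var regex = new Regex("a");
var sample = "Learn C# language";

// Static overload: no Regex instance is needed for a one-off check.
bool found = Regex.IsMatch(sample, "a");

// Each Match exposes the matched text (Value) and its zero-based position (Index).
MatchCollection matches = regex.Matches(sample);
foreach (Match m in matches)
{
    Console.WriteLine($"'{m.Value}' at index {m.Index}");
}
```

For this sample string, the three matches sit at indexes 2, 10, and 14.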

Anchors

Anchors do not match any character. They assert the position of a match relative to a well-known location, for example, the beginning of the string.

On many occasions, we will need to know where our matching substring sits relative to the beginning or the end of the sample string. For that, we use the two most common regex anchors, ^ and $, which match the beginning and the end of the string respectively:

var regex = new Regex("^hello$");
var isMatch = regex.IsMatch("hello");
Assert.True(isMatch);

isMatch = regex.IsMatch("hello world");
Assert.False(isMatch);

Here, we define a regular expression ^hello$. It matches the literal string "hello" and, on top of that, we use the ^ and $ anchors to ensure that our match will be located right after the beginning and right before the end of a sample string.

Therefore, our expression will match the string "hello" but won’t match "hello world" since, in the latter, the matching substring is not located right before the end of the string.
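
To see the effect of each anchor in isolation, here is a small sketch that runs the same sample string against patterns with no anchor, one anchor, and both anchors. It uses the static Regex.IsMatch() overload and top-level statements (both are our choice for brevity). The last two lines show a caveat worth knowing for input validation: in .NET, $ also matches just before a final newline, while the \z anchor only matches at the very end of the string:

```csharp
using System;
using System.Text.RegularExpressions;

var sample = "hello world";

// Without anchors, the pattern may match anywhere in the string.
bool unanchored = Regex.IsMatch(sample, "hello");      // true
// ^ alone only pins the start; $ alone only pins the end.
bool startsWith = Regex.IsMatch(sample, "^hello");     // true
bool endsWith = Regex.IsMatch(sample, "world$");       // true
// Both anchors together require the whole string to be "hello".
bool whole = Regex.IsMatch(sample, "^hello$");         // false

// Caveat: $ also matches right before a trailing newline, \z does not.
bool dollarWithNewline = Regex.IsMatch("hello\n", "^hello$");   // true
bool strictEnd = Regex.IsMatch("hello\n", @"^hello\z");         // false

Console.WriteLine($"{unanchored} {startsWith} {endsWith} {whole} {dollarWithNewline} {strictEnd}");
```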

Character Classes

So far, our expressions only matched literal characters. Now let’s learn how to match previously unknown strings using character classes, sets, and ranges.

Character classes in regular expressions are special markers that match an entire category of characters:

Character Class    Description
.                  Matches one character of any kind except a newline (\n)
\d                 Matches one decimal digit (0 to 9, plus other Unicode decimal digits by default)
\w                 Matches one letter, digit, or underscore
\s                 Matches one whitespace character, for example, a tab, a space, or a new line
\D                 Matches one character that is not a digit
\W                 Matches one character that is not a letter, digit, or underscore
\S                 Matches one character that is not whitespace

Let’s use the character class \d to match all the digits of a street number in a sample address:

var regex = new Regex(@"\d");
var matches = regex.Matches(@"9841 Shadow Way St
                            Sunland, California(CA)");

Assert.Equal(4, matches.Count);
Assert.Equal("9", matches[0].Value);
Assert.Equal("8", matches[1].Value);
Assert.Equal("4", matches[2].Value);
Assert.Equal("1", matches[3].Value);

Here, we see how our expression matches all digits in our sample string. However, since \d matches only one character, we get four separate matches, one per digit.
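
The other character classes behave the same way, one character per match. As a quick sketch, the snippet below counts the matches of each class against a short, made-up sample string. The Count() helper is our own local function, not part of the Regex API:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: letter, digit, space, letter, underscore, digit.
var sample = "A1 b_2";

// Local helper (ours, not a Regex method): counts the matches of a pattern.
int Count(string pattern) => Regex.Matches(sample, pattern).Count;

int digits = Count(@"\d");      // 1 and 2
int wordChars = Count(@"\w");   // A, 1, b, _ and 2
int spaces = Count(@"\s");      // the single space
int nonDigits = Count(@"\D");   // A, space, b and _
int nonWord = Count(@"\W");     // the single space
int nonSpace = Count(@"\S");    // everything except the space
int anyChar = Count(".");       // all six characters

Console.WriteLine($"{digits} {wordChars} {spaces} {nonDigits} {nonWord} {nonSpace} {anyChar}");
```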

Quantifiers

But what do we do if we want to match the entire street number? For that, we need to use regex quantifiers.

Quantifiers apply to the expression located immediately to their left and tell the engine how many times that expression must match:

var regex = new Regex(@"\d+");
var match = regex.Match(@"9841 Shadow Way St
                            Sunland, California(CA)");

Assert.Equal("9841", match.Value);

Here, we introduce the + quantifier, which matches an expression one or more times. In combination with \d, it matches one or more contiguous digits. Consequently, it captures the entire street number in our sample string in a single match.

This time, we used the Match() method of the Regex class, which scans the sample string until it finds the first match and returns it, ignoring the rest.

Quantifier    Description
?             Matches zero or one time
*             Matches zero or more times
+             Matches one or more times
{n}           Matches exactly n times, where n is a number
{n,}          Matches n or more times, where n is a number
{n,m}         Matches between n and m times, where n and m are numbers
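
To make the remaining quantifiers more concrete, here is a short sketch that applies each of them to a made-up sample string. The patterns and inputs are ours, chosen only for illustration. Note that quantifiers are greedy by default, so {2,4} takes as many digits as it is allowed to:

```csharp
using System;
using System.Text.RegularExpressions;

// ? makes the preceding "u" optional, so both spellings match.
int spellings = Regex.Matches("color colour", "colou?r").Count;   // 2

// * allows zero or more "b" characters after the "a".
MatchCollection abMatches = Regex.Matches("a ab abbb", "ab*");    // "a", "ab", "abbb"

// {n}, {n,} and {n,m} bound the number of repetitions.
string exactlyThree = Regex.Match("12345", @"\d{3}").Value;       // "123"
string twoOrMore = Regex.Match("12345", @"\d{2,}").Value;         // "12345"
string twoToFour = Regex.Match("12345", @"\d{2,4}").Value;        // "1234"

Console.WriteLine($"{spellings} {abMatches.Count} {exactlyThree} {twoOrMore} {twoToFour}");
```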

Sets, Ranges, and Exclusions

A set in regex is a group of characters or character classes that matches a single character as long as that character matches any of the items in the group. To denote a set, we use square brackets. For example, [ae\d] is a set that will match either a letter "a", a letter "e", or any digit \d:

var regex = new Regex("[sg]old");
var matches = regex.Matches("sold cold gold");

Assert.Equal(2, matches.Count);
Assert.Equal("sold", matches[0].Value);
Assert.Equal("gold", matches[1].Value);

In this example, we define the expression [sg]old, whose set [sg] includes both "s" and "g" characters. It will match the words "sold" and "gold" but not the word "cold".

Ranges, on the other hand, use the square bracket notation to represent a set of items by specifying the first and last characters of the sequence separated by a hyphen:

var regex = new Regex(@"[a-d]+");
var match = regex.Match(@"abcdefghi");

Assert.Equal("abcd", match.Value);

We can add the character ^ as the first character inside the square brackets, before the set or range, to tell the engine to match any character except those in the set or range:

var regex = new Regex("[^a-d]+");
var match = regex.Match("abcdefghi");

Assert.Equal("efghi", match.Value);
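
Sets and ranges can be mixed inside the same pair of brackets. The following sketch shows a few combinations. The sample strings and the hex color rule are hypothetical and only serve to illustrate the syntax:

```csharp
using System;
using System.Text.RegularExpressions;

// Several ranges can share one set: letters and digits, but not "_" or "!".
MatchCollection parts = Regex.Matches("user_42!", "[A-Za-z0-9]+");     // "user", "42"

// A hyphen placed last in the set is a literal hyphen, not a range.
string hyphenated = Regex.Match("well-known fact", "[a-z-]+").Value;   // "well-known"

// Hypothetical validation rule: "#" followed by exactly six hex digits.
var hexColor = new Regex("^#[0-9a-fA-F]{6}$");
bool valid = hexColor.IsMatch("#1a2B3c");     // true
bool invalid = hexColor.IsMatch("#12345g");   // false: "g" is outside a-f

Console.WriteLine($"{parts.Count} {hyphenated} {valid} {invalid}");
```

The last pattern combines anchors, ranges, and a quantifier, which is the typical shape of a validation expression.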

Extracting Text With Capture Groups

We can define capture groups by enclosing a section of our regular expression in parentheses. After a call to Regex.Match(), the Groups property of the Match object will contain information about defined groups:

var regex = new Regex(@"^(\w+)\.(\w{3})$");
var match = regex.Match("test_file_name.txt");

Assert.Equal("test_file_name", match.Groups[1].Value);
Assert.Equal("txt", match.Groups[2].Value);

Here, we have created a regular expression that matches simple file names: one or more word characters, a dot, and a three-character extension at the end. Using parentheses, we define two capture groups: one for the file name alone and another for the extension.

Using the Groups property of the Match object, we can access a zero-based collection of group results. Bear in mind that index zero is always used for the full regex match. We can find the results for the groups we defined starting at index one.

As we can see, we were able to extract the file name and extension separately, leaving behind the separating dot.
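
Groups can also be given names with the (?<name>...) syntax, which tends to read better than numeric indexes once a pattern has several groups. Below is a minimal sketch of the same file name pattern with named groups. The group names "name" and "ext" are our own choice, and the check of match.Success guards against reading groups from a failed match:

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as above, with the hypothetical group names "name" and "ext".
var regex = new Regex(@"^(?<name>\w+)\.(?<ext>\w{3})$");

var match = regex.Match("test_file_name.txt");
if (match.Success)
{
    Console.WriteLine(match.Groups[0].Value);        // the full match
    Console.WriteLine(match.Groups["name"].Value);   // the file name
    Console.WriteLine(match.Groups["ext"].Value);    // the extension
}

// This input has no dot and extension, so the pattern does not match.
var noMatch = regex.Match("README");
Console.WriteLine(noMatch.Success);
```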

Replacing Text Using Regular Expressions in C#

We can use the Regex.Replace() method to replace all matches of our expression in a sample string with some fixed string:

var regex = new Regex(@"</?\w+>");
var result = regex.Replace("<b>Hello, <i>world</i></b>", string.Empty);

Assert.Equal("Hello, world", result);

This time, our task is to remove all the HTML tags from the input text. For that, we build an expression that matches a literal opening angle bracket <, then an optional forward slash /?, then one or more letters, digits, or underscores \w+ and, finally, a literal closing angle bracket >.

Now, we can call the Replace() method on a sample string containing HTML tags and pass string.Empty as the second parameter. Since our regex matches every tag in this sample (simple tags without attributes), the resulting string will have all of them removed.
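
Two details are worth a quick sketch. First, the pattern above does not match tags that carry attributes; the broader pattern below is a hypothetical variant of ours that also consumes everything up to the closing bracket (it is still not a substitute for a real HTML parser). Second, the replacement string does not have to be fixed: $1 and $2 are substitutions that insert the text captured by groups 1 and 2:

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern from the example skips the opening tag because it has an attribute.
string partial = Regex.Replace("<a href=\"x\">link</a>", @"</?\w+>", string.Empty);
// partial is: <a href="x">link

// Hypothetical broader pattern: [^>]* also consumes anything before the closing bracket.
string stripped = Regex.Replace("<a href=\"x\">link</a>", @"</?\w+[^>]*>", string.Empty);
// stripped is: link

// Substitutions: $1 and $2 insert the text captured by groups 1 and 2.
string swapped = Regex.Replace("test_file_name.txt", @"^(\w+)\.(\w{3})$", "$2:$1");
// swapped is: txt:test_file_name

Console.WriteLine(partial);
Console.WriteLine(stripped);
Console.WriteLine(swapped);
```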

Conclusion

In this article, we have learned how to build simple regular expressions in C# using character classes, anchors, and quantifiers. At the same time, we used methods and properties in the Regex class to create C# code that works with regular expressions.

The concepts we discussed will help us deal with many common string parsing scenarios like user input validation.

Published by Code Maze
