In this article, we will learn the basics of regular expressions in C# and how we can use them in our C# applications.
Let’s start.
What Is a Regular Expression in C#
Regular expressions (Regex) are character sequences that we can use to match patterns in strings.
Using regular expressions in C#, we can determine whether a pattern appears in a sample string and count how many times it does. We can easily retrieve all those occurrences. And, finally, we can perform string replacement operations based on them.
A common scenario in which regular expressions are often used is user input validation.
Matching String Patterns With Regular Expressions in C#
To start using regular expressions in C#, we need to include the System.Text.RegularExpressions namespace in our program.
We’re going to create an instance of the Regex class, passing our expression as the only constructor parameter. Once we have our Regex instance, we can use its methods to work with strings:
    var regex = new Regex("a");
    var isMatch = regex.IsMatch("Learn C# language");
    Assert.True(isMatch);
Here, we use a single "a" character as our regular expression. It will match any “a” character found in a string. Then, we use the IsMatch() method of the Regex class, passing our sample string as a parameter.
The IsMatch() method returns a boolean value indicating whether the pattern occurs anywhere in the sample string. In our example, it will return true since our sample string contains three "a" characters.
But what if we want to know exactly how many times our pattern appears in the sample string? In that case, we’ll call the Matches() method of the Regex class. It will return a collection of Match objects representing each occurrence:
    var regex = new Regex("a");
    var matches = regex.Matches("Learn C# language");
    Assert.Equal(3, matches.Count);
As we can see, our pattern was found three times.
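Each Match object in the collection also records where the occurrence was found. As a minimal sketch (the MatchPositions class and the FindPositions method are hypothetical names introduced only for illustration), we could collect the zero-based position of every occurrence like this:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchPositions
{
    // Returns the zero-based index at which each match of the pattern starts.
    public static List<int> FindPositions(string pattern, string input)
    {
        var positions = new List<int>();

        foreach (Match match in Regex.Matches(input, pattern))
        {
            // Match.Index is the position in the input where this match begins.
            positions.Add(match.Index);
        }

        return positions;
    }
}
```

For the sample string used above, MatchPositions.FindPositions("a", "Learn C# language") should return the positions 2, 10, and 14.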
Anchors
Anchors do not match any character. They assert the position of a match relative to a well-known location, for example, the beginning of the string.
On many occasions, we will need to know where our matching substring sits relative to the beginning or the end of the sample string. For that, we will use the most common regex anchors, ^ and $, which match the beginning and the end of the string, respectively:
    var regex = new Regex("^hello$");
    var isMatch = regex.IsMatch("hello");
    Assert.True(isMatch);
    isMatch = regex.IsMatch("hello world");
    Assert.False(isMatch);
Here, we define the regular expression ^hello$. It matches the literal string "hello" and, on top of that, the ^ and $ anchors ensure that the match starts at the beginning of the sample string and finishes at its end.
Therefore, our expression will match the string "hello" but won’t match "hello world" since, in the latter, the matching substring is not located right before the end of the string.
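To make the effect of the anchors easier to see, here is a small sketch that compares the same literal with and without them. The AnchorDemo class and its method names are hypothetical and exist only for this illustration:

```csharp
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // No anchors: the pattern may match anywhere inside the input.
    private static readonly Regex Unanchored = new Regex("hello");

    // ^ and $ tie the match to the beginning and the end of the input.
    private static readonly Regex Anchored = new Regex("^hello$");

    public static bool ContainsHello(string input) => Unanchored.IsMatch(input);

    public static bool IsExactlyHello(string input) => Anchored.IsMatch(input);
}
```

With this sketch, ContainsHello("hello world") returns true while IsExactlyHello("hello world") returns false. One detail worth keeping in mind is that, in .NET, $ also matches just before a final newline, so the anchored version accepts "hello\n" as well; the \z anchor is the stricter alternative when that matters.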
Character Classes
So far, our expressions only matched literal characters. Now let’s learn how to match previously unknown strings using character classes, sets, and ranges.
Character classes in regular expressions are special markers that match an entire category of characters:
Character Class | Description |
---|---|
. | Matches one character of any kind except a newline (\n) |
\d | Matches one digit from 0 to 9 |
\w | Matches one letter, digit, or underscore |
\s | Matches one whitespace character such as a tab, a space, or a new line |
\D | Matches one character that is not a digit |
\W | Matches one character that is not a letter, digit, or underscore |
\S | Matches one character that is not whitespace |
Let’s use the character class \d to match all the digits of a street number in a sample address:
    var regex = new Regex(@"\d");
    var matches = regex.Matches(@"9841 Shadow Way St Sunland, California(CA)");
    Assert.Equal(4, matches.Count);
    Assert.Equal("9", matches[0].Value);
    Assert.Equal("8", matches[1].Value);
    Assert.Equal("4", matches[2].Value);
    Assert.Equal("1", matches[3].Value);
Here, we see how our expression matches all the digits in our sample string but, since \d matches only one character, we get four separate matches, one for each digit.
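The other classes in the table behave the same way: each one matches a single character at a time. As a rough sketch, assuming a hypothetical CharacterClassDemo helper that is not part of the original example, we could count how many characters of a string fall into a given class:

```csharp
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
    // Counts how many single characters in the input match the given class.
    public static int Count(string characterClass, string input)
    {
        return Regex.Matches(input, characterClass).Count;
    }
}
```

For the input "Room 42, Floor 3", this helper should report 3 matches for \d, 3 for \s, 12 for \w, and 4 for \W (the three spaces plus the comma).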
Quantifiers
But what do we do if we want to match the entire street number? For that, we need to use regex quantifiers.
Quantifiers apply to the expression located immediately to their left and tell the engine to apply that expression a specific number of times:
    var regex = new Regex(@"\d+");
    var match = regex.Match(@"9841 Shadow Way St Sunland, California(CA)");
    Assert.Equal("9841", match.Value);
Here, we introduce the + quantifier, which matches an expression one or more times. This quantifier in combination with \d will match one or more contiguous digits. Consequently, it will catch the entire street number in our sample string in a single match.
This time, we used the Match() method of the Regex class, which scans the sample string until it finds the first match and returns it, ignoring the rest.
The following table lists the most common quantifiers:
Quantifier | Description |
---|---|
? | Match zero or one time |
* | Match zero or more times |
+ | Match one or more times |
{n} | Match exactly n times, where n is a number |
{n,} | Match n or more times, where n is a number |
{n,m} | Match between n and m times, where both n and m are numbers |
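To illustrate two more quantifiers from the table, the following sketch combines ? and {n,m} with the anchors we saw earlier. The QuantifierDemo class, its method names, and the sample patterns are assumptions made up for this illustration:

```csharp
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // "u?" makes the letter "u" optional, so both spellings are accepted.
    private static readonly Regex Colour = new Regex("^colou?r$");

    // "\d{2,4}" requires at least two and at most four digits.
    private static readonly Regex TwoToFourDigits = new Regex(@"^\d{2,4}$");

    public static bool IsColour(string input) => Colour.IsMatch(input);

    public static bool HasTwoToFourDigits(string input) => TwoToFourDigits.IsMatch(input);
}
```

Here, IsColour() accepts both "color" and "colour" but rejects "colouur", while HasTwoToFourDigits() accepts "42" and "2024" but rejects "7" and "12345".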
Sets, Ranges, and Exclusions
A set in regex is a group of two or more characters or character classes that will match a character as long as it matches any of the items in the group. To denote a set, we use square brackets. For example, [ae\d] is a set that will match either a letter "a", a letter "e", or any digit \d. Let’s see a set in action:
    var regex = new Regex("[sg]old");
    var matches = regex.Matches("sold cold gold");
    Assert.Equal(2, matches.Count);
    Assert.Equal("sold", matches[0].Value);
    Assert.Equal("gold", matches[1].Value);
In this example, we define the expression [sg]old, whose set [sg] includes both the "s" and "g" characters. It will match the words "sold" and "gold" but not the word "cold".
Ranges, on the other hand, use the square bracket notation to represent a set of items by specifying the first and last characters of the sequence separated by a hyphen:
    var regex = new Regex(@"[a-d]+");
    var match = regex.Match(@"abcdefghi");
    Assert.Equal("abcd", match.Value);
We can add the character ^ right after the opening square bracket to tell the engine to match any character except those in the set or range:
    var regex = new Regex("[^a-d]+");
    var match = regex.Match("abcdefghi");
    Assert.Equal("efghi", match.Value);
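Several ranges can also be combined inside a single pair of square brackets. As a sketch of how this might be used for input validation, the hypothetical HexColorValidator class below (a name invented for this example) checks whether a string looks like a six-digit hexadecimal color:

```csharp
using System.Text.RegularExpressions;

public static class HexColorValidator
{
    // "#" followed by exactly six characters, each one a digit
    // or a letter from a to f in either case.
    private static readonly Regex HexColor = new Regex("^#[0-9a-fA-F]{6}$");

    public static bool IsHexColor(string input) => HexColor.IsMatch(input);
}
```

With this pattern, IsHexColor("#1a2B3c") returns true, whereas "#12345" (too short), "#12345g" (a letter outside the range), and "123456" (no leading #) are all rejected.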
Extracting Text With Capture Groups
We can define capture groups by enclosing a section of our regular expression in parentheses. After a call to Regex.Match(), the Groups property of the Match object will contain information about the defined groups:
    var regex = new Regex(@"^(\w+)\.(\w{3})$");
    var match = regex.Match("test_file_name.txt");
    Assert.Equal("test_file_name", match.Groups[1].Value);
    Assert.Equal("txt", match.Groups[2].Value);
Here, we have created a regular expression that matches Windows file names with a dot and a three-character extension at the end. Using parentheses, we define two capture groups: one for the file name alone and another for the extension.
Using the Groups property of the Match object, we can access a zero-based collection of group results. Bear in mind that index zero is always used for the full regex match. We can find the results for the groups we defined starting at index one.
As we can see, we were able to extract the file name and extension separately, leaving behind the separating dot.
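Numeric indexes can become hard to follow as an expression grows, so .NET also lets us name capture groups with the (?<name>...) syntax and read them by name. The sketch below rewrites the same file name expression with named groups; the FileNameParser class and the TryParse method are hypothetical names used only for illustration:

```csharp
using System.Text.RegularExpressions;

public static class FileNameParser
{
    // (?<name>...) and (?<extension>...) are named capture groups.
    private static readonly Regex FileName =
        new Regex(@"^(?<name>\w+)\.(?<extension>\w{3})$");

    // Returns false when the input does not look like "name.ext".
    public static bool TryParse(string input, out string name, out string extension)
    {
        var match = FileName.Match(input);

        // Groups can be read by name instead of by numeric index.
        name = match.Success ? match.Groups["name"].Value : string.Empty;
        extension = match.Success ? match.Groups["extension"].Value : string.Empty;

        return match.Success;
    }
}
```

Calling FileNameParser.TryParse("test_file_name.txt", out var name, out var extension) returns true and yields "test_file_name" and "txt", the same values we extracted by index above.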
Replacing Text Using Regular Expressions in C#
We can use the Regex.Replace() method to replace all matches of our expression in a sample string with some fixed string:
    var regex = new Regex(@"</?\w+>");
    var result = regex.Replace("<b>Hello, <i>world</i></b>", string.Empty);
    Assert.Equal("Hello, world", result);
This time, our task is to remove all the HTML tags from the input text. For that, we build an expression that matches a literal opening angle bracket <, then an optional forward slash /?, one or more word characters \w+ and, finally, a literal closing angle bracket >.
Now, we can call the Replace() method on any sample string containing HTML tags and pass string.Empty as the second parameter. Since our regex matches every simple HTML tag in the text (those without attributes), the resulting string will have all of them removed.
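The replacement does not have to be a fixed string. In .NET, the replacement text can refer back to capture groups with $1, $2, and so on. As a minimal sketch, assuming a hypothetical DateReformatter class and a year-month-day input format chosen for this example, we could reorder the parts of a date:

```csharp
using System.Text.RegularExpressions;

public static class DateReformatter
{
    // Three capture groups: the year, the month, and the day of the date.
    private static readonly Regex IsoDate = new Regex(@"(\d{4})-(\d{2})-(\d{2})");

    // In the replacement string, $1, $2, and $3 refer to the captured groups.
    public static string ToDayMonthYear(string input)
    {
        return IsoDate.Replace(input, "$3/$2/$1");
    }
}
```

For instance, DateReformatter.ToDayMonthYear("Released on 2023-04-15.") returns "Released on 15/04/2023.", and any text that does not match the expression is left untouched.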
Conclusion
In this article, we have learned how to build simple regular expressions in C# using character classes, anchors, and quantifiers. At the same time, we used methods and properties in the Regex class to create C# code that works with regular expressions.
The concepts we discussed will help us deal with many common string parsing scenarios like user input validation.