HTML has played a pivotal role in web development since its inception, serving as the foundational markup language for creating and structuring web pages. As developers working with web-related data, we frequently encounter HTML content embedded within strings. These strings can originate from diverse sources such as user input, database entries, or external APIs. Processing this data often requires handling, and sometimes removing, the HTML tags. In this article, we explore several approaches to removing HTML tags from a string in C#.
With that, let’s dive in.
Using Regular Expressions to Remove HTML Tags
A widely used and simple technique for removing HTML tags from a string involves the application of regular expressions. Regular expressions offer a succinct and robust approach to pattern matching, facilitating the straightforward identification and removal of HTML tags within a given string.
Let’s see how we use regular expressions:
public static string UseRegularExpression(string input)
{
    var result = Regex.Replace(input, "<.*?>", string.Empty);

    return result;
}
Here, we use the Replace() method from the System.Text.RegularExpressions.Regex class. The method has several overloads; we use the one that takes three parameters: the input text, the pattern to search for, and the replacement value for each match. The <.*?> pattern matches a < character and then, lazily, as few characters as possible up to the next > character, so each tag is matched on its own and replaced with an empty string.
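To see why the ? in <.*?> matters, here is a minimal sketch that compares the lazy pattern with its greedy counterpart <.*>. The TagPatternDemo class and its method names are hypothetical and exist only for this illustration:

```csharp
using System.Text.RegularExpressions;

public static class TagPatternDemo
{
    // Lazy quantifier: every match stops at the first '>' after a '<',
    // so each tag is removed individually.
    public static string StripLazy(string input) =>
        Regex.Replace(input, "<.*?>", string.Empty);

    // Greedy quantifier: one match runs from the first '<' to the last '>'
    // on the line, so the text between the tags is removed as well.
    public static string StripGreedy(string input) =>
        Regex.Replace(input, "<.*>", string.Empty);
}
```

For the input "<b>bold</b> and <i>italic</i>", StripLazy() returns "bold and italic", while StripGreedy() returns an empty string. Note that, by default, the dot does not match a newline, so a tag that spans several lines is not matched by either pattern unless we pass RegexOptions.Singleline.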
We can now call this method:
const string HTML_TEXT = "<p>This is <b>bold</b> text with <a href='#'>HTML</a> tags.&nbsp;5 &lt; 7</p>";

Console.WriteLine("RegularExpression: {0}", HtmlTagRemover.UseRegularExpression(HTML_TEXT));
Here, we define the HTML_TEXT constant as the input value. After that, we call the UseRegularExpression() method and write the output to the console.
Let’s now inspect the result:
RegularExpression: This is bold text with HTML tags.&nbsp;5 &lt; 7
Here, the HTML tags such as <p>, <a>, and <b> are gone, yet encoded values like &lt; and &nbsp; persist in the output. In the next section, we present a solution to decode these entities.
Use WebUtility.HtmlDecode
The HtmlDecode() method of the System.Net.WebUtility class decodes HTML-encoded strings. It is commonly used to reverse the encoding applied by the HtmlEncode() method or to handle user input containing HTML-encoded characters, like &lt; for < and &amp; for &. It is not designed to remove HTML tags from a string, but it is useful for decoding HTML-encoded characters back to their original form.
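As a quick illustration of this encode and decode relationship, the following sketch wraps the two WebUtility calls. The EntityDemo class is a hypothetical helper written only for this example:

```csharp
using System.Net;

public static class EntityDemo
{
    // Turns characters that are special in HTML into entities,
    // for example '<' into "&lt;" and '&' into "&amp;".
    public static string Encode(string text) => WebUtility.HtmlEncode(text);

    // Turns entities back into the characters they stand for.
    public static string Decode(string text) => WebUtility.HtmlDecode(text);
}
```

Calling Encode("5 < 7 & 8 > 3") gives "5 &lt; 7 &amp; 8 &gt; 3", and Decode() reverses it. One detail worth keeping in mind is that &nbsp; decodes to the non-breaking space character (U+00A0). It looks like a regular space on the console, but it does not compare equal to " ".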
Let’s use the HtmlDecode() method together with regular expressions to remove HTML tags:
public static string UseHtmlDecode(string input)
{
    var result = UseRegularExpression(input);
    result = WebUtility.HtmlDecode(result);

    return result;
}
Here, we create a UseHtmlDecode() method. Within this method, we call the previously created UseRegularExpression() method to remove the HTML tags. Then, we call the System.Net.WebUtility.HtmlDecode() method to decode the remaining entities.
Let’s call this method:
Console.WriteLine("HtmlDecode: {0}", HtmlTagRemover.UseHtmlDecode(HTML_TEXT));
When we check the output again, we see that the encoded values are now decoded:
HtmlDecode: This is bold text with HTML tags. 5 < 7
Use HtmlAgilityPack to Remove HTML Tags
HtmlAgilityPack is a popular NuGet package. It provides a more structured approach to HTML parsing and allows us to navigate and manipulate HTML documents. With these capabilities, we can extract the text of a document without its HTML tags.
To start using HTML Agility Pack, we can install it using NuGet Package Manager:
PM> Install-Package HtmlAgilityPack
After completing the installation, we can parse an HTML string:
public static string UseHtmlAgilityPack(string input)
{
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(input);

    var result = htmlDoc.DocumentNode.InnerText;
    result = WebUtility.HtmlDecode(result);

    return result;
}
Here, we create an instance of the HtmlDocument class. After that, we call the LoadHtml() method and pass our input HTML to it. Then, we use the htmlDoc.DocumentNode.InnerText property to get the inner text, which does not contain any HTML tags. Again, to get rid of encoded values like &nbsp; and &lt;, we call the WebUtility.HtmlDecode() method.
Let’s call the UseHtmlAgilityPack() method:
Console.WriteLine("HtmlAgilityPack: {0}", HtmlTagRemover.UseHtmlAgilityPack(HTML_TEXT));
Once again, let’s check the result:
HtmlAgilityPack: This is bold text with HTML tags. 5 < 7
Using AngleSharp Package to Remove HTML Tags
AngleSharp is a robust HTML parsing library that follows the HTML5 specification. It provides a user-friendly approach to querying and manipulating HTML documents. If you wish to familiarize yourself with the AngleSharp library, be sure to check out the AngleSharp documentation.
First, we need to install the AngleSharp NuGet package. To do that, we run the NuGet command:
PM> Install-Package AngleSharp
Let’s explore the utilization of the AngleSharp library’s capabilities to remove HTML tags:
public static string UseAngleSharp(string input)
{
    var parser = new HtmlParser();
    var document = parser.ParseDocument(input);

    // TextContent already holds the decoded text, so no HtmlDecode() call is needed
    var result = document.Body!.TextContent;

    return result;
}
Here, we create a UseAngleSharp() method to use the AngleSharp library. Within this method, we first create an instance of the HtmlParser class. Then, we call the ParseDocument() method, and finally, we get the result from the document.Body!.TextContent property.
Please note that, in this case, we do not call the HtmlDecode() method. AngleSharp takes care of both removing the HTML tags and decoding the encoded HTML characters. When we call this method, it yields the same result as the HtmlAgilityPack approach.
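Skipping the extra HtmlDecode() call is more than a small optimization. If the text is already decoded, decoding it a second time can change its meaning. The sketch below shows the effect with WebUtility alone; it assumes, as described above, that the parser has already decoded the text once. The DoubleDecodeDemo class is hypothetical:

```csharp
using System.Net;

public static class DoubleDecodeDemo
{
    // One decoding pass, which stands in for what the parser has already done.
    public static string DecodeOnce(string text) =>
        WebUtility.HtmlDecode(text);

    // A second pass on text that was already decoded.
    public static string DecodeTwice(string text) =>
        WebUtility.HtmlDecode(WebUtility.HtmlDecode(text));
}
```

For the input "Write &amp;lt;b&amp;gt; for bold", DecodeOnce() returns "Write &lt;b&gt; for bold", which is the text the author wanted readers to see. DecodeTwice() returns "Write <b> for bold", which is no longer the same text.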
Remove HTML Tags With System.Xml.Linq
The XElement class in the System.Xml.Linq namespace parses and manipulates XML content. When an HTML fragment is also well-formed XML, we can treat it as XML and use the same capabilities to work with it.
Let’s use the power of XML to remove HTML tags from a string:
public static string UseXmlXElement(string input)
{
    // &nbsp; is not a predefined XML entity, so we replace it before parsing
    var removedNbspHtml = input.Replace("&nbsp;", " ");
    var xElement = XElement.Parse("<root>" + removedNbspHtml + "</root>");

    return xElement.Value;
}
Here, we create the UseXmlXElement() method. Within this method, we begin by replacing occurrences of &nbsp; with a space, since XML does not recognize this entity. Then, we call the Parse() method of the XElement class and pass it an XML string that wraps our content in a <root> element. Finally, we return the Value property, which contains the text without the tags.
By calling the UseXmlXElement() method, we obtain the same result as with HtmlAgilityPack and AngleSharp.
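One caveat of this approach is that XElement.Parse() throws an XmlException when the input is not well-formed XML, and real-world HTML often is not, for example when it contains an unclosed <br> tag. A minimal sketch of a defensive variant could look like this. The XmlStripDemo class and the TryStripTags() method are hypothetical names and are not part of the article's source code:

```csharp
using System.Xml;
using System.Xml.Linq;

public static class XmlStripDemo
{
    // Returns false instead of throwing when the fragment is not well-formed XML.
    public static bool TryStripTags(string input, out string text)
    {
        try
        {
            // &nbsp; is not a predefined XML entity, so replace it first.
            var cleaned = input.Replace("&nbsp;", " ");
            text = XElement.Parse("<root>" + cleaned + "</root>").Value;
            return true;
        }
        catch (XmlException)
        {
            text = string.Empty;
            return false;
        }
    }
}
```

With the input "<p>5 &lt; 7</p>", the method returns true and the text is "5 < 7". With "line one<br>line two", it returns false, because the <br> element is never closed. In that situation we can fall back to one of the HTML parsers shown earlier.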
Comparing the Performance
Having all our methods prepared, we are now poised to assess their time and memory performance. To facilitate this evaluation, we will make use of the BenchmarkDotNet library.
Preparation
Instead of relying on user input, we will read the HTML content from a file. For this scenario, we create an HTML file named long.html. Please refer to the source code to view the file contents.
First, let’s create a benchmark class:
[Orderer(SummaryOrderPolicy.FastestToSlowest)]
[Config(typeof(AntiVirusFriendlyConfig))]
[MemoryDiagnoser]
public class RemoveHtmlTagBenchmark
{
    private string HTML_TEXT = string.Empty;

    [GlobalSetup]
    public void Setup()
    {
        HTML_TEXT = File.ReadAllText("./long.html");
    }
}
Here, we create the RemoveHtmlTagBenchmark class to run the benchmarks. Within the Setup() method, we read the HTML content and assign it to the HTML_TEXT variable.
Benchmark
Now we are ready to run our benchmark and analyze the outcomes:
| Method               | Mean      | Error     | StdDev     | Gen0     | Gen1     | Allocated  |
|--------------------- |----------:|----------:|-----------:|---------:|---------:|-----------:|
| UseRegularExpression |  41.59 us |  5.602 us |   8.384 us |   2.0142 |        - |   24.75 KB |
| UseHtmlDecode        |  44.27 us |  5.133 us |   7.683 us |   2.0142 |        - |   24.75 KB |
| UseXmlXElement       |  66.71 us |  7.679 us |  11.494 us |  11.8408 |   2.3193 |  145.17 KB |
| UseAngleSharp        | 266.26 us | 37.716 us |  56.452 us |  29.7852 |  14.1602 |  366.17 KB |
| UseHtmlAgilityPack   | 637.91 us | 71.085 us | 106.397 us | 263.6719 | 231.4453 | 3242.89 KB |
As the benchmark results show, regular expressions are the fastest way to remove HTML tags from a string. The UseRegularExpression() and UseHtmlDecode() methods have the shortest execution times (about 42 us and 44 us) and the lowest memory allocation (24.75 KB each). The UseXmlXElement() method follows at about 67 us and 145 KB. Conversely, AngleSharp and HtmlAgilityPack are slower and use more memory: compared with the regular expression approach, UseAngleSharp() takes about 6 times longer and allocates about 15 times more memory, while UseHtmlAgilityPack() takes about 15 times longer and allocates about 130 times more memory.
Conclusion
In the realm of C# text processing, effectively removing HTML tags and non-breaking spaces necessitates a thoughtful exploration of available approaches. This article has delved into various methods, each carrying its own merits and considerations, providing developers with a range of options tailored to their specific requirements and preferences.
Regular expressions emerge as a swift and concise solution for basic HTML tag removal, offering simplicity and effectiveness. However, their suitability may diminish when confronted with more intricate HTML structures. Dedicated HTML parsing libraries like HtmlAgilityPack and AngleSharp bring a structured and sophisticated approach to HTML manipulation, proving ideal for applications dealing with complex HTML scenarios.
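To make the limitation of regular expressions concrete, here is a small sketch of two inputs where the <.*?> pattern used in this article falls short. The RegexLimitDemo class is a hypothetical helper for this illustration:

```csharp
using System.Text.RegularExpressions;

public static class RegexLimitDemo
{
    // The same pattern as in UseRegularExpression(): it has no notion of
    // attributes or elements, it only looks for '<' ... '>' pairs.
    public static string Strip(string input) =>
        Regex.Replace(input, "<.*?>", string.Empty);
}
```

For the input "<a title=\"a > b\">link</a>", the match stops at the > inside the attribute value, so Strip() returns " b\">link" instead of "link". For "<script>alert(1)</script>Hi", the tags are removed but the script body stays, so the result is "alert(1)Hi". An HTML parser understands these structures, while the pattern does not.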
An intriguing alternative lies in the System.Xml.Linq namespace, which treats HTML as XML and provides a robust framework for parsing and manipulating the document. In scenarios involving HTML entities such as “&nbsp;”, the HtmlDecode() method from the System.Net.WebUtility class provides a convenient means of decoding these entities back to their original characters.
To sum up, the selection of a method hinges on the specific needs of your application. For straightforward scenarios, regular expressions may suffice, while more intricate HTML manipulation may warrant the use of dedicated parsing libraries.