In this article, we’ll look at how to get the HTML content of a web page as a string and how to parse the retrieved content.
First things first, we’ll learn how to retrieve HTML content as a string from the web using tools like HttpClient, HtmlAgilityPack, and AngleSharp. Then, we’ll tackle parsing, which is understanding the HTML’s structure, and how to manipulate or analyze it. Here, we’ll look into the functionalities both HtmlAgilityPack and AngleSharp offer us for parsing HTML documents. We’ll explore how these tools help us navigate through the HTML code and extract the juicy bits we need.
Let’s roll up our sleeves, and embark on this exciting learning journey.
Why Should We Get HTML Content as a String
There are several situations where we might need to obtain HTML content as a string. For instance, assume we’re building a web scraper, or a testing suite, or maybe we want to parse and manipulate some HTML data.
Importantly, having HTML as a string simplifies the process of extracting and working with data, making it easier to discover and utilize the information it contains.
We can take several approaches, each with its pros and cons. From built-in .NET libraries to third-party NuGet packages, there’s no shortage of options to suit our needs.
Get HTML Content With HttpClient
Using HttpClient to fetch HTML content provides great convenience. With minimal code, we can retrieve HTML from any URL and use it as we see fit in our application. Furthermore, HttpClient comes baked right into .NET, eliminating the need for extra packages or dependencies.
To learn more about HttpClient, check out our series HttpClient with ASP.NET Core Tutorial.
After creating our .NET Core console application, let’s create a new class and name it HtmlHttp:
public class HtmlHttp
{
    private readonly HttpClient _httpClient;

    public HtmlHttp(IHttpClientFactory httpClientFactory)
    {
        _httpClient = httpClientFactory.CreateClient();
    }

    public async Task<string> GetHtmlAsStringAsync(string url)
    {
        var response = await _httpClient.GetAsync(url);
        response.EnsureSuccessStatusCode();

        var htmlContent = await response.Content.ReadAsStringAsync();

        return htmlContent;
    }
}
First, we inject an IHttpClientFactory and use it to create an instance of HttpClient. Then, we define a method GetHtmlAsStringAsync() that asynchronously fetches HTML content from a given URL using HttpClient.
Now, we can use the method to get the HTML content of a webpage inside the Program class:
var url = "https://www.wikipedia.org/";
var htmlHttp = new HtmlHttp(host.Services.GetRequiredService<IHttpClientFactory>());
var html = await htmlHttp.GetHtmlAsStringAsync(url);
Here, we define a url variable and assign https://www.wikipedia.org/ as its value. Then, we create an instance of our HtmlHttp class, passing in the IHttpClientFactory resolved from the host’s service provider. Next, we call the GetHtmlAsStringAsync() method on the HtmlHttp instance and await its completion to fetch the HTML content from the provided URL. Finally, we store the result in the html variable, which now holds the page’s markup.
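The snippet above refers to a host variable that it does not define. The following is one possible way to set it up, not necessarily how the original project does it. It assumes the Microsoft.Extensions.Hosting and Microsoft.Extensions.Http NuGet packages are installed, and it repeats the HtmlHttp class so that the file compiles on its own. The final Console.WriteLine() call is only there for illustration:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;

// Build a generic host and register IHttpClientFactory in its service container.
// (Assumption: the article's Program class builds its host in a similar way.)
using var host = Host.CreateDefaultBuilder(args)
    .ConfigureServices(services => services.AddHttpClient())
    .Build();

var url = "https://www.wikipedia.org/";
var htmlHttp = new HtmlHttp(host.Services.GetRequiredService<IHttpClientFactory>());
var html = await htmlHttp.GetHtmlAsStringAsync(url);

// Illustrative only: show that we received some markup.
Console.WriteLine($"Fetched {html.Length} characters from {url}");

public class HtmlHttp
{
    private readonly HttpClient _httpClient;

    public HtmlHttp(IHttpClientFactory httpClientFactory)
    {
        _httpClient = httpClientFactory.CreateClient();
    }

    public async Task<string> GetHtmlAsStringAsync(string url)
    {
        var response = await _httpClient.GetAsync(url);

        // Throws HttpRequestException for non-success status codes.
        response.EnsureSuccessStatusCode();

        return await response.Content.ReadAsStringAsync();
    }
}
```

Resolving the client through IHttpClientFactory, rather than creating a new HttpClient per request, lets the factory manage the underlying handlers for us.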
Get HTML Content as a String With HtmlAgilityPack
In addition to using HttpClient to fetch HTML content in .NET, another approach we can leverage is the HtmlAgilityPack library. HtmlAgilityPack simplifies the parsing and manipulation of HTML documents, offering us flexibility and ease of use. We will explore how to use HtmlAgilityPack to obtain HTML content as a string in our .NET app.
Firstly, let’s install the HtmlAgilityPack NuGet package using the .NET command line:
dotnet add package HtmlAgilityPack
Afterwards, let’s create a new class, name it HtmlAgility, and add a method to get the HTML string:
public class HtmlAgility
{
    public string GetHtmlAsString(string url)
    {
        var web = new HtmlWeb();
        var doc = web.Load(url);

        return doc.DocumentNode.OuterHtml;
    }
}
Similar to our previous example, we define a method GetHtmlAsString() that takes a URL as input and returns the HTML content of the webpage at that URL as a string. This time, however, we use HtmlAgilityPack’s HtmlWeb class to load the page synchronously, and we return the OuterHtml of the document node.
Use AngleSharp to Get HTML Content as a String
Another excellent option to fetch HTML content in .NET is AngleSharp. AngleSharp is a .NET library that allows us to parse angle bracket-based hypertext such as HTML, SVG, and MathML. It can also fetch HTML content from a given URL directly. Next, let’s explore how we can use AngleSharp to retrieve HTML content as a string.
First, let’s install the AngleSharp NuGet package using the .NET command line:
dotnet add package AngleSharp
Subsequently, let’s create a new class, HtmlAngle, and add a method to get the HTML string:
public class HtmlAngle
{
    public async Task<string> GetHtmlAsStringAsync(string url)
    {
        var config = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(config);
        var document = await context.OpenAsync(url);

        return document.Body.InnerHtml;
    }
}
This time, we define a GetHtmlAsStringAsync() method that accepts a URL as a string parameter and fetches the HTML content from that URL using AngleSharp. The Configuration.Default.WithDefaultLoader() call matters here: it extends the default configuration with a loader for handling web requests.
Afterwards, we create a new browsing context using BrowsingContext.New() and open the given URL with context.OpenAsync(). Finally, we return the InnerHtml property of the document’s body. Note that this is only the markup inside the <body> element, whereas the previous two examples return the whole document.
Parsing HTML Content
Parsing HTML content involves navigating the Document Object Model (DOM) hierarchy and extracting relevant elements or data points. So far, we have implemented methods for obtaining HTML content as strings. However, our objectives often require more than just retrieval. Sometimes we need to understand the HTML structure itself, for instance, to dissect specific segments, analyze their attributes, or extract relevant data. This is precisely where parsing comes in.
That is to say, by parsing the HTML content, we gain the ability to navigate through its structure, which enables us to extract, analyze, and manipulate targeted information effectively. In the upcoming sections, we will explore a range of methods and solutions for parsing HTML content.
Parsing With HtmlAgilityPack
HtmlAgilityPack makes it easy to traverse the DOM tree, selecting elements according to criteria such as tag names or attributes, and extracting inner text or attribute values.
We’ve seen how we can retrieve HTML content using HtmlAgilityPack. Next, we’ll go one step further and use the library to parse the HTML content by adding a new method:
public List<string> GetList(string htmlContent)
{
    var doc = new HtmlDocument();
    doc.LoadHtml(htmlContent);

    var nodes = doc.DocumentNode.SelectNodes("//ul[@class='list']/li");
    var result = new List<string>();

    if (nodes != null)
    {
        foreach (var node in nodes)
        {
            result.Add(node.InnerText);
        }
    }

    return result;
}
Here, we add a method to the HtmlAgility class that parses HTML content and extracts the values matching a given criterion, returning them as a List<string>. Inside it, we use an XPath expression to select all <li> elements that are children of <ul> elements whose class attribute is exactly list.
To sum it up, the SelectNodes() method traverses the DOM tree and returns a collection of HtmlNode objects representing the selected elements. When nothing matches, it returns null instead of an empty collection, which is why we check for null before iterating.
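To see the method in action without a network call, we can feed it a small HTML string. The following is a minimal sketch. The sample markup and the items variable are made up for illustration, and the GetList() method is repeated so the file compiles on its own with only the HtmlAgilityPack package installed:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical sample markup: one list that matches the XPath and one that does not.
var sampleHtml =
    "<html><body>" +
    "<ul class='list'><li>Apple</li><li>Banana</li></ul>" +
    "<ul class='other'><li>Ignored</li></ul>" +
    "</body></html>";

var items = new HtmlAgility().GetList(sampleHtml);

Console.WriteLine(string.Join(", ", items)); // prints "Apple, Banana"

public class HtmlAgility
{
    public List<string> GetList(string htmlContent)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlContent);

        // SelectNodes returns null when no node matches the XPath expression.
        var nodes = doc.DocumentNode.SelectNodes("//ul[@class='list']/li");
        var result = new List<string>();

        if (nodes != null)
        {
            foreach (var node in nodes)
            {
                result.Add(node.InnerText);
            }
        }

        return result;
    }
}
```

Because the XPath predicate compares the whole class attribute, a list declared as class='list featured' would not match this expression.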
Parsing With AngleSharp
Let’s now explore how we can accomplish a similar outcome to our previous example, but this time, using AngleSharp for parsing HTML content:
public List<string> GetList(string htmlContent)
{
    var parser = new HtmlParser();
    var document = parser.ParseDocument(htmlContent);

    var nodes = document.QuerySelectorAll("ul.list > li");
    var result = new List<string>();

    foreach (var node in nodes)
    {
        result.Add(node.TextContent);
    }

    return result;
}
Here, we create a GetList(htmlContent) method just like in our previous example. Inside the method, we instantiate the HtmlParser class and use its ParseDocument(htmlContent) method to load the given HTML content. Afterwards, we use the QuerySelectorAll() method with a CSS selector to select all <li> elements that are direct children of <ul> elements with the class list.
QuerySelectorAll() returns a collection of all matching elements, which is empty when nothing matches, so we need no null check here. We iterate over the collection, read each element’s TextContent, and add it to a List<string>.
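As with the HtmlAgilityPack version, we can try the method on an in-memory HTML string. The following is a minimal sketch. The sample markup and the items variable are made up for illustration, the method is placed in the HtmlAngle class on the assumption that this is where the article adds it, and the file needs only the AngleSharp package:

```csharp
using System;
using System.Collections.Generic;
using AngleSharp.Html.Parser;

// Hypothetical sample markup: the first list carries an extra class on purpose.
var sampleHtml =
    "<html><body>" +
    "<ul class=\"list featured\"><li>Apple</li><li>Banana</li></ul>" +
    "<ul class=\"other\"><li>Ignored</li></ul>" +
    "</body></html>";

var items = new HtmlAngle().GetList(sampleHtml);

Console.WriteLine(string.Join(", ", items)); // prints "Apple, Banana"

public class HtmlAngle
{
    public List<string> GetList(string htmlContent)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(htmlContent);

        // CSS selector: <li> elements that are direct children of <ul class="list ...">.
        var nodes = document.QuerySelectorAll("ul.list > li");
        var result = new List<string>();

        foreach (var node in nodes)
        {
            result.Add(node.TextContent);
        }

        return result;
    }
}
```

The CSS class selector matches any element whose class list contains list, so the first <ul> matches even though it also carries the featured class.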
Benchmark GetHtmlAsString Methods
With several methods available for getting HTML content as a string, we face the challenge of choosing the right one. We can use the BenchmarkDotNet library to evaluate the performance of these methods.
For our benchmark, we will run each of the methods with the same URL, https://www.wikipedia.org/.
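The article does not show the benchmark class itself, so the following is only a sketch of what such a setup could look like, assuming the BenchmarkDotNet, HtmlAgilityPack, and AngleSharp packages are installed. The class name HtmlBenchmarks is hypothetical, and for brevity each benchmark inlines the retrieval logic instead of calling the classes we built earlier:

```csharp
using System.Net.Http;
using System.Threading.Tasks;
using AngleSharp;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using HtmlAgilityPack;

BenchmarkRunner.Run<HtmlBenchmarks>();

// MemoryDiagnoser adds the Allocated column to the results.
[MemoryDiagnoser]
public class HtmlBenchmarks
{
    private const string Url = "https://www.wikipedia.org/";

    // A single shared client, so client construction is not part of the measurement.
    private static readonly HttpClient Client = new HttpClient();

    [Benchmark]
    public async Task<string> GetHtmlAsStringHttpClient()
    {
        var response = await Client.GetAsync(Url);
        response.EnsureSuccessStatusCode();

        return await response.Content.ReadAsStringAsync();
    }

    [Benchmark]
    public string GetHtmlAsStringHtmlAgilityPack()
    {
        var web = new HtmlWeb();
        var doc = web.Load(Url);

        return doc.DocumentNode.OuterHtml;
    }

    [Benchmark]
    public async Task<string> GetHtmlAsStringAngleSharp()
    {
        var config = Configuration.Default.WithDefaultLoader();
        var context = BrowsingContext.New(config);
        var document = await context.OpenAsync(Url);

        return document.Body?.InnerHtml ?? string.Empty;
    }
}
```

BenchmarkDotNet expects the project to run in Release configuration, for example with dotnet run -c Release. Since every benchmark makes a real network request, the timings depend heavily on the connection and on the remote server.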
Let’s view the result of our benchmark test:
| Method                         |     Mean |    Error |   Median |  Allocated |
|------------------------------- |---------:|---------:|---------:|-----------:|
| GetHtmlAsStringHttpClient      | 217.2 ms |  4.34 ms | 213.9 ms |  232.14 KB |
| GetHtmlAsStringHtmlAgilityPack | 224.0 ms |  4.33 ms | 222.5 ms | 6608.18 KB |
| GetHtmlAsStringAngleSharp      | 581.0 ms | 11.14 ms | 581.3 ms | 1852.59 KB |
From our results, the GetHtmlAsStringHttpClient() method excels in both execution speed and memory allocation, which makes it the top choice for high-performance scenarios.
Notably, the GetHtmlAsStringHtmlAgilityPack() method runs almost as fast as GetHtmlAsStringHttpClient() (224.0 ms versus 217.2 ms), but it allocates roughly 28 times as much memory (6608.18 KB versus 232.14 KB). This makes it a viable option where memory constraints are less strict.
On the other hand, the GetHtmlAsStringAngleSharp() method allocates far less memory than GetHtmlAsStringHtmlAgilityPack() (1852.59 KB), but it lags in execution speed, taking more than twice as long as the other two methods.
Overall, HttpClient offers the best balance of speed and memory efficiency for getting HTML content as a string. However, unlike the other approaches, it does not offer parsing capabilities. When we also need to parse the content, HtmlAgilityPack or AngleSharp could be better options, even if they come at the cost of slower execution or higher memory usage.
Conclusion
In this article, we explored several methods for both retrieving and parsing the HTML content of a web page. We used the built-in HttpClient class to retrieve HTML content as a string. Additionally, we explored two powerful third-party NuGet packages, HtmlAgilityPack and AngleSharp, for retrieving and parsing HTML content.
HttpClient provides us with a simple and convenient way to fetch HTML content from any URL, and it is readily available in .NET.
HtmlAgilityPack offers us the ability to retrieve, parse, and manipulate HTML documents, with flexibility and ease of use when retrieving HTML as strings.
Just like HtmlAgilityPack, AngleSharp makes parsing HTML content easy, and it can also fetch HTML content directly from URLs.