In this article, we’ll explore parsing HTML with AngleSharp in C#. AngleSharp is a .NET library that parses angle bracket-based markup such as HTML, SVG, MathML, and XML. We’ll introduce the library, walk through its most important features, and learn how to use them.
So let’s dive in.
What Is AngleSharp?
AngleSharp is a well-established .NET library for parsing, manipulating, and working with HTML documents. In addition to HTML, it can parse related formats such as SVG, MathML, and XML, and CSS support is available through the companion AngleSharp.Css package. One of its most common use cases is web scraping, the process of programmatically extracting information from web pages.
Let’s now start using it. To add AngleSharp to our project, we can install it through the NuGet Package Manager:
Install-Package AngleSharp
Throughout the article, we’re going to be using a simple HTML string:
<!DOCTYPE html>
<html>
<style>
  .blue { color: blue; }
</style>
<body>
  <h2>Title</h2>
  <section id="section">
    <div id="articles">
      <article id="a1">Article 1 <em>content</em>.</article>
    </div>
    <p class="paragraph">This is a paragraph.</p>
    <ul id="list">
      <li class="blue">Item 1</li>
      <li>Item 2</li>
      <li class="blue">Item 3</li>
    </ul>
    <form id="sign-up-form">
      <label for="username">Username: </label>
      <input id="username" name="username" type="text" />
      <label for="password">Password: </label>
      <input id="password" name="password" type="password" />
      <button id="button" type="submit">Sign up</button>
    </form>
  </section>
  <footer>Footer</footer>
</body>
</html>
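Now, let’s look at the most basic usage, parsing our HTML string:
using System.Linq;
using AngleSharp;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;

// Html holds the sample HTML string shown above
var config = Configuration.Default;
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(req => req.Content(Html));

// Select every <article> element and read the text of the first one
var articles = document
    .QuerySelectorAll<IHtmlElement>("article")
    .ToList();
var firstArticleTextContent = articles[0].TextContent;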
First, we create an instance of the IBrowsingContext interface, which is a required construct for parsing HTML pages. We can think of it like a tab in a standard browser.
Then, we parse the HTML string using our context and get an instance of an IDocument in return. This is the in-memory representation of the DOM (Document Object Model) in AngleSharp.
Then, similarly to the JavaScript DOM APIs, we can retrieve all article elements by calling the QuerySelectorAll() method on the document and specifying the article tag. Lastly, we can get the content of an article using the TextContent property.
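When the markup is already in memory and nothing has to be loaded, the parser can also be used directly, without going through OpenAsync(). The following is a minimal sketch of that approach using AngleSharp’s HtmlParser class; the ArticleReader class and its ReadArticleTexts method are hypothetical names chosen for illustration, not part of the library.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser;

public static class ArticleReader
{
    // Hypothetical helper: parses an HTML string synchronously and
    // returns the trimmed text content of every <article> element.
    public static List<string> ReadArticleTexts(string html)
    {
        var parser = new HtmlParser();
        using var document = parser.ParseDocument(html);

        return document
            .QuerySelectorAll("article")
            .Select(article => article.TextContent.Trim())
            .ToList();
    }
}
```

Calling ArticleReader.ReadArticleTexts(Html) with our sample string should yield a single entry for the a1 article. The browsing context remains the better fit whenever a page has to be fetched from a URL.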
Extracting Data Using Different Methods
Let’s look more closely at AngleSharp’s capabilities and then build a web-scraping example that illustrates its power.
AngleSharp gives us nice and elegant APIs, similar to the JavaScript DOM APIs, which we can use to query, traverse, and inspect the properties of HTML elements. The beautiful part is that we can utilize CSS selectors to find the elements we’re searching for:
var config = Configuration.Default;
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(req => req.Content(Html));

// CSS selector: every <p> element inside the body
var paragraphElements = document.Body
    .QuerySelectorAll<IHtmlParagraphElement>("p")
    .ToList();

// LINQ equivalent: filter all elements of the document by tag name
var paragraphElementsLinq = document.All
    .Where(e => e.TagName.Equals("p", StringComparison.InvariantCultureIgnoreCase))
    .ToList();
Here, we’re querying for all paragraph elements using the p tag in our CSS selector. We could achieve the same with LINQ by filtering elements on their TagName property.
We can also select elements based on attributes such as class and id:
// All <li> elements with the "blue" class
var blueListItemElements = document.Body
    .QuerySelectorAll<IHtmlListItemElement>("li.blue")
    .ToList();
var blueListItemElementsLinq = document.All
    .Where(e => e.LocalName == "li" && e.ClassList.Contains("blue"))
    .ToList();

// The form with the id "sign-up-form", found in three different ways
var formElement = document.Body.QuerySelector<IHtmlFormElement>("form#sign-up-form");
var formElementLinq = document.All
    .First(e => e.LocalName == "form" && e.Id == "sign-up-form");
var formElementById = document.GetElementById("sign-up-form") as IHtmlFormElement;
Here, we select all li elements that have the class blue. We do this either with a CSS selector or by using LINQ to filter on the ClassList. Similarly, we retrieve the form element with the id sign-up-form through a selector, through LINQ, or through GetElementById().
We can also query for an element using a CSS attribute selector:
var userNameInputElement = document.Body
    .QuerySelector<IHtmlInputElement>("form > input[name='username']");
var userNameInputElementLinq = document.All
    .First(e => e.LocalName == "input" && e.Attributes["name"]?.Value == "username");
This time, we retrieve the first input element whose name attribute is username and that is a direct child of a form element.
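Attribute selectors combine well with AngleSharp’s typed element interfaces, because the matched elements expose their attributes as properties. The sketch below lists the name and type of every named input in a form. FormInspector and DescribeInputs are hypothetical names introduced for this illustration; HtmlParser, QuerySelector<T>(), and the Name and Type properties of IHtmlInputElement come from AngleSharp.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;

public static class FormInspector
{
    // Hypothetical helper: returns (Name, Type) pairs for every <input>
    // with a name attribute inside the form matched by formSelector.
    public static List<(string Name, string Type)> DescribeInputs(string html, string formSelector)
    {
        using var document = new HtmlParser().ParseDocument(html);

        var form = document.QuerySelector<IHtmlFormElement>(formSelector);
        if (form is null)
        {
            // No matching form: return an empty result instead of throwing
            return new List<(string Name, string Type)>();
        }

        return form
            .QuerySelectorAll<IHtmlInputElement>("input[name]")
            .Select(input => (Name: input.Name ?? string.Empty, Type: input.Type))
            .ToList();
    }
}
```

For our sample document, FormInspector.DescribeInputs(Html, "form#sign-up-form") should report the username and password fields in document order.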
Apart from querying the DOM or the body, we can also retrieve different properties of each element:
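// Look up the element whose properties we want to inspect
var section = document.QuerySelector<IHtmlElement>("section#section")!;

var sectionInnerHtml = section.InnerHtml;
var sectionTextContent = section.TextContent;
var sectionAttributes = section.Attributes;
var sectionChildren = section.Children;
var nextSibling = section.NextElementSibling;
var previousSibling = section.PreviousElementSibling;
Here, we read the section element’s inner HTML, text content, attributes, child elements, and neighboring siblings. Other properties, such as ClassList and ChildNodes, are available in the same way.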
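As a small illustration of traversing these properties, the following sketch collects the tag names of an element’s direct children. ElementOutline and ChildTagNames are hypothetical names used only for this example; Children and LocalName are the AngleSharp members doing the work.

```csharp
using System.Collections.Generic;
using System.Linq;
using AngleSharp.Html.Parser;

public static class ElementOutline
{
    // Hypothetical helper: returns the local (lower-case) tag names of the
    // direct child elements of the first element matching the selector.
    public static List<string> ChildTagNames(string html, string selector)
    {
        using var document = new HtmlParser().ParseDocument(html);
        var element = document.QuerySelector(selector);

        return element is null
            ? new List<string>()
            : element.Children.Select(child => child.LocalName).ToList();
    }
}
```

Applied to our sample markup with the selector section#section, this should list the div, p, ul, and form children in document order.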
Using AngleSharp for a Mini Web-Scraper
Now, using the above knowledge, let’s illustrate an example where AngleSharp shines the most – we’re going to write a mini web-scraper for an online books catalog:
var booksCatalogUrl = "https://books.toscrape.com/";

// WithDefaultLoader() enables HTTP requests;
// WithJs() comes from the separate AngleSharp.Js package
var config = Configuration.Default
    .WithDefaultLoader()
    .WithJs();
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(new Url(booksCatalogUrl));

// Narrow the search to the books section, then collect the product articles
var booksSection = document.QuerySelector<IHtmlElement>("div.page_inner section")!;
var bookInfoArticles = booksSection
    .QuerySelectorAll<IHtmlElement>("li > article.product_pod")
    .ToCollection();
First, we make an HTTP request to the books catalog website to retrieve the page’s document. Then we find the book section element by using a descendant selector. Finally, we retrieve all article elements containing product information via another CSS selector.
Next, we retrieve the information for each book. We first create a record to represent the book data structure:
public record Book(string Title, decimal Price, double Rating, string ImageUrl);
Then, we create a helper method for retrieving a Book data structure from an IElement instance:
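// Requires: using System.Globalization; using System.Linq;
// using AngleSharp.Dom; using AngleSharp.Html.Dom;
private static Book ToBook(IElement e)
{
    // The src attribute of the thumbnail image
    var imageUrl = e.QuerySelector<IHtmlImageElement>("div.image_container > a > img.thumbnail")!.Source
        ?? string.Empty;

    // Prefer the full title from the title attribute over the link text
    var titleElement = e.QuerySelector<IHtmlAnchorElement>("h3 > a")!;
    var title = titleElement.Title ?? titleElement.TextContent.Trim();

    // Strip the pound sign and parse with the invariant culture,
    // so "." is treated as the decimal separator on every machine
    var priceText = e.QuerySelector<IHtmlParagraphElement>("div.product_price > p.price_color")!
        .TextContent
        .Replace("\u00a3", string.Empty);
    var price = decimal.TryParse(priceText, NumberStyles.Number, CultureInfo.InvariantCulture, out var productPrice)
        ? productPrice
        : default;

    // The rating is encoded as a second class name next to "star-rating"
    var ratingElementClassList = e.QuerySelector<IHtmlParagraphElement>("p.star-rating")!.ClassList;
    var otherClassName = ratingElementClassList.First(s => s != "star-rating");
    var rating = otherClassName switch
    {
        "One" => 1,
        "Two" => 2,
        "Three" => 3,
        "Four" => 4,
        "Five" => 5,
        _ => 0
    };

    return new Book(title, price, rating, imageUrl);
}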
Here, we utilize different AngleSharp CSS selectors and properties of HTML elements to retrieve the full information for a single Book.
First, we extract the src attribute from an image to get the book’s image URL. Also, we query for the title attribute from an a element. We can find the book’s price by retrieving the text content from a specific paragraph. And finally, we determine the star rating of the book by looking at a paragraph’s class list.
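The price step deserves a closer look, because parsing a decimal from scraped text depends on the culture of the machine running the code. The sketch below isolates that step and parses with the invariant culture, so a value such as 51.77 is read the same way everywhere. PriceParser and ParsePrice are hypothetical names for this illustration, and the approach assumes the catalog always formats prices with a leading pound sign and a dot as the decimal separator.

```csharp
using System.Globalization;

public static class PriceParser
{
    // Hypothetical helper: removes the pound sign and parses the remainder
    // using the invariant culture. Returns 0 when the text is not a number.
    public static decimal ParsePrice(string rawText)
    {
        var cleaned = rawText.Replace("\u00a3", string.Empty).Trim();

        return decimal.TryParse(
            cleaned,
            NumberStyles.Number,
            CultureInfo.InvariantCulture,
            out var price)
            ? price
            : default;
    }
}
```

Falling back to the default value keeps the scraper running when a price is missing, at the cost of hiding the problem. Returning a nullable decimal instead would be a reasonable alternative if missing prices need to be detected.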
Now, building upon our previous example, we can get the final list of catalog books by using the retrieved article elements:
var books = bookInfoArticles.Select(ToBook).ToList();
DOM Manipulation
AngleSharp also gives us the power to directly manipulate and transform the IDocument object or any of its elements, exposing methods similar to the JavaScript DOM APIs.
We can easily perform operations such as adding elements:
// WithJs() comes from the separate AngleSharp.Js package
var config = Configuration.Default
    .WithDefaultLoader()
    .WithJs();
var context = BrowsingContext.New(config);
var document = await context.OpenAsync(req => req.Content(Html));

// Create the element by tag name...
var paragraphElement = document.CreateElement("p");
// ...or, alternatively, by its element interface
paragraphElement = document.CreateElement<IHtmlParagraphElement>();

paragraphElement.TextContent = "This is a new paragraph.";
document.Body.AppendChild(paragraphElement);
Here, we create a p (paragraph) element using the CreateElement() method and set its text content. Then we append the element to the DOM with the AppendChild() method.
Conversely, we could remove elements as well:
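var ulElement = document.QuerySelector<IHtmlUnorderedListElement>("ul#list")!;
var blueLiElement = ulElement.QuerySelector<IHtmlListItemElement>("li.blue")!;

// Remove the item through its parent
ulElement.RemoveChild(blueLiElement);

// Alternative with the same effect: blueLiElement.Remove();
// Use only one of the two, because RemoveChild() fails
// once the element is no longer a child of the list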
This time, we find the first li element with the blue class inside the list and remove it with the RemoveChild() method of its parent. Calling Remove() on the element itself has the same effect.
Also, we can change an element’s properties, such as its attributes, classes, and text content / inner HTML:
var article = document.QuerySelector<IElement>("article#a1")!;

// Replace the content as plain text...
article.TextContent = "New article content";
// ...or as markup, which overwrites the text set above
article.InnerHtml = "New article content. <br /> Second article sentence.";

article.ClassList.Add("small-article");
article.Id = "news-article";

// Add a custom attribute and remove it again
article.SetAttribute("data-category", "news");
article.RemoveAttribute("data-category");
Here, we overwrite the element’s content through TextContent and InnerHtml, add a class, change the id, and use the SetAttribute() and RemoveAttribute() methods to add and remove a custom attribute.
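After manipulating a document, we usually want the resulting markup back as a string. The sketch below appends a paragraph and reads the body’s InnerHtml to serialize the result. DomEditor and AppendParagraph are hypothetical names for this illustration, and the example assumes the input parses into a document with a body, which the HTML parser creates even for fragments.

```csharp
using AngleSharp.Html.Parser;

public static class DomEditor
{
    // Hypothetical helper: appends a <p> with the given text to the body
    // and returns the body's serialized inner HTML.
    public static string AppendParagraph(string html, string text)
    {
        using var document = new HtmlParser().ParseDocument(html);
        var body = document.Body!;

        var paragraph = document.CreateElement("p");
        paragraph.TextContent = text;
        body.AppendChild(paragraph);

        return body.InnerHtml;
    }
}
```

Because the text is assigned through TextContent rather than InnerHtml, any markup characters in it are escaped during serialization instead of being interpreted as HTML.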
Other Advanced Capabilities
AngleSharp extends its capabilities beyond standard HTML parsing and DOM manipulation, offering advanced features like form submission and script execution using C#. With built-in mechanisms for form interaction, we can seamlessly simulate user inputs, submit forms, and capture resulting changes to the DOM.
Furthermore, script execution support, provided by the AngleSharp.Js extension package and enabled with WithJs(), allows us to interpret and execute embedded scripts in HTML documents, enabling us to manipulate the web content dynamically. These features make AngleSharp a compelling choice for applications requiring not just static analysis but also dynamic interaction with HTML-based interfaces.
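To make the form interaction more concrete, here is a minimal sketch of the first half of that workflow: simulating user input by setting the Value of each named input. FormFiller and Fill are hypothetical names for this illustration. The sketch deliberately stops before submission, since sending the form (for example through the SubmitAsync() method of IHtmlFormElement) additionally needs a loader-enabled browsing context and a reachable form action, neither of which our sample markup provides.

```csharp
using System;
using System.Collections.Generic;
using AngleSharp.Dom;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;

public static class FormFiller
{
    // Hypothetical helper: copies the supplied values into the matching
    // named inputs and returns the resulting name/value state of the form.
    public static Dictionary<string, string> Fill(
        string html,
        string formSelector,
        IReadOnlyDictionary<string, string> values)
    {
        using var document = new HtmlParser().ParseDocument(html);
        var form = document.QuerySelector<IHtmlFormElement>(formSelector)
            ?? throw new ArgumentException($"No form matches '{formSelector}'.", nameof(formSelector));

        var state = new Dictionary<string, string>();
        foreach (var input in form.QuerySelectorAll<IHtmlInputElement>("input[name]"))
        {
            var name = input.Name;
            if (string.IsNullOrEmpty(name))
            {
                continue; // skip inputs with an empty name attribute
            }

            if (values.TryGetValue(name, out var value))
            {
                input.Value = value; // simulate the user typing into the field
            }

            state[name] = input.Value;
        }

        return state;
    }
}
```

With our sample document, passing values for username and password to FormFiller.Fill(Html, "form#sign-up-form", values) should return both fields with the supplied text.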
Comparison With Other Libraries
In the .NET ecosystem, AngleSharp distinguishes itself among HTML parsing libraries for its comprehensive feature set and adherence to web standards.
Popular alternatives include HtmlAgilityPack, widely known for its simplicity and robustness in handling malformed HTML but lacking some of the advanced features offered by AngleSharp.
Another library, CsQuery, emphasizes jQuery-like syntax for DOM manipulation, providing us with a familiar paradigm, although it is no longer actively maintained.
However, AngleSharp sets itself apart by combining a powerful CSS selector engine with advanced form submission and script execution support. It is an excellent choice for web parsing and manipulation tasks in .NET applications.
Conclusion
In this article, we’ve learned how to utilize the AngleSharp library to perform various HTML parsing operations in C#.
In summary, AngleSharp stands as a powerful HTML parsing library within the .NET ecosystem, distinguished by its rich set of features. It offers a robust CSS selector engine, DOM manipulation, form submission capabilities, script execution support through the AngleSharp.Js extension, and more. Thus, AngleSharp can be an invaluable tool for us when it comes to sophisticated HTML processing tasks in C# applications.