Selectors let you pick HTML nodes out of an HtmlDocument using XPath expressions. Once you have loaded the HTML document, you can select an individual node or several nodes. For the purposes of this post, a node is an HTML tag. For example, you can select all of the paragraph tags, all of the table data tags, all of the div tags, and so on.
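There are two methods you can use to select nodes, and both take an XPath expression as a string: SelectNodes(), which returns a collection of every matching node, and SelectSingleNode(), which returns only the first match.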
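As a minimal sketch of how the two methods differ, here is a small console program. The HTML string and the class name are made up for illustration. One thing to watch for: with default settings, both methods return null when nothing matches, so it is worth checking the result before reading InnerText or looping.

```csharp
using System;
using HtmlAgilityPack;

class SelectorSketch
{
    static void Main()
    {
        // Hypothetical HTML used only for this sketch.
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><body><p>one</p><p>two</p></body></html>");

        // SelectSingleNode returns the first matching node, or null if there is none.
        HtmlNode first = doc.DocumentNode.SelectSingleNode("//p");
        Console.WriteLine(first?.InnerText ?? "(no match)");

        // SelectNodes returns every match; with default settings it returns null
        // (not an empty collection) when nothing matches.
        HtmlNodeCollection divs = doc.DocumentNode.SelectNodes("//div");
        if (divs == null)
        {
            Console.WriteLine("no div nodes");
        }
        else
        {
            foreach (HtmlNode div in divs)
            {
                Console.WriteLine(div.InnerText);
            }
        }
    }
}
```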
This post demonstrates several selection examples. A C# program runs XPath queries against a few HTML strings and produces the console output below. The output is a little verbose, but the code covers a few useful cases.
    Selection Tests for XPath with HTML Agility Pack in C#.
    -------------------------------------------------------

    1st paragraph SelectSingleNode //p[1] in HTML1:
    1: First paragraph

    Third paragraph SelectSingleNode //p[3] in HTML1:
    1: P 3

    First paragraph inside the first div tag SelectSingleNode //div/p in HTML2:
    2: parag inside div tag

    All paragraphs SelectNodes //p in HTML1:
    2: paragraph 1
    2: Paragraph 2.
    2: parag inside div tag

    All paragraphs inside the second div tag SelectNodes //div[2]/p in HTML5:
    5: Paragraph three in 2nd div
    5: paragraph four in second div

    All paragraphs SelectNodes //p containing the case-insensitive string 'par' and '2' in HTML5:
    5: This is paragraph 2
    5: Paragraph three in 2nd div

    Write all div nodes //div in HTML1 to console (there are none):

    All paragraphs with class of pclass SelectNodes //p[@class='pclass'] in HTML3:
    3: First paragraph class pclass
    3: P 3 with class pclass

    All tags with class of pclass SelectNodes //*[@class='pclass'] in HTML3:
    3: First paragraph class pclass
    3: P 3 with class pclass
    3: Heading 1 class is pclass

    End of XPath demonstration. Press any key to end program.
Here is the C# code. Also, I created a class that holds the strings of HTML.
    using System;
    using HtmlAgilityPack;

    namespace HTMLAgilityXPath
    {
        class Program
        {
            static void Main(string[] args)
            {
                var hs = new HTMLStrings();
                var doc = new HtmlDocument();
                doc.LoadHtml(hs.HTML1);

                Console.WriteLine("Selection Tests for XPath with HTML Agility Pack in C#.");
                Console.WriteLine("-------------------------------------------------------");

                // SelectSingleNode returns the first node that matches the XPath expression.
                Console.WriteLine("\n1st paragraph SelectSingleNode //p[1] in HTML1:");
                Console.WriteLine(doc.DocumentNode.SelectSingleNode("//p[1]").InnerText);

                Console.WriteLine("\nThird paragraph SelectSingleNode //p[3] in HTML1:");
                Console.WriteLine(doc.DocumentNode.SelectSingleNode("//p[3]").InnerText);

                Console.WriteLine("\nFirst paragraph inside the first div tag SelectSingleNode //div/p in HTML2:");
                doc.LoadHtml(hs.HTML2);
                Console.WriteLine(doc.DocumentNode.SelectSingleNode("//div/p").InnerText);

                // doc still holds HTML2 here, so the paragraphs listed are HTML2's
                // even though the label printed below says HTML1.
                Console.WriteLine("\nAll paragraphs SelectNodes //p in HTML1:");
                foreach (HtmlNode pg in doc.DocumentNode.SelectNodes("//p"))
                {
                    Console.WriteLine(pg.InnerText);
                }

                Console.WriteLine("\nAll paragraphs inside the second div tag SelectNodes //div[2]/p in HTML5:");
                doc.LoadHtml(hs.HTML5);
                HtmlNodeCollection nodesdivp = doc.DocumentNode.SelectNodes("//div[2]/p");
                foreach (HtmlNode node in nodesdivp)
                {
                    Console.WriteLine(node.InnerText);
                }

                Console.WriteLine("\nAll paragraphs SelectNodes //p containing the case-insensitive string 'par' and '2' in HTML5:");
                foreach (HtmlNode pg in doc.DocumentNode.SelectNodes("//p"))
                {
                    // Upper-case the text so the comparison ignores case.
                    string parag = pg.InnerText.ToUpper();
                    if (parag.Contains("PAR") && parag.Contains("2"))
                    {
                        Console.WriteLine(pg.InnerText);
                    }
                }

                // SelectNodes returns null when nothing matches, so check before looping.
                Console.WriteLine("\nWrite all div nodes //div in HTML1 to console (there are none):");
                doc.LoadHtml(hs.HTML1);
                HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div");
                if (nodes != null)
                {
                    foreach (HtmlNode node in nodes)
                    {
                        Console.WriteLine(node.InnerText);
                    }
                }

                Console.WriteLine("\nAll paragraphs with class of pclass SelectNodes //p[@class='pclass'] in HTML3:");
                doc.LoadHtml(hs.HTML3);
                HtmlNodeCollection nodespclass = doc.DocumentNode.SelectNodes("//p[@class='pclass']");
                foreach (HtmlNode node in nodespclass)
                {
                    Console.WriteLine(node.InnerText);
                }

                // doc still holds HTML3, so there is no need to load it again.
                Console.WriteLine("\nAll tags with class of pclass SelectNodes //*[@class='pclass'] in HTML3:");
                HtmlNodeCollection nodespcls = doc.DocumentNode.SelectNodes("//*[@class='pclass']");
                foreach (HtmlNode node in nodespcls)
                {
                    Console.WriteLine(node.InnerText);
                }

                Console.WriteLine("\nEnd of XPath demonstration. Press any key to end program.");
                Console.ReadKey();
            }
        }
    }
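The 'par' and '2' example above filters the paragraphs in C# after selecting them. As an alternative sketch, the same filter can be pushed into the XPath expression itself using the XPath 1.0 functions translate() and contains(). The class name and the shortened HTML string below are assumptions for illustration, not part of the original program. For these sample paragraphs, the query is intended to select the same two paragraphs as the loop.

```csharp
using System;
using HtmlAgilityPack;

class XPathFilterSketch
{
    static void Main()
    {
        // Hypothetical HTML for this sketch; it mirrors the paragraphs in HTML5.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<html><body><p>5: This is a paragraph</p><p>5: This is paragraph 2</p>" +
            "<div><p>5: Paragraph three in 2nd div</p>" +
            "<p>5: paragraph four in second div</p></div></body></html>");

        // translate() maps the uppercase letters P, A, R to lowercase, so the
        // contains() test for 'par' ignores case for those three letters.
        // text() here refers to the paragraph's first text child, which is the
        // whole content for simple paragraphs like these.
        string xpath =
            "//p[contains(translate(text(), 'PAR', 'par'), 'par') and contains(text(), '2')]";

        HtmlNodeCollection matches = doc.DocumentNode.SelectNodes(xpath);
        if (matches != null)
        {
            foreach (HtmlNode p in matches)
            {
                Console.WriteLine(p.InnerText);
            }
        }
    }
}
```

The trade-off is readability: the C# loop is easier to follow, while the XPath version keeps the whole condition in one selector string.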
Below is the HTMLStrings class, which initializes its HTML string properties in the constructor.
    namespace HTMLAgilityXPath
    {
        class HTMLStrings
        {
            public string HTML1 { get; set; }
            public string HTML2 { get; set; }
            public string HTML3 { get; set; }
            public string HTML4 { get; set; }
            public string HTML5 { get; set; }

            public HTMLStrings()
            {
                HTML1 = "<html><body><p>1: First paragraph</p><p>1: Paragraph 2.</p><p>1: P 3</p></body></html>";
                HTML2 = "<html><body><p>2: paragraph 1</p><p>2: Paragraph 2.</p><div><p>2: parag inside div tag</p></div></body></html>";
                HTML3 = "<html><body><p class=\"pclass\">3: First paragraph class pclass</p><p>3: Paragraph 2.</p><p class=\"pclass\">3: P 3 with class pclass</p><h1 class=\"pclass\">3: "
                      + "Heading 1 class is pclass</h1></body></html>";
                HTML4 = ""; // not used by the demo program
                HTML5 = @"<html>
                    <body>
                    <p>5: This is a paragraph</p>
                    <p>5: This is paragraph 2</p>
                    <div>
                    <table>
                    <tr><th>name</th><th>number</th><th>lastname</th></tr>
                    <tr><td>5: bob</td><td>5: 23</td><td>5: smith</td></tr>
                    <tr><td>5: sammy</td><td>5: 77</td><td>5: davis jr.</td></tr>
                    </table>
                    </div>
                    <div>
                    <p>5: Paragraph three in 2nd div</p>
                    <p>5: paragraph four in second div</p>
                    <ul>
                    <li>5: 1.1 unordered list item one</li>
                    <li>5: 1.2 unordered list item two</li>
                    </ul>
                    </div>
                    <ul>
                    <li>5: 2.1 unordered list item one</li>
                    <li>5: 2.2 unordered list item two</li>
                    </ul>
                    </body>
                    </html>";
            }
        }
    }
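In HTML3 every class attribute is exactly "pclass", which is why the selector //p[@class='pclass'] works. That comparison checks the whole attribute value, so it would miss an element such as class="pclass other". A common XPath 1.0 idiom for matching a single class name among several is sketched below. The HTML string and class name in this sketch are hypothetical and not part of the original program. With this input, the exact comparison should match only the first paragraph, while the token test should match both.

```csharp
using System;
using HtmlAgilityPack;

class ClassTokenSketch
{
    static void Main()
    {
        // Hypothetical HTML: the second paragraph carries two classes.
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<html><body><p class=\"pclass\">only pclass</p>" +
            "<p class=\"pclass other\">pclass plus another class</p></body></html>");

        // Exact comparison against the whole attribute value.
        HtmlNodeCollection exact = doc.DocumentNode.SelectNodes("//p[@class='pclass']");

        // Token match: pad the normalized class list with spaces on both sides,
        // then look for ' pclass ' so that partial names do not match.
        HtmlNodeCollection token = doc.DocumentNode.SelectNodes(
            "//p[contains(concat(' ', normalize-space(@class), ' '), ' pclass ')]");

        // SelectNodes may return null when nothing matches, hence the null checks.
        Console.WriteLine("exact matches: " + (exact?.Count ?? 0));
        Console.WriteLine("token matches: " + (token?.Count ?? 0));
    }
}
```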