HTML Agility Pack Selectors

This entry is part 4 of 5 in the series HTML Agility Pack

Selectors let you pick HTML nodes out of an HtmlDocument. Once you have loaded the HTML document, you can select an individual node or several nodes. For the purposes of this post, a node is an HTML tag. For example, you can select all of the paragraph tags, all of the table data tags, all of the div tags, and so on.

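There are two methods you can use to select nodes: SelectNodes(string) and SelectSingleNode(string). Both take an XPath expression. SelectNodes returns every matching node, while SelectSingleNode returns only the first match. By default, both return null when nothing matches.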
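As a quick illustration of the difference between the two methods, here is a minimal sketch. The class name SelectorBasics, its helper methods, and the tiny HTML string are made up for this example and are not part of the program further down.

```csharp
using System;
using HtmlAgilityPack;

static class SelectorBasics
{
    // Illustrative HTML only: two paragraphs and no div.
    public const string Sample = "<html><body><p>one</p><p>two</p></body></html>";

    // Returns the text of the first <p>, or null when there is none.
    public static string FirstParagraph(HtmlDocument doc)
    {
        HtmlNode first = doc.DocumentNode.SelectSingleNode("//p");
        return first == null ? null : first.InnerText;
    }

    // Returns how many nodes matched; SelectNodes gives null (not an empty list) on no match.
    public static int CountMatches(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection found = doc.DocumentNode.SelectNodes(xpath);
        return found == null ? 0 : found.Count;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(Sample);

        Console.WriteLine(FirstParagraph(doc));        // one
        Console.WriteLine(CountMatches(doc, "//p"));   // 2
        Console.WriteLine(CountMatches(doc, "//div")); // 0
    }
}
```

The null checks matter because looping directly over a null result from SelectNodes throws a NullReferenceException.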

This post demonstrates several selection examples. The C# program below runs XPath queries against a few HTML strings and produces the following console output. The output is a little verbose, but the code covers a few useful examples.

Selection Tests for XPath with HTML Agility Pack in C#.
-------------------------------------------------------

1st paragraph SelectSingleNode //p[1] in HTML1:
1: First paragraph

Third paragraph SelectSingleNode //p[3] in HTML1:
1: P 3

First paragraph inside the first div tag SelectSingleNode //div/p in HTML2:
2: parag inside div tag

All paragraphs SelectNodes //p in HTML2:
2: paragraph 1
2: Paragraph 2.
2: parag inside div tag

All paragraphs inside the second div tag SelectNodes //div[2]/p in HTML5:
5: Paragraph three in 2nd div
5: paragraph four in second div

All paragraphs SelectNodes //p containing the case-insensitive string 'par' and '2' in HTML5:
5: This is paragraph 2
5: Paragraph three in 2nd div

Write all div nodes //div in HTML1 to console (there are none):

All paragraphs with class of pclass SelectNodes //p[@class='pclass'] in HTML3:
3: First paragraph class pclass
3: P 3 with class pclass

All tags with class of pclass SelectNodes //*[@class='pclass'] in HTML3:
3: First paragraph class pclass
3: P 3 with class pclass
3: Heading 1 class is pclass

End of XPath demonstration. Press any key to end program.

Here is the C# code. The HTML strings live in a separate class, HTMLStrings, which is shown after the program.

using System;
using HtmlAgilityPack;
namespace HTMLAgilityXPath
{
    class Program
    {
        static void Main(string[] args)
        {   
            var hs = new HTMLStrings();
            var doc = new HtmlDocument();
            doc.LoadHtml(hs.HTML1);
            // =========================================================================
            Console.WriteLine("Selection Tests for XPath with HTML Agility Pack in C#.");
            Console.WriteLine("-------------------------------------------------------");
            // ========================================================================
            Console.WriteLine("\n1st paragraph SelectSingleNode //p[1] in HTML1:");
            Console.WriteLine(doc.DocumentNode.SelectSingleNode("//p[1]").InnerText);
            Console.WriteLine("\nThird paragraph SelectSingleNode //p[3] in HTML1:");
            Console.WriteLine(doc.DocumentNode.SelectSingleNode("//p[3]").InnerText);
            // =================================================================================
            Console.WriteLine("\nFirst paragraph inside the first div tag SelectSingleNode //div/p in HTML2:");
            doc.LoadHtml(hs.HTML2);
            Console.WriteLine(doc.DocumentNode.SelectSingleNode("//div/p").InnerText);
            // =================================================================================
            // doc still holds HTML2 from the LoadHtml call above.
            Console.WriteLine("\nAll paragraphs SelectNodes //p in HTML2:");
            foreach (HtmlNode pg in doc.DocumentNode.SelectNodes("//p"))
            {
                Console.WriteLine(pg.InnerText);
            }
            // ==================================================================================
            Console.WriteLine("\nAll paragraphs inside the second div tag SelectNodes //div[2]/p in HTML5:");
            doc.LoadHtml(hs.HTML5);
            HtmlNodeCollection nodesdivp = doc.DocumentNode.SelectNodes("//div[2]/p");
            foreach (HtmlNode node in nodesdivp)
            {
                Console.WriteLine(node.InnerText);
            }
            // ===================================================================================
            Console.WriteLine("\nAll paragraphs SelectNodes //p containing the case-insensitive string 'par' and '2' in HTML5:");
            string parag = "";
            bool bContains;
            bool bContains2;
            foreach (HtmlNode pg in doc.DocumentNode.SelectNodes("//p"))
            {
                var strToSearchFor = "PAR";
                var strToSearchFor2 = "2";
                parag = pg.InnerText.ToUpper();
                bContains = parag.Contains(strToSearchFor);
                bContains2 = parag.Contains(strToSearchFor2);
                if (bContains && bContains2) 
                {
                    Console.WriteLine(pg.InnerText);
                }
            }
            // ====================================================================================
            Console.WriteLine("\nWrite all div nodes //div in HTML1 to console (there are none):");
            doc.LoadHtml(hs.HTML1);
            HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div");
            if (nodes != null)
            {
                foreach (HtmlNode node in nodes) { Console.WriteLine(node.InnerText); }
            }
            // ====================================================================================
            Console.WriteLine("\nAll paragraphs with class of pclass SelectNodes //p[@class='pclass'] in HTML3:");
            doc.LoadHtml(hs.HTML3);
            HtmlNodeCollection nodespclass = doc.DocumentNode.SelectNodes("//p[@class='pclass']");
            foreach (HtmlNode node in nodespclass)
            {
                Console.WriteLine(node.InnerText);
            }
            // =====================================================================================
            Console.WriteLine("\nAll tags with class of pclass SelectNodes //*[@class='pclass'] in HTML3:");
            // doc still holds HTML3, so there is no need to load it again.
            HtmlNodeCollection nodespcls = doc.DocumentNode.SelectNodes("//*[@class='pclass']");
            foreach (HtmlNode node in nodespcls)
            {
                Console.WriteLine(node.InnerText);
            }
            Console.WriteLine("\nEnd of XPath demonstration. Press any key to end program.");
            Console.ReadKey(); 
        }
    }
}
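Only the //div query in the program checks for null before looping. The other foreach loops work here because every query finds something, but they would throw if a query matched nothing. One way to avoid repeating the check is a small wrapper like the sketch below. SelectNodesOrEmpty and the SafeSelect class are hypothetical names for this example, not part of HTML Agility Pack.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class SafeSelect
{
    // Hypothetical helper: turns the null that SelectNodes returns on "no match"
    // into an empty sequence, so callers can always use foreach.
    public static IEnumerable<HtmlNode> SelectNodesOrEmpty(HtmlNode root, string xpath)
    {
        HtmlNodeCollection found = root.SelectNodes(xpath);
        if (found == null)
        {
            return Enumerable.Empty<HtmlNode>();
        }
        return found;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Same shape as HTML1: paragraphs only, no div.
        doc.LoadHtml("<html><body><p>1: First paragraph</p><p>1: P 3</p></body></html>");

        // No div nodes exist, so this loop body simply never runs.
        foreach (HtmlNode div in SelectNodesOrEmpty(doc.DocumentNode, "//div"))
        {
            Console.WriteLine(div.InnerText);
        }

        // Paragraphs are found and printed as usual.
        foreach (HtmlNode p in SelectNodesOrEmpty(doc.DocumentNode, "//p"))
        {
            Console.WriteLine(p.InnerText);
        }
    }
}
```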

Below is the HTMLStrings class, which initializes its properties in the constructor.

namespace HTMLAgilityXPath
{
    class HTMLStrings
    {
        public string HTML1 { get; set; }
        public string HTML2 { get; set; }
        public string HTML3 { get; set; }
        public string HTML4 { get; set; }
        public string HTML5 { get; set; }

        public HTMLStrings()
        {
            HTML1 = "<html><body><p>1: First paragraph</p><p>1: Paragraph 2.</p><p>1: P 3</p></body></html>";
            HTML2 = "<html><body><p>2: paragraph 1</p><p>2: Paragraph 2.</p><div><p>2: parag inside div tag</p></div></body></html>";
            HTML3 = "<html><body><p class=\"pclass\">3: First paragraph class pclass</p><p>3: Paragraph 2.</p><p class=\"pclass\">3: P 3 with class pclass</p><h1 class=\"pclass\">3: " +
                "Heading 1 class is pclass</h1></body></html>";
            HTML4 = "";
            HTML5 = @"<html>
                <body>
                   <p>5: This is a paragraph</p>
                   <p>5: This is paragraph 2</p>
                   <div>
                       <table>
                          <tr><th>name</th><th>number</th><th>lastname</th></tr>
                          <tr><td>5: bob</td><td>5: 23</td><td>5: smith</td></tr>
                          <tr><td>5: sammy</td><td>5: 77</td><td>5: davis jr.</td></tr>
                       </table>
                   </div>
                   <div>
                       <p>5: Paragraph three in 2nd div</p>
                       <p>5: paragraph four in second div</p>
                       <ul>
                            <li>5: 1.1 unordered list item one</li>
                            <li>5: 1.2 unordered list item two</li>
                       </ul>
                   </div>
                   <ul>
                        <li>5: 2.1 unordered list item one</li>
                        <li>5: 2.2 unordered list item two</li>
                   </ul>
                </body>
            </html>";
        }
    }
}
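The program filters for 'par' and '2' in C# after selecting every paragraph. As an alternative sketch, the same filter can be written into the XPath expression itself. XPath 1.0 has no lower-case function, so the usual workaround is translate(). The class name XPathFilterDemo is made up for this example, and the HTML is a shortened stand-in for the HTML5 string.

```csharp
using System;
using HtmlAgilityPack;

static class XPathFilterDemo
{
    // translate() maps each upper-case letter to its lower-case twin, so the
    // 'par' test ignores case; the '2' test needs no case handling.
    public const string ParAnd2 =
        "//p[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'par')" +
        " and contains(., '2')]";

    static void Main()
    {
        var doc = new HtmlDocument();
        // Shortened stand-in for HTML5 (illustrative only).
        doc.LoadHtml(
            "<html><body><p>5: This is a paragraph</p><p>5: This is paragraph 2</p>" +
            "<div><p>5: Paragraph three in 2nd div</p><p>5: paragraph four in second div</p></div>" +
            "</body></html>");

        HtmlNodeCollection hits = doc.DocumentNode.SelectNodes(ParAnd2);
        if (hits != null)
        {
            foreach (HtmlNode p in hits)
            {
                Console.WriteLine(p.InnerText);
            }
        }
    }
}
```

This should print the same two paragraphs as the loop in the main program. The C# loop is arguably easier to read, while the XPath version keeps the whole filter in one expression.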


BeginCodingNow.com

for data analysts & software developers
