HTML Agility Pack Website Tables


This entry is part 2 of 5 in the series HTML Agility Pack

In this post we have a C# console program example that scrapes the data in a table from this site, BeginCodingNow.com, and creates a new file in the C:\temp folder that contains the data in a table. You need four things to do this for yourself: Visual Studio, the code on this post (feel free to copy it), a C: drive on your computer with a folder called temp and an Internet connection. At this website, BeginCodingNow.com, we have a small table that lists SQL Server Integer Data Types that we will scrape programmatically using this C# program with the HTML Agility Pack installed into Visual Studio.

We have another post on the Agility Pack here at begincodingnow.com called HTML File Agility Pack Table Data.

Below is the C# code we have used to do this. You can see that the source URL is hard-coded into the solution along with the destination file name. This is to make the solution shorter for this post. You can change these two hard-coded values and run it again for a different web page and a different destination file. You can also change the delimiter to a comma if you wish, however, if there are commas inside the web page’s table, your columns won’t line up. As you can see, using the Agility Pack has made the code very short and simple. To learn how to install the HTML Agility Pack, go to the post called HTML Agility Pack.

using System;
using System.IO;
using HtmlAgilityPack;

namespace AgilityScrapeTableMikeWeb
{
    class Program
    {
        static void Main(string[] args)
        {
            string strLineOut = "";
            string pathOut = "C:\\temp\\sqlintegerdatatypes.csv";
            string delimiter = "|";   // often we use a comma for this
            var html = @"https://begincodingnow.com/sql-server-integer-data-types/";
            // =========================================
            Console.WriteLine("Scraping tables from: " + html);
            HtmlWeb doc = new HtmlWeb();
            var htmlDoc = doc.Load(html);
            // now its time to output our data with SteamWriter and the Console.
            using (var sw = new StreamWriter(pathOut, true))
            {
                foreach (HtmlNode table in htmlDoc.DocumentNode.SelectNodes("//table"))
                {
                    Console.WriteLine("\nFound: " + table.Name);
                    //sw.WriteLine("\nFound: " + table.Name);
                    foreach (HtmlNode row in table.SelectNodes("//tr"))
                    {
                        strLineOut = "";
                        Console.WriteLine("");
                        foreach (HtmlNode cell in row.SelectNodes("th|td"))
                        {
                            Console.Write(cell.InnerText + delimiter);
                            strLineOut = strLineOut + cell.InnerText + delimiter;
                        }
                        // remove last separator from string as it creates column at end
                        strLineOut = strLineOut.Substring(0, strLineOut.Length - 1);
                        sw.WriteLine(strLineOut);
                    }
                }
            }
            Console.WriteLine("\n\nDone!  " + pathOut);
            Console.WriteLine("Press any key to end this program.");
            Console.ReadKey();
        }
    }
}

What does the output look like in the console?

Scraping tables from: https://begincodingnow.com/sql-server-integer-data-types/

Found: table

Data type|Range|Storage|
bigint|-2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)|8 Bytes|
int|-2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647)|4 Bytes|
smallint|-2^15 (-32,768) to 2^15-1 (32,767)|2 Bytes|
tinyint|0 to 255|1 Byte|

Done!  C:\temp\sqlintegerdatatypes.csv
Press any key to end this program.

What does the output file look like?

Data type|Range|Storage
bigint|-2^63 (-9,223,372,036,854,775,808) to 2^63-1 (9,223,372,036,854,775,807)|8 Bytes
int|-2^31 (-2,147,483,648) to 2^31-1 (2,147,483,647)|4 Bytes
smallint|-2^15 (-32,768) to 2^15-1 (32,767)|2 Bytes
tinyint|0 to 255|1 Byte

Once you have this delimited text file, you can open MS Excel and import it. Open an Excel file and go to the Data menu and click the button From Text/CSV. Follow the wizard to load the data.

Here is what the table looks like at this website, after I reduced the size in the Chrome browser by using the keyboard combination Ctrl+-.

Here is what the scraped data looks like after importing it into Excel. Very powerful.

Series Navigation<< HTML Agility PackHTML File Agility Pack Table Data >>