
How we can extract data from HTML tables and create a DataSet object containing this data



User Name: codelecturer
Name: Mike Chauhan
Contact Me: www.datawebcoder.com/ContactUs.aspx
Home Page: www.datawebcoder.com
Add Date: 11/12/2011
An example of how we can extract data from HTML tables and create a DataSet object containing this data.
To recreate the page that we need to scrape, I've created a simple function, GetHTML, that builds an HTML page containing two tables. You can use this function while testing, but in a real-life situation you will probably want to retrieve the HTML directly from the web page, or read it from a local file. Feel free to modify the function if you need to.
Data Extraction
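For testing, a sample-page builder of the kind described above could look something like the following. This is only a sketch in VB.NET (the language is an assumption based on the article's .NET context). The table contents, and the choice of one table with <th> headers and one without, are invented for illustration.

```vbnet
Imports System
Imports System.Text

Module SamplePage
    ' Hypothetical stand-in for a GetHTML helper: builds a small page
    ' containing two tables, one with <th> headers and one without.
    Function GetHTML() As String
        Dim sb As New StringBuilder()
        sb.AppendLine("<html><body>")

        ' First table: has a header row.
        sb.AppendLine("<table>")
        sb.AppendLine("<tr><th>Name</th><th>City</th></tr>")
        sb.AppendLine("<tr><td>Alice</td><td>London</td></tr>")
        sb.AppendLine("<tr><td>Bob</td><td>Leeds</td></tr>")
        sb.AppendLine("</table>")

        ' Second table: no header row, and upper-case tags to exercise
        ' case-insensitive matching later on.
        sb.AppendLine("<TABLE>")
        sb.AppendLine("<TR><TD>1</TD><TD>One</TD></TR>")
        sb.AppendLine("<TR><TD>2</TD><TD>Two</TD></TR>")
        sb.AppendLine("</TABLE>")

        sb.AppendLine("</body></html>")
        Return sb.ToString()
    End Function

    Sub Main()
        ' Print the generated page so it can be inspected by eye.
        Console.WriteLine(GetHTML())
    End Sub
End Module
```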
Whichever method we use to retrieve this HTML, we then need to extract the relevant table elements. I decided to use a Regular Expression to do this, with options set so that case and line breaks are ignored. The pattern targets the opening <table> and closing </table> tags.
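As a rough illustration of such a pattern, the sketch below combines RegexOptions.IgnoreCase and RegexOptions.Singleline so that case and line breaks are ignored. The exact pattern is an assumption, not necessarily the one the author used, and it does not cope with nested tables.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TableMatchDemo
    Sub Main()
        Dim nl As String = Environment.NewLine
        Dim html As String = "<p>intro</p>" & nl & _
            "<TABLE border=""1"">" & nl & "<tr><td>a</td></tr>" & nl & "</TABLE>" & nl & _
            "<table><tr><td>b</td></tr></table>"

        ' IgnoreCase matches <TABLE> as well as <table>.
        ' Singleline lets "." match line breaks, so a table can span lines.
        Dim opts As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

        ' The lazy ".*?" stops at the first closing tag, so each table
        ' becomes its own match; group 1 holds the text between the tags.
        Dim tables As MatchCollection = _
            Regex.Matches(html, "<table\b[^>]*>(.*?)</table>", opts)

        Console.WriteLine("Tables found: " & tables.Count.ToString())
        For Each m As Match In tables
            Console.WriteLine(m.Groups(1).Value.Trim())
        Next
    End Sub
End Module
```

With the sample string above, this finds two tables and prints the row markup inside each one.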
This returns all of the text in between the <table> tags, which allows us to apply further Regular Expressions to get the text inside all of the <tr> and <td> tags. As some of the tables returned to me had <th> tags and some didn't, I decided to include a check in the function to see whether they exist. If they do, I use the text inside these tags for the column names in my DataTable; if they don't, I simply create a default naming scheme (e.g. Column1, Column2, etc.).
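The row, cell and header checks might be sketched as follows. The patterns are assumptions chosen for illustration; the "\b" after each tag name keeps "<th" from also matching "<thead>".

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RowCellDemo
    Sub Main()
        ' The inner text of a single table, as captured by a <table> pattern.
        Dim tableHtml As String = "<tr><th>Name</th><th>City</th></tr>" & _
                                  "<tr><td>Alice</td><td>London</td></tr>"
        Dim opts As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

        ' Header check: does this table contain any <th> cells?
        Dim hasHeaders As Boolean = Regex.IsMatch(tableHtml, "<th\b[^>]*>", opts)
        Console.WriteLine("Has headers: " & hasHeaders.ToString())

        ' Walk each <tr>, then each <td> inside that row.
        For Each row As Match In Regex.Matches(tableHtml, "<tr\b[^>]*>(.*?)</tr>", opts)
            Dim rowHtml As String = row.Groups(1).Value
            For Each cell As Match In Regex.Matches(rowHtml, "<td\b[^>]*>(.*?)</td>", opts)
                Console.WriteLine(cell.Groups(1).Value)
            Next
        Next
    End Sub
End Module
```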
Logic
The logic of the function was actually fairly simple and could be broken down into the following "pseudo" steps:
1. Retrieve each instance of the table elements on the page.
2. Loop through each table, performing the following checks.
3. Check for the existence of <th> tags to determine whether we know the names of the columns; otherwise just add a default name for each column.
4. Loop through the rows of the table and, for each column, add the value to the corresponding column in the DataTable.
Implementation
Translating these steps into a .NET function, I came up with a function named "ConvertHTMLTablesToDataSet", which accepts the full HTML string, performs the actions identified above, and returns a DataSet with a corresponding DataTable for each HTML table that was found.
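A minimal sketch of a function with that name, following the four steps listed under Logic, is shown below. It is not the author's original code: the regular expressions, the Table1/Table2 naming of the DataTables and the handling of header-only rows are all assumptions.

```vbnet
Imports System
Imports System.Data
Imports System.Text.RegularExpressions

Module HtmlTableConverter
    Private Const Opts As RegexOptions = _
        RegexOptions.IgnoreCase Or RegexOptions.Singleline

    Function ConvertHTMLTablesToDataSet(ByVal html As String) As DataSet
        Dim ds As New DataSet()
        Dim tableIndex As Integer = 0

        ' Steps 1 and 2: find every <table> element and loop over them.
        For Each tableMatch As Match In Regex.Matches(html, "<table\b[^>]*>(.*?)</table>", Opts)
            tableIndex += 1
            Dim dt As New DataTable("Table" & tableIndex.ToString())
            Dim body As String = tableMatch.Groups(1).Value

            ' Step 3: use <th> text as column names when present.
            ' (Duplicate header names would throw; not handled in this sketch.)
            For Each th As Match In Regex.Matches(body, "<th\b[^>]*>(.*?)</th>", Opts)
                dt.Columns.Add(th.Groups(1).Value.Trim())
            Next

            ' Step 4: add one DataRow per <tr> that contains <td> cells.
            For Each rowMatch As Match In Regex.Matches(body, "<tr\b[^>]*>(.*?)</tr>", Opts)
                Dim cells As MatchCollection = _
                    Regex.Matches(rowMatch.Groups(1).Value, "<td\b[^>]*>(.*?)</td>", Opts)
                If cells.Count = 0 Then Continue For ' header-only or empty row

                ' No (or too few) headers: fall back to Column1, Column2, ...
                While dt.Columns.Count < cells.Count
                    dt.Columns.Add("Column" & (dt.Columns.Count + 1).ToString())
                End While

                Dim dr As DataRow = dt.NewRow()
                For i As Integer = 0 To cells.Count - 1
                    dr(i) = cells(i).Groups(1).Value.Trim()
                Next
                dt.Rows.Add(dr)
            Next

            ds.Tables.Add(dt)
        Next

        Return ds
    End Function

    Sub Main()
        Dim html As String = _
            "<table><tr><th>Name</th><th>City</th></tr>" & _
            "<tr><td>Alice</td><td>London</td></tr></table>" & _
            "<table><tr><td>1</td><td>2</td></tr></table>"

        Dim ds As DataSet = ConvertHTMLTablesToDataSet(html)
        For Each dt As DataTable In ds.Tables
            Console.WriteLine(dt.TableName & ": " & dt.Columns.Count.ToString() & _
                              " columns, " & dt.Rows.Count.ToString() & " rows")
        Next
    End Sub
End Module
```

Regular expressions keep the function short, but they are fragile against nested tables and markup inside cells. For pages you do not control, a real HTML parser may be the safer choice.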
Viewing the results
If you want to test this function, you can create a simple .aspx page with a Panel on it.

Then create a dynamic GridView for each DataTable.
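One way this could look in a Web Forms code-behind file is sketched below. The class name TestPage, the Panel ID pnlTables and the BuildSampleDataSet helper are hypothetical; in practice the DataSet would come from the HTML conversion function rather than being built by hand.

```vbnet
Imports System
Imports System.Data
Imports System.Web.UI.WebControls

' Code-behind sketch for a page whose markup is assumed to contain:
'   <asp:Panel ID="pnlTables" runat="server" />
Partial Class TestPage
    Inherits System.Web.UI.Page

    Protected Sub Page_Load(ByVal sender As Object, ByVal e As EventArgs) Handles Me.Load
        ' A hand-built DataSet keeps this sketch self-contained.
        Dim ds As DataSet = BuildSampleDataSet()

        ' Create one GridView per DataTable and add it to the Panel.
        For Each dt As DataTable In ds.Tables
            Dim gv As New GridView()
            gv.AutoGenerateColumns = True ' one grid column per DataTable column
            pnlTables.Controls.Add(gv)
            gv.DataSource = dt
            gv.DataBind()
        Next
    End Sub

    ' Hypothetical helper standing in for the converted HTML tables.
    Private Function BuildSampleDataSet() As DataSet
        Dim dt As New DataTable("Table1")
        dt.Columns.Add("Column1")
        dt.Columns.Add("Column2")
        dt.Rows.Add("a", "b")

        Dim ds As New DataSet()
        ds.Tables.Add(dt)
        Return ds
    End Function
End Class
```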

You may also need to include Import statements on your page for the namespaces the code relies on.
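Assuming a VB.NET code-behind file, code that works with DataSet, Regex and GridView would typically need the namespaces below. This list is inferred from the classes named in the article rather than taken from the original listing.

```vbnet
Imports System.Data                      ' DataSet, DataTable, DataRow
Imports System.Text.RegularExpressions   ' Regex, Match, RegexOptions
Imports System.Web.UI.WebControls        ' GridView, Panel
```

If the code lives inline in the .aspx file instead, the equivalent is a page directive per namespace, for example <%@ Import Namespace="System.Data" %>.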

When you run this test page in your development environment with the sample data from the GetHTML function above, you should see the tables rendered on the page.
