HTML-scraping

- July 19, 2011

Ever needed to process pages from a website that offers no structured integration point such as a Web Service, RSS feed or REST API?
The only information available is ill-formed HTML, not even XHTML! I have always used the HTML Agility Pack on the .NET platform for this kind of HTML screen-scraping.
Recently, I found a number of equivalent Java toolkits that do the same:

TagSoup
JSoup
HTML Parser
HTML Cleaner
NekoHTML

I also found a site that collects various toolkits for this purpose.
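To give a feel for what this kind of scraping involves, here is a minimal sketch that pulls links out of deliberately sloppy HTML. It does not use any of the toolkits listed above; it assumes Python and its standard-library `html.parser` module purely for illustration, and the names `LinkExtractor`, `extract_links` and `SAMPLE` are made up for this example.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (href, text) pairs from <a> tags, tolerating sloppy markup."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an anchor that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in the given HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical ill-formed input: unclosed <p>, <li> and <b>, unquoted attribute.
SAMPLE = """<html><body>
<p>Unclosed paragraph
<ul><li><a href="/news/1">First story</a>
<li><a href=/news/2>Second <b>story</a>
</ul>
"""

if __name__ == "__main__":
    for href, text in extract_links(SAMPLE):
        print(href, text)
```

The parser here is event-based and never builds a tree, so unclosed tags do not trip it up. The dedicated toolkits go further and typically repair the markup into a navigable document.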

Comments

Carly Fiorina said…

Hi all,

Screen scraping means acquiring data displayed on screen by capturing the text, either manually with the copy command or via software. Web pages are constantly being screen-scraped in order to save meaningful data for later use. Thanks for sharing it.

Data Scraping Software

04/11/2011, 16:18
