[NTLUG:Discuss] pulling tables out of web pages.

Kevin Brannen kbrannen at pwhome.com
Wed Sep 15 22:14:42 CDT 2004


Rob Apodaca wrote:

>> > I have tried some html2txt tools and have had no success.
>> >
>> > I need to convert a web page into a tab delimited file (preferably
>> > keeping only the data table). My goal is to do several of these pages
>> > and cat them into a big table and delete duplicates.
>> >
>> > I think I can handle most of the problem if I can just convert the html
>> > to a tab delimited text file.
>> >
>> > Anyone know of a reliable tool?
>>    
>>
>
>I think the perl module HTML-TableContentParser is what you want:
>http://search.cpan.org/~sdrabble/HTML-TableContentParser-0.13/TableContentParser.pm
>
>or check cpan for other modules.
>  
>

That is where I'd send you.  There are several Perl modules that take in
HTML, load it into an object/tree structure, and then let you pull the
elements back out in an organized fashion.  I think I've even done
something similar with HTML::Parser, but it's been some time so I'm
having a hard time remembering. :-)  Check out the various HTML::*
modules if you know Perl.
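
To make the "parse it, then pull the cells back out" idea concrete, here
is a rough sketch.  It is deliberately not one of the Perl modules named
above: it swaps in Python's standard html.parser so that it runs without
installing anything.  The names TableExtractor and html_to_tsv are made
up for this example, and it assumes simple, non-nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every table row as a list of rows.

    Hypothetical helper for illustration; assumes tables are not nested.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None:
            # Collapse whitespace runs so stray tabs or newlines inside a
            # cell cannot break the tab-delimited output.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()       # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()      # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        # Text outside a table cell is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def html_to_tsv(html):
    """Return the table cells found in `html` as tab-delimited lines."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join("\t".join(row) for row in parser.rows)
```

Running html_to_tsv over each saved page and concatenating the results
gives one big tab-delimited file; piping that through `sort -u` is one
way to drop exact duplicate rows.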

HTH,
Kevin


