[NTLUG:Discuss] pulling tables out of web pages.
Kevin Brannen
kbrannen at pwhome.com
Wed Sep 15 22:14:42 CDT 2004
Rob Apodaca wrote:
>> >I have tried some html2txt tools and have had no success.
>> >
>> > I need to convert a web page into a tab delimited file (preferably
>> > keeping only the data table). My goal is to do several of these pages
>> > and cat them into a big table and delete duplicates.
>> >
>> > I think I can handle most of the problem if I can just convert the html
>> > to a tab delimited text file.
>> >
>> > Anyone know of a reliable tool?
>>
>>
>
>I think the perl module HTML-TableContentParser is what you want:
>http://search.cpan.org/~sdrabble/HTML-TableContentParser-0.13/TableContentParser.pm
>
>or check cpan for other modules.
>
>
That is where I'd send you. There are several Perl modules that take in
HTML, load it into an object/tree structure, and then let you pull the
elements back out in an organized fashion. I think I've even done
something similar with HTML::Parser, but it's been some time so I'm
having a hard time remembering. :-) Check out the various HTML::*
modules if you know Perl.
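
For anyone who wants to see the general shape of the approach, here is a rough sketch of the same idea: parse the HTML, collect the table cells, and write them back out tab-delimited. It is in Python rather than Perl and uses only the standard library's html.parser, so it is not the HTML::TableContentParser or HTML::Parser API. The names TableExtractor, html_to_tsv and merge_unique are made up for illustration. It assumes the tables are not nested and that every row and cell has a closing tag, so treat it as a starting point only.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by table and row.

    Illustrative helper: assumes no nested tables and closed <tr>/<td> tags.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # list of tables; each table is a list of rows
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (page chrome, paragraphs) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace so each cell stays on one line.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def html_to_tsv(html, table_index=0):
    """Return one table from the page as tab-delimited text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    rows = parser.tables[table_index]
    return "\n".join("\t".join(row) for row in rows)


def merge_unique(tsv_chunks):
    """Concatenate several tab-delimited chunks, dropping duplicate lines."""
    seen = set()
    merged = []
    for chunk in tsv_chunks:
        for line in chunk.splitlines():
            if line and line not in seen:
                seen.add(line)
                merged.append(line)
    return merged
```

Calling html_to_tsv() on each saved page and passing the results to merge_unique() gives the "cat them together and delete duplicates" step from the original question. Repeated header rows are dropped along with the other duplicate lines.
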
HTH,
Kevin