[NTLUG:Discuss] pulling tables out of web pages.
Neil Aggarwal
neil at JAMMConsulting.com
Wed Sep 15 23:56:55 CDT 2004
Rob:
There are also many Java libraries to parse HTML.
For example:
http://htmlparser.sourceforge.net/
does a great job. It gracefully handles HTML that other parsers
just barf on.
Hope this helps,
Neil
--
Neil Aggarwal, JAMM Consulting, (972)612-6056, www.JAMMConsulting.com
FREE! Valuable info on how your business can reduce operating costs by
17% or more in 6 months or less! http://newsletter.JAMMConsulting.com
> -----Original Message-----
> From: discuss-bounces at ntlug.org
> [mailto:discuss-bounces at ntlug.org] On Behalf Of Kevin Brannen
> Sent: Wednesday, September 15, 2004 10:15 PM
> To: NTLUG Discussion List
> Subject: Re: [NTLUG:Discuss] pulling tables out of web pages.
>
>
> Rob Apodaca wrote:
>
> >> > I have tried some html2txt tools and have had no success.
> >> >
> >> > I need to convert a web page into a tab delimited file (preferably
> >> > keeping only the data table). My goal is to do several of these
> >> > pages and cat them into a big table and delete duplicates.
> >> >
> >> > I think I can handle most of the problem if I can just convert
> >> > the html to a tab delimited text file.
> >> >
> >> > Anyone know of a reliable tool?
> >>
> >>
> >
> >I think the perl module HTML-TableContentParser is what you want:
> >http://search.cpan.org/~sdrabble/HTML-TableContentParser-0.13/TableContentParser.pm
>
>or check cpan for other modules.
>
>
That is where I'd send you. There are several Perl modules that take in
HTML, load it into an object/tree structure, and then let you pull the
elements back out in an organized fashion. I think I've even done
something similar with HTML::Parser, but it's been some time so I'm
having a hard time remembering. :-) Check out the various HTML::*
modules if you know Perl.
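If you'd rather see the basic idea without the Perl module dependencies,
the same approach works with any event-driven HTML parser. Here's a
minimal sketch using Python's standard-library html.parser (not one of
the tools mentioned above, just an illustration): collect the text of
each <td>/<th> cell and join each <tr>'s cells with tabs.

```python
from html.parser import HTMLParser

class TableToTSV(HTMLParser):
    """Emit one tab-delimited line per <tr>, one field per <td>/<th>."""
    def __init__(self):
        super().__init__()
        self.rows = []       # finished tab-delimited lines
        self._row = None     # cells of the <tr> being built, or None
        self._cell = None    # text chunks of the cell being built, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append("\t".join(self._row))
            self._row = None

html = ("<table><tr><th>Name</th><th>Qty</th></tr>"
        "<tr><td>foo</td><td>3</td></tr></table>")
p = TableToTSV()
p.feed(html)
print("\n".join(p.rows))   # Name<TAB>Qty, then foo<TAB>3
```

Cat the output of several pages together and a `sort -u` (or the Perl
equivalent) takes care of the duplicates. Real-world pages with nested
tables or badly broken markup are exactly where the dedicated parsers
the others mentioned earn their keep.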
HTH,
Kevin
_______________________________________________
More information about the Discuss mailing list:
https://ntlug.org/mailman/listinfo/discuss