[NTLUG:Discuss] pulling tables out of web pages.
Neil Aggarwal
neil at JAMMConsulting.com
Wed Sep 15 23:56:55 CDT 2004
Rob:
There are also many Java libraries to parse HTML.
For example:
http://htmlparser.sourceforge.net/
does a great job. It gracefully handles HTML that other parsers
just barf on.
Hope this helps,
Neil
--
Neil Aggarwal, JAMM Consulting, (972)612-6056, www.JAMMConsulting.com
FREE! Valuable info on how your business can reduce operating costs by
17% or more in 6 months or less! http://newsletter.JAMMConsulting.com
> -----Original Message-----
> From: discuss-bounces at ntlug.org
> [mailto:discuss-bounces at ntlug.org] On Behalf Of Kevin Brannen
> Sent: Wednesday, September 15, 2004 10:15 PM
> To: NTLUG Discussion List
> Subject: Re: [NTLUG:Discuss] pulling tables out of web pages.
>
>
> Rob Apodaca wrote:
>
> >> > I have tried some html2txt tools and have had no success.
> >> >
> >> > I need to convert a web page into a tab delimited file (preferably
> >> > keeping only the data table). My goal is to do several of these
> >> > pages and cat them into a big table and delete duplicates.
> >> >
> >> > I think I can handle most of the problem if I can just convert
> >> > the html to a tab delimited text file.
> >> >
> >> > Anyone know of a reliable tool?
> >>
> >>
> >
> >I think the perl module HTML-TableContentParser is what you want:
> >http://search.cpan.org/~sdrabble/HTML-TableContentParser-0.13/TableContentParser.pm
>
>or check cpan for other modules.
>
>
That is where I'd send you. There are several Perl modules that take in
HTML, load it into an object/tree structure, and then let you pull the
elements back out in an organized fashion. I think I've even done
something similar with HTML::Parser, but it's been some time so I'm
having a hard time remembering. :-) Check out the various HTML::*
modules if you know Perl.
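If you'd rather see the basic idea without the Perl module dependencies,
the same approach works with any event-driven HTML parser. Here's a
minimal sketch using Python's standard-library html.parser (not one of
the tools mentioned above, just an illustration): collect the text of
each <td>/<th> cell and join each <tr>'s cells with tabs.

```python
from html.parser import HTMLParser

class TableToTSV(HTMLParser):
    """Emit one tab-delimited line per <tr>, one field per <td>/<th>."""
    def __init__(self):
        super().__init__()
        self.rows = []       # finished tab-delimited lines
        self._row = None     # cells of the <tr> being built, or None
        self._cell = None    # text chunks of the cell being built, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append("\t".join(self._row))
            self._row = None

html = ("<table><tr><th>Name</th><th>Qty</th></tr>"
        "<tr><td>foo</td><td>3</td></tr></table>")
p = TableToTSV()
p.feed(html)
print("\n".join(p.rows))   # Name<TAB>Qty, then foo<TAB>3
```

Cat the output of several pages together and a `sort -u` (or the Perl
equivalent) takes care of the duplicates. Real-world pages with nested
tables or badly broken markup are exactly where the dedicated parsers
the others mentioned earn their keep.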
HTH,
Kevin
_______________________________________________
More information about the Discuss mailing list:
https://ntlug.org/mailman/listinfo/discuss