[NTLUG:Discuss] pulling tables out of web pages.

David Camm dcamm at advwebsys.com
Wed Sep 15 16:49:39 CDT 2004


bobby wrote:
 >I have tried some html2txt tools and have had no success.
 >
 > I need to convert a web page into a tab delimited file (preferably
 > keeping only the data table). My goal is to do several of these pages
 > and cat them into a big table and delete duplicates.
 >
 > I think I can handle most of the problem if I can just convert the html
 > to a tab delimited text file.
 >
 > Anyone know of a reliable tool?
 >
 > Here is a sample of the web pages I am working on:
 > http://partsurfer.hp.com/cgi-bin/spi/main?sel_flg=partlist&model=KAYAK+XU+6%2F266MT&HP_model=&modname=Kayak+XU+6%2F266MT&template=secondary&plist_sval=ALL&plist_styp=flag&dealer_id=&callingsite=&keysel=X&catsel=X&ptypsel=X&strsrch=&pictype=I&picture=X&uniqpic=

 >TIA
 >Bobby

one heck of a url, sir :-)

i'm not aware of any general-purpose tool that can do what you want. we have 
written many screen-scraper programs in perl (with the permission of the page's 
owner, of course) that grab a document using lwp-request and then parse the result.
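for what it's worth, here's a rough sketch of the parsing half in python (the same idea works in perl with HTML::Parser or HTML::TableExtract from cpan). the class and function names are just mine for illustration. the trick is to track how deep into <td>/<tr> we are, so text inside an embedded table gets folded into the enclosing cell instead of breaking the row:

```python
from html.parser import HTMLParser

class TableToTSV(HTMLParser):
    """turn html tables into tab-delimited rows.  tracks nesting depth
    of <tr> and <td>/<th> so an embedded table's text lands inside the
    enclosing cell instead of starting a bogus new row."""
    def __init__(self):
        super().__init__()
        self.rows = []       # finished tab-delimited lines
        self.row = None      # cells of the row being built
        self.cell = None     # text chunks of the cell being built
        self.td_depth = 0
        self.tr_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.tr_depth += 1
            if self.tr_depth == 1:
                self.row = []
        elif tag in ('td', 'th'):
            self.td_depth += 1
            if self.td_depth == 1:
                self.cell = []

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self.td_depth:
            self.td_depth -= 1
            if self.td_depth == 0:
                # collapse runs of whitespace inside the cell
                self.row.append(' '.join(' '.join(self.cell).split()))
                self.cell = None
        elif tag == 'tr' and self.tr_depth:
            self.tr_depth -= 1
            if self.tr_depth == 0:
                if self.row:
                    self.rows.append('\t'.join(self.row))
                self.row = None

    def handle_data(self, data):
        if self.td_depth:
            self.cell.append(data)

def html_table_to_tsv(html):
    parser = TableToTSV()
    parser.feed(html)
    return '\n'.join(parser.rows)
```

obviously you'd still want to filter out the navigation tables and keep only the parts table, but that's page-specific grunt work.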

this particular page presents some interesting problems, since the table rows 
containing the part name and description cells have embedded tables.

unless someone on the list knows of a tool that parses html and returns the 
contents of specific structures, i'm afraid you're in for some custom programming.

david camm
advanced web systems



