[NTLUG:Discuss] pulling tables out of web pages.
Burton M. Strauss III
Burton_Strauss at comcast.net
Thu Apr 8 15:57:50 CDT 2004
For crud like this, I usually use AWK.
Just convert each <td> to a space, each </td> to a ',', and each </tr> to a
newline. Throw everything else between <>s away.
{
    gsub(/<[Tt][Dd][^>]*>/, " ")   # opening <td>, with or without attributes -> space
    gsub(/<\/[Tt][Dd]>/, ",")      # </td> -> field separator
    gsub(/<\/[Tt][Rr]>/, "\n")     # </tr> -> end of row
    gsub(/<[^>]*>/, "")            # throw away every other tag
    print $0
}
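If that action is saved as, say, striptable.awk (the file name is just
for illustration), running it is a one-liner:

awk -f striptable.awk page.html > page.csv

Swap the "," in the second gsub() for "\t" if you want tab-delimited
output instead of commas.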
Then I usually fancy it up with a read-ahead loop to handle tags that aren't
closed on a single line, plus something to delete scripts, comments, etc. A
rough sketch of that is below.
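Something along these lines shows the idea; it is strictly a sketch, not a
real HTML parser, and the .* matches are greedy, so it assumes scripts and
comments don't bracket the table data you care about:

{
    line = $0

    # read ahead while the accumulated text still ends inside an
    # unclosed tag, an unfinished comment, or an open <script> block
    while (((line ~ /<[^>]*$/) ||
            (line ~ /<!--/ && line !~ /-->/) ||
            (tolower(line) ~ /<script/ && tolower(line) !~ /<\/script>/)) &&
           (getline nextline) > 0) {
        line = line " " nextline
    }

    gsub(/<!--.*-->/, "", line)         # comments
    gsub(/<[Ss][Cc][Rr][Ii][Pp][Tt][^>]*>.*<\/[Ss][Cc][Rr][Ii][Pp][Tt]>/, "", line)   # scripts

    gsub(/<[Tt][Dd][^>]*>/, " ", line)  # <td ...> -> space
    gsub(/<\/[Tt][Dd]>/, ",", line)     # </td>    -> ,
    gsub(/<\/[Tt][Rr]>/, "\n", line)    # </tr>    -> newline
    gsub(/<[^>]*>/, "", line)           # everything else between <>s -> gone
    print line
}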
If you are able to pay for this, contact me off list.
-----Burton
> -----Original Message-----
> From: discuss-bounces at ntlug.org [mailto:discuss-bounces at ntlug.org]On
> Behalf Of Bobby Wrenn
> Sent: Thursday, April 08, 2004 2:34 PM
> To: NTLUG Discussion List
> Subject: [NTLUG:Discuss] pulling tables out of web pages.
>
>
> I have tried some html2txt tools and have had no success.
>
> I need to convert a web page into a tab delimited file (preferably
> keeping only the data table). My goal is to do several of these pages
> and cat them into a big table and delete duplicates.
>
> I think I can handle most of the problem if I can just convert the html
> to a tab delimited text file.
>
> Anyone know of a reliable tool?
>
> Here is a sample of the web pages I am working on:
> http://partsurfer.hp.com/cgi-bin/spi/main?sel_flg=partlist&model=KAYAK+XU+6%2F266MT&HP_model=&modname=Kayak+XU+6%2F266MT&template=secondary&plist_sval=ALL&plist_styp=flag&dealer_id=&callingsite=&keysel=X&catsel=X&ptypsel=X&strsrch=&pictype=I&picture=X&uniqpic=
>
> TIA
> Bobby
>
>
> _______________________________________________
> https://ntlug.org/mailman/listinfo/discuss