[NTLUG:Discuss] wget
Robert Pearson
e2eiod at gmail.com
Thu Mar 26 12:01:47 CDT 2009
On 3/26/09, Leroy Tennison <leroy_tennison at prodigy.net> wrote:
> Anybody have experience using wget to:
> download a whole site which contains references to other sites
> get it to download only the site specified?
>
> I ran wget without the -H switch and it started downloading other sites.
> Tried using "--exclude-domain ..." which was ignored. Tried using -D
> ... and all wget would download was index.html even though I had
> specified -r -p -l inf and a couple of other switches.
>
> Another option is if anyone can tell me of an easy way to determine
> which files on a web site aren't being referenced. I maintain a web
> site I inherited and there's a lot of "history" which needs to be
> addressed. I also need to find out if a file I need to add to the site
> is in fact already being referenced (and where). Since I have to use
> Windows at work I tried FrontPage for unlinked files; it included a page
> which is referenced in index.html - so much for that approach. Another
> program I found on the Web did the same thing, which is when I turned to
> wget, only to encounter this problem. Any help would be much appreciated.
>
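[Aside: if I remember right, the -D / --exclude-domains filters only take effect
once host spanning (-H) is switched on, which may be why they seemed to be
ignored. A rough sketch of an invocation that keeps the recursion to the one
domain (the URL/domain below is just a placeholder, and the switches are from
memory, so treat it as a starting point rather than a recipe):
wget -r -l inf -p -H -D some.web.site http://some.web.site/ ]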
[From my personal NTLUG email archive of other people's solutions]
[General answer]
I'd like to thank all responders! 'wget' and 'httrack' were both very
useful answers.
'wget' works pretty well for static pages without scripting, and for
simplicity it just works best. I forget exactly which options I selected, but I
believe they were for 'recursion', 'mirroring' and 'updating links.' I
didn't experience wget going off into other servers or root levels.
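For what it's worth, the invocation I have in mind would have looked something
like this (the URL is just a placeholder; -m handles the recursive mirroring
and -k rewrites the links for local viewing):
wget -m -k http://some.web.site/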
'httrack', on the other hand, was VERY thorough. It did go into other
domains. Again, I'm a little hazy about the options I actually selected,
though I do recall trying to limit the range of information it collected.
In the end, I just let it run its course and checked later and erased the
stuff I didn't want.
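My attempt at limiting httrack probably looked roughly like this (the output
directory and domain are placeholders; the "+" filter is what is supposed to
keep it within the one domain):
httrack "http://some.web.site/" -O /tmp/mirror "+*.some.web.site/*" -v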
I will likely use both solutions at different times depending on which
solution fits the need.
[Specific answers]
The reply from Terry to the current question is the majority consensus
solution from the archive... "wget -m"
(1)----------------------------------------------------------------------------
wget --mirror --no-verbose http://some.web.site/
or
wget --mirror --no-verbose http://some.web.site/ 2>&1 | sed -e 's/^.* URL://' -e 's/ \[.*//'
You must "rm -rf some.web.site" before subsequent runs as it won't
retry existing pages.
NOTE: this will make a local copy of the site at the same time.
(2)----------------------------------------------------------------------------
There are a lot of wget options, like -np (no parent) and -I and -X, that
can be used to control the recursion a bit. No guarantees... but they might
help.
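A sketch of how those might be combined (the directory names and URL are made
up for illustration; -np stops wget climbing above the starting directory,
while -I and -X whitelist and blacklist directories respectively):
wget -r -np -I /docs,/images -X /cgi-bin http://some.web.site/docs/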
(3)----------------------------------------------------------------------------
The one I use most often is '-m', which does whatever is necessary to
"mirror" the site - so it gets you all the pages from that server - but
not things it links to outside of that server.
The '-l' option allows you to restrict the 'depth' of recursive
retrieval.
But I think you may want '-p', which loads the page you request AND
everything necessary to properly display it (images it references,
etc.).
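For instance, to fetch a page two levels deep along with everything needed to
display those pages (placeholder URL and page name, untested but this is the
general shape):
wget -r -l 2 -p http://some.web.site/somepage.html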
(4)----------------------------------------------------------------------------
HTH