[NTLUG:Discuss] wget
Robert Pearson
e2eiod at gmail.com
Thu Mar 26 12:01:47 CDT 2009
On 3/26/09, Leroy Tennison <leroy_tennison at prodigy.net> wrote:
> Anybody have experience using wget to:
> download a whole site which contains references to other sites
> get it to download only the site specified?
>
> I ran wget without the -H switch and it started downloading other sites.
> Tried using "--exclude-domain ..." which was ignored. Tried using -D
> ... and all wget would download was index.html even though I had
> specified -r -p -l inf and a couple of other switches.
>
> Another option is if anyone can tell me of an easy way to determine
> which files on a web site aren't being referenced. I maintain a web
> site I inherited and there's a lot of "history" which needs to be
> addressed. I also need to find out if a file I need to add to the site
> is in fact already being referenced (and where). Since I have to use
> Windows at work I tried FrontPage for unlinked files; it included a page
> which is referenced in index.html - so much for that approach. Another
> program I found on the Web did the same thing, which is when I turned to
> wget, only to encounter this problem. Any help would be much appreciated.
>
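[Aside: if I remember right, the -D / --exclude-domains filters only take effect
once host spanning (-H) is switched on, which may be why they seemed to be
ignored. A rough sketch of an invocation that keeps the recursion to the one
domain (the URL/domain below is just a placeholder, and the switches are from
memory, so treat it as a starting point rather than a recipe):
wget -r -l inf -p -H -D some.web.site http://some.web.site/ ]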
[From my personal NTLUG email archive of other people's solutions]
[General answer]
I'd like to thank all responders! 'wget' and 'httrack' were both very
useful answers.
'wget' works pretty well for static pages without scripting, and for
simplicity it just works best. I forget exactly which options I selected, but I
believe they were for 'recursion', 'mirroring' and 'updating links.' I
didn't experience wget going off into other servers or root levels.
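For what it's worth, the invocation I have in mind would have looked something
like this (the URL is just a placeholder; -m handles the recursive mirroring
and -k rewrites the links for local viewing):
wget -m -k http://some.web.site/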
'httrack', on the other hand, was VERY thorough. It did go into other
domains. Again, I'm a little hazy about the options I actually selected,
though I do recall trying to limit the range of information it collected.
In the end, I just let it run its course and checked later and erased the
stuff I didn't want.
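My attempt at limiting httrack probably looked roughly like this (the output
directory and domain are placeholders; the "+" filter is what is supposed to
keep it within the one domain):
httrack "http://some.web.site/" -O /tmp/mirror "+*.some.web.site/*" -v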
I will likely use both solutions at different times depending on which
solution fits the need.
[Specific answers]
The reply from Terry to the current question is the majority consensus
solution from the archive... "wget -m"
(1)----------------------------------------------------------------------------
wget --mirror --no-verbose http://some.web.site/
or
wget --mirror --no-verbose http://some.web.site/ 2>&1 | sed -e 's/^.* URL://' -e 's/ \[.*//'
You must "rm -rf some.web.site" before subsequent runs as it won't
retry existing pages.
NOTE: this will make a local copy of the site at the same time.
(2)----------------------------------------------------------------------------
There are a lot of wget options, like -np (no parent) and -I and -X, that
can be used to control the recursion a bit. No guarantees... but they might
help.
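A sketch of how those might be combined (the directory names and URL are made
up for illustration; -np stops wget climbing above the starting directory,
while -I and -X whitelist and blacklist directories respectively):
wget -r -np -I /docs,/images -X /cgi-bin http://some.web.site/docs/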
(3)----------------------------------------------------------------------------
The one I use most often is '-m', which does whatever is necessary to
"mirror" the site - so it gets you all the pages from that server - but
not things it links to outside of that server.
The '-l' option allows you to restrict the 'depth' of recursive
retrieval.
But I think you may want '-p', which loads the page you request AND
everything necessary to properly display it (images it references,
etc.).
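For instance, to fetch a page two levels deep along with everything needed to
display those pages (placeholder URL and page name, untested but this is the
general shape):
wget -r -l 2 -p http://some.web.site/somepage.html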
(4)----------------------------------------------------------------------------
HTH