[NTLUG:Discuss] wget
Michael Barnes
barnmichael at gmail.com
Wed May 27 10:35:09 CDT 2009
> On Thu, May 21, 2009 at 9:23 AM, Tom Tumelty <tomtumelty at gmail.com> wrote:
>> I am trying to download a website using wget.
>> It always just downloads about 4 files, the image directory, and 4 or 5
>> images.
>>
>> I have tried :
>>
>> wget -r http://www.decaturjetcenter.com
>>
>> wget -rc http://www.decaturjetcenter.com
>> and get same results either way.
>> I don't think the robots.txt file is causing the problem.
>> Any idea what I am doing wrong?
>> Thanks in advance,
>> Tom
I used this command string, adapted from the tips column of Linux
Journal (http://www.linuxjournal.com/content/downloading-entire-web-site-wget):
$ wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--domains decaturjetcenter.com \
--no-parent \
decaturjetcenter.com
and pulled down 73 files. This command will not go outside the
decaturjetcenter.com domain. Taking a quick look at a couple of the
files, I see lots of out-of-domain references, so you may want to
make some adjustments accordingly. The article referenced above gives
more details.
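
If some of those external references (externally hosted images, for
example) are content you actually want, one possible adjustment is to
let wget span hosts while restricting it to a whitelist of domains.
A rough sketch; the second domain is just a placeholder for whatever
host the references point to:

$ wget \
--recursive \
--no-clobber \
--page-requisites \
--html-extension \
--convert-links \
--restrict-file-names=windows \
--span-hosts \
--domains decaturjetcenter.com,example-cdn.com \
--no-parent \
decaturjetcenter.com

Note that without --span-hosts, wget never leaves the starting host
during a recursive crawl, so the --domains list only comes into play
once host spanning is turned on.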
I have used this many times to pull website data. In some cases,
depending on how the site is built, I have been able to pull all the
data from a site that normally requires a login without logging in at all.
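
For sites where you do have to log in, wget can also send credentials
itself. A rough sketch, assuming HTTP basic authentication in the
first case and a cookie file exported from a browser session in the
second; the user name, password, URL, and cookie file name are only
placeholders:

# site protected by HTTP basic authentication
$ wget --recursive --no-parent \
--http-user=myuser --http-password=mypassword \
http://www.example.com/members/

# site using cookie-based sessions: log in with a browser first,
# export the cookies, then hand them to wget
$ wget --recursive --no-parent \
--load-cookies cookies.txt \
http://www.example.com/members/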
HTH,
Michael