[NTLUG:Discuss] wget

Michael Barnes barnmichael at gmail.com
Wed May 27 10:35:09 CDT 2009


> On Thu, May 21, 2009 at 9:23 AM, Tom Tumelty <tomtumelty at gmail.com> wrote:
>> I am trying to download a website using wget.
>> It always just downloads about 4 files, the image directory, and 4 or 5
>> images.
>>
>> I have tried :
>>
>> wget -r http://www.decaturjetcenter.com
>>
>> wget -rc http://www.decaturjetcenter.com
>> and get same results either way.
>> I don't think the robots.txt file is causing the problem.
>> Any idea what I am doing wrong?
>> Thanks in advance,
>> Tom


I used this command string, which is adapted from the tips column of
Linux Journal
(http://www.linuxjournal.com/content/downloading-entire-web-site-wget):

$ wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --domains decaturjetcenter.com \
     --no-parent \
         decaturjetcenter.com

and pulled down 73 files.  This command will not go outside the
decaturjetcenter.com domain.  A quick look at a couple of the files
shows plenty of out-of-domain references, so you may want to adjust
the --domains list accordingly; one possible tweak is sketched below.
The article referenced above gives more details.
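
For example, if some of those off-site references point to content you
actually want (images hosted on another server, say), one adjustment
would be to add --span-hosts and list the extra hosts in --domains.
This is only a sketch; the second hostname here is a placeholder for
whatever off-site host the pages really reference:

$ wget \
     --recursive \
     --no-clobber \
     --page-requisites \
     --html-extension \
     --convert-links \
     --restrict-file-names=windows \
     --span-hosts \
     --domains decaturjetcenter.com,images.example.com \
     --no-parent \
         decaturjetcenter.com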

I have used this many times to pull website data.  In some cases,
depending on how the site is constructed, I have been able to pull
all the data from a site that normally requires a login without
actually logging in.
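
If a site does enforce its login, wget can usually carry a session
cookie along instead.  Very roughly, and with the login URL and form
field names below being placeholders you would need to take from the
site's actual login form:

$ wget --save-cookies cookies.txt \
     --keep-session-cookies \
     --post-data 'user=USERNAME&password=PASSWORD' \
     http://www.example.com/login

$ wget --recursive --no-parent \
     --convert-links \
     --load-cookies cookies.txt \
     http://www.example.com/members/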


HTH,

Michael


