[NTLUG:Discuss] writing to a USB drive locks up server..

Richard ntlug at rain4us.net
Wed Aug 20 23:56:44 CDT 2008


While I've seen my dirvish banks on reiserfs-formatted drives become 
corrupted and lock up a server, I had never seen it with ext2/3 
drives.  I thought I had just run across a case of ext2/3 file system 
corruption causing a server hang, but now I'm beginning to wonder.

The dirvish restore to the new server hardware went smoothly (mostly 
without hiccups; there were a few drivers I had to compile for the new 
SCSI card and NICs) and I was looking forward to smooth sailing.  
Unfortunately that hasn't been the case.  I have been having issues on 
this new hardware.  Attempting heavy write access to a USB drive 
containing one of my banks causes the server to lock up.  The file 
system contained errors, and since I had a backup copy of the vaults in 
that bank I decided to back up my vault configs and reformat the drive 
fresh.  I had previously reformatted reiserfs filesystems to 'fix' 
corruption that caused lockups, and I was surprised that I seemed to 
have the same issue with ext2/3.  When I kicked off a reformat on the 
dirvish bank drive, the server wrote about 147 of its inode allocations 
and then just paused.  At first the server was still pingable, but that 
quickly deteriorated.

The Num Lock key still worked but the console was unresponsive.  The 
Magic SysRq commands allowed me to Sync, Unmount and reBoot the server 
mostly gracefully, but now I am wondering what technical situations 
could lead to a server hanging on USB disk access.

The Slackware 9.1 install with the 2.4.20 kernel that is running was 
stable on the old hardware (yes, I know... that was the OLD hardware).  
I fear that a kernel upgrade will be necessary on this new hardware, but 
I'm hoping someone else on the list has seen a similar problem and can 
offer suggestions.  I am not looking forward to getting an upgraded 
kernel patched, compatible and ready to run Dead-Gateway-Detection 
(DGD), mppe, and uml processes only to find that the problem is hardware 
related, BIOS-setting related or has some other such cause.

Troubleshooting steps taken so far include removing the add-on USB PCI 
card (to solve the NMI errors that were being seen in the log files) and 
disabling SMP in the kernel (so that processes on the server would quit 
going into uninterruptible sleep, state D).
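
For reference, state D is what ps and /proc/<pid>/stat report for a 
process blocked in uninterruptible sleep, usually waiting on I/O.  A 
minimal sketch of how one might list such processes, assuming a 
Linux-style /proc and a Python 3 interpreter (the script and its function 
names, parse_stat_line and find_blocked, are illustrative only and are 
not part of the setup described here):

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first '(' and the last ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren].strip())
    comm = line[open_paren + 1:close_paren]
    # The single-letter process state is the first field after the name.
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def find_blocked(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or its stat file was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)


if __name__ == "__main__":
    for pid, comm in find_blocked():
        print(pid, comm)
```

If this were run repeatedly during a heavy write to the USB drive, a 
growing list of D-state processes would suggest, though not prove, that 
they are stuck waiting on that device.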

There were no syslog/debug/messages log entries during this latest 
'lockup during a mke2fs' incident.

What are my next troubleshooting steps?
Can one surgically upgrade just usb drivers and utilities?

-- 
Richard


