[NTLUG:Discuss] "You Don't Exist: Go Away!"

Sat Nov 4 06:11:40 CST 2000

On Fri, 03 Nov 2000 07:14:56 CST, the world broke into rejoicing as
MadHat <madhat at unspecific.com>  said:
>Christopher Browne wrote:
>>On Thu, 02 Nov 2000 10:25:05 CST, the world broke into rejoicing as
>>MadHat <madhat at unspecific.com> said:
>>>MadHat wrote:
>>>>Christopher Browne wrote:
>>>>>I am having an unfortunate situation where a machine periodically gets
>>>>>somewhat 'wedged up' such that:
>>>>>  a) Port services that check for user IDs die;
>>>>>  b) Permissions on files apparently "disappear";
>>>>>  c) Pretty much anything that checks IDs against /etc/passwd gets
>>>>>     hosed.
>>>>>
>>>>>This does _not_ appear to be the result of a hack; it seems moderately
>>>>>"time based," probably relating to some resource filling up thereby
>>>>>making {utmp|PAM} throw up.
>>>>>
>>>>>Other interesting facts:
>>>>>- It seems to happen _around_ once a day.  But not greatly predictable.
>>>>>   Oct 27, 03:38
>>>>>   Oct 27, 21:32
>>>>>   Oct 30, 08:02
>>>>>   Oct 31, about 1:56am
>>>>>   Nov 1, between 9:00 and 9:02 pm.
>>>>>
>>>>>- I don't need to reboot to get everything to "reset;" if I drop to
>>>>>  runlevel 1 via "init 1," and then head back to "init 3", this seems
>>>>>  to suffice to clear things up.
>>>>>
>>>>>- Debian Unstable Pretty Much Up To Date.
>>>>>  Linux knuth 2.2.14 #5 Sat May 6 07:29:45 CDT 2000 i586 unknown
>>>>>
>>>>>The two things I've seen looking on Google that match the symptoms are:
>>>>>
>>>>>a) "Oops.  You deleted /etc/passwd."
>>>>>
>>>>>   Not the case.
>>>>>
>>>>>b) Something vague involving utmp being "somehow messed up."
>>>>>
>>>>>Anyone run into this sort of thing before?
>>>>
>>>>kind of...  My problem was bad nodes on the drive, but it took a fsck to
>>>>fix.  The drive was going bad and was losing data on the section of the
>>>>disk that held the /etc.
>>>>
>>>>Because an 'init 1' & 'init 3' seem to be the problem, that does point
>>>>more towards the software not hardware...  what Kernel you running?
>>>
>>>This should have read because the init 1 and init 3 seem to _FIX_ the
>>>problem, that doesn't point towards hardware, but more towards software.
>>>
>>>Need more caffeine.
>>
>>Need more blood with your caffeine level?  I follow that...
>>
>>It certainly seems to be a software issue, and the fact that changing
>>runlevels "fixes" it seems suggestive that the problem is not with the kernel.
>> (2.2.14, as mentioned up there somewhere...)
>
>D'OH!!!  sorry...  I was hoping it was something newer, so we could
>blame that. [[|:^)
>
>init to another level doesn't remount any file systems, correct?  It
>just starts and stops daemons right?  what daemons are you running?  Is
>there something there that might be causing a problem?  anything running
>in a cron that might be causing it?

It just "bit" again, and I headed to /etc/rc3.d to see the full list
of services that might get restarted when I change runlevels.
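
For anyone wanting to do the same survey without eyeballing the directory, here is a minimal sketch in Python. It assumes the usual SysV layout where start links are named "S", two digits, then the service name; the function name runlevel_services is made up for illustration.

```python
import os
import re

def runlevel_services(rc_dir="/etc/rc3.d"):
    """Return (order, name) for each S?? start link in a SysV rc directory.

    Assumes the conventional naming scheme S<two digits><service>;
    K* (kill) links and anything else are skipped.
    """
    services = []
    for entry in sorted(os.listdir(rc_dir)):
        match = re.fullmatch(r"S(\d\d)(.+)", entry)
        if match:
            services.append((int(match.group(1)), match.group(2)))
    return services
```

Calling runlevel_services() and printing each pair gives the start order and name of everything that gets bounced on the way back up to runlevel 3.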

I went through the restarted services looking for any that either
appeared to _potentially_ relate, or that I couldn't identify.

Bingo!  
       nscd 2.1.96-1; the GLIBC Name Service Cache Daemon

Looking at <http://www.debian.org/Bugs/> for reports on nscd provides
some quite useful fodder; apparently this cache daemon can get
"wedged" after a while if crond is hitting it every minute to check
user IDs.

On the one hand, nscd appears to be an optional component; I'm turning
it off for a while to see how things run with it _shut off_; that
seems to be a benign change thus far, and may work out OK.

Otherwise, I may set up a cfengine rule to periodically check to see
if it's running well, and restart the service if it is "unhappy."
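
A rough sketch of what such a check might look like, written in Python rather than as a cfengine rule. The assumptions here are mine: that a wedged nscd shows up as a passwd lookup that fails or hangs, that a five-second timeout is reasonable, and that "/etc/init.d/nscd restart" is the right restart command on the box. The names lookup_works and check_and_restart are hypothetical.

```python
import subprocess
import sys

def lookup_works(uid=0, timeout=5.0):
    """Return True if a fresh process can resolve uid to a passwd entry in time.

    The lookup runs in a child process so that a hang can be cut off
    by the timeout instead of wedging the checker itself.
    """
    probe = f"import pwd; pwd.getpwuid({int(uid)})"
    try:
        result = subprocess.run(
            [sys.executable, "-c", probe],
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def check_and_restart(restart_cmd=("/etc/init.d/nscd", "restart"),
                      probe=lookup_works):
    """Restart the daemon if the probe fails; return True if a restart was issued.

    restart_cmd is an assumption about the local init script; adjust to taste.
    """
    if probe():
        return False
    subprocess.run(list(restart_cmd), check=False)
    return True
```

Run from cron every few minutes, this would only touch the daemon when a lookup actually fails, which seems preferable to restarting it blindly on a schedule.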
--
cbbrowne at ntlug.org - <http://www.hex.net/~cbbrowne/linux.html>
Rules of the Evil Overlord #78. "I will not tell my Legions of Terror
"And he must be taken alive!" The command will be: ``And try to take him
alive if it is reasonably practical.''" <http://www.eviloverlord.com/>