From: Linus Torvalds (torvalds@klaava.Helsinki.FI)
Date: 08/10/93


From: torvalds@klaava.Helsinki.FI (Linus Torvalds)
Subject: Re: Are any SIMMs cheap these day$ ?
Date: 10 Aug 1993 23:28:24 +0300

In article <6737@sixhub.uucp> davidsen@tmr.com (bill davidsen) writes:
>
>There's no reason to bring the system down just because you catch the
>parity error. If it's in the kernel you probably should, and the
>decision to write buffers which may be corrupted vs just going down is
>one you have to make. If the error is in the user area it gets more
>interesting.

While this would be nice, the problem is that it's essentially
impossible to find out *where* the error occurred on PC machinery: as
far as I can tell, the hardware just sends the NMI (and as others have
mentioned, even that isn't guaranteed), and trying to find out *why* the
NMI happened would seem close to impossible.

If it's a hard error, you can naturally just try to access every piece
of RAM in the system when you get an NMI and hope you get another one
and can look where that happened. Even then, caching etc. can foil this
plan, so it's not as simple as it was on early PCs. And hard errors
are seldom the big problem anyway: if it's truly hard, it should have
been caught by the BIOS POST routines in any case.
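
To make the "touch every piece of RAM" idea concrete, here is a rough
user-space sketch in C. It is not kernel code: sweep_memory() is a
made-up name, and it walks an ordinary buffer rather than physical
memory. On real hardware the hope would be that one of the reads raises
a second NMI; nothing here can stop a cache from answering the read
without going to DRAM, which is exactly the catch mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Read every word in [base, base + words) exactly once.  The volatile
 * qualifier keeps the compiler from optimizing the reads away.
 * Hypothetical helper for illustration only: it just returns how many
 * words it touched.
 */
size_t sweep_memory(const volatile unsigned long *base, size_t words)
{
    size_t touched = 0;

    assert(base != NULL || words == 0);
    for (size_t i = 0; i < words; i++) {
        (void)base[i];          /* the read itself is the point */
        touched++;
    }
    return touched;
}
```

Calling sweep_memory(buf, n) on an n-word buffer simply returns n; any
interesting effect would have to come from the hardware, not from the
return value.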

So, the main reason for parity is to catch soft errors, and when all you
get is an NMI you don't have much to go on. The naive way to handle this
would be to check what we were doing when the NMI came up (disassemble
the instruction that the eip on the stack points to), but in fact you'd
have to try to find the *previous* instruction, and as if that wasn't
enough on a machine with variable-length instructions, you wouldn't know
which part of the instruction caused the error anyway (maybe it was the
instruction itself, maybe it was the prefetch for the next instruction,
maybe it was a DMA read that was going on..)
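
The "find the previous instruction" problem can be shown with a toy
example. The encoding below is an assumption made up for illustration
(it is NOT real x86): the low two bits of an instruction's first byte
give its length minus one. count_prev_candidates() is likewise a
hypothetical helper. It counts how many byte offsets before eip could
each be the start of an instruction that ends exactly at eip.

```c
#include <assert.h>
#include <stddef.h>

/* Toy rule (assumed, not x86): length is 1..4 bytes, taken from the
 * low two bits of the first byte. */
static size_t toy_insn_len(unsigned char first_byte)
{
    return (size_t)(first_byte & 3u) + 1;
}

/*
 * Count the offsets before `eip` that decode as an instruction ending
 * exactly at `eip`.  More than one hit means the previous instruction
 * cannot be identified from the bytes alone.
 */
size_t count_prev_candidates(const unsigned char *code, size_t eip)
{
    size_t hits = 0;

    assert(code != NULL);
    for (size_t back = 1; back <= 4 && back <= eip; back++) {
        if (toy_insn_len(code[eip - back]) == back)
            hits++;
    }
    return hits;
}
```

With the bytes 0x03 0x02 0x01 0x00 and eip = 4, all four earlier offsets
are plausible starting points, so even this trivial encoding can't say
which instruction ran last. Real instruction sets only make it worse.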

So when Linux has so little to go on, it just prints a message ("Uhhuh.
NMI received. Dazed and confused, but trying to continue" or something
very similar), and hopes the user does the right thing. The right
thing, btw, is usually to get your memory chips exchanged: before this
you might want to move them around to pinpoint the bad one (if there is
just one bad one) and make sure that they are correctly seated.
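
In outline, "print a message and keep going" amounts to something like
the sketch below. This is a user-space stand-in, not the kernel's
handler: report_nmi() and nmi_count are invented names, and the message
text is the one quoted above, which may not match the kernel's wording
exactly.

```c
#include <assert.h>
#include <stdio.h>

static unsigned long nmi_count;   /* hypothetical counter, illustration only */

/*
 * Stand-in for an NMI handler: report the event and return, so that
 * whatever was running simply continues.  Returns the number of NMIs
 * reported so far.
 */
unsigned long report_nmi(FILE *out)
{
    assert(out != NULL);
    nmi_count++;
    fprintf(out, "Uhhuh. NMI received. Dazed and confused, "
                 "but trying to continue\n");
    return nmi_count;
}
```

The point is what the function does not do: it makes no attempt to
locate the fault or to halt, because there is nothing reliable to act
on.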

(On some machines, an NMI might be harmless: it's possible that a
portable would use NMI for some power-saving things, for example. I
wouldn't know.)

                Linus