From: dmc@kamal.austin.ibm.com (Dave McCracken)
Subject: Re: Are any SIMMs cheap these day$ ?
Date: Wed, 11 Aug 1993 14:58:05 GMT

In article <2490d8$msm@klaava.Helsinki.FI> torvalds@klaava.Helsinki.FI
(Linus Torvalds) writes:
>
> If it's a hard error, you can naturally just try to access every piece
> of RAM in the system when you get an NMI and hope you get another one
> and can look where that happened. Even then, caching etc can foil this
> plan, so it's not as simple as it was on early PC's. And hard errors
> are seldom the big problem anyway: if it's truly hard, it should have
> been caught by the BIOS POST routines in any case.
The SVR4 kernel does the 'walk through all of memory' trick in an attempt
to find the offending location. The theory here is that many bad bits
will remain bad until rewritten, even though writing them would make them
work again for a while. In fact, I observed several machines that would
exhibit this behavior. They would pass all memory diagnostics, but
occasionally would fail with NMI, and the scan routine in the NMI handler
would find the offending location. So this theory does appear to have
some merit.
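To make the scan idea concrete, here is a minimal sketch. It is a userspace
simulation, not the SVR4 code: struct sim_ram, sim_read, sim_write,
sim_take_parity_error and scan_for_bad_word are all names I made up for
illustration, and the "parity latch" is just a flag in a struct. It assumes
the model described above: a bad word trips the error on every read until
it is rewritten.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_WORDS 16

/* Hypothetical model of RAM with at most one bad word (assumption). */
struct sim_ram {
    uint32_t words[SIM_WORDS];
    long bad_index;   /* index of the word with bad parity, or -1 for none */
    int latched;      /* stands in for the hardware's parity-error latch */
};

/* Reading the bad word latches a parity error, as a real NMI source would. */
uint32_t sim_read(struct sim_ram *ram, size_t i)
{
    if ((long)i == ram->bad_index)
        ram->latched = 1;
    return ram->words[i];
}

/* Rewriting the bad word makes it read cleanly again (for a while). */
void sim_write(struct sim_ram *ram, size_t i, uint32_t value)
{
    ram->words[i] = value;
    if ((long)i == ram->bad_index)
        ram->bad_index = -1;
}

/* Return whether an error was latched, and clear the latch. */
int sim_take_parity_error(struct sim_ram *ram)
{
    int was_latched = ram->latched;
    ram->latched = 0;
    return was_latched;
}

/* Read each word in turn and return the index of the first one whose read
 * latches a parity error, or -1 if the whole scan is clean. The scan only
 * reads, because a write could hide the fault it is looking for. */
long scan_for_bad_word(struct sim_ram *ram, size_t nwords)
{
    assert(nwords <= SIM_WORDS);
    sim_take_parity_error(ram);          /* discard any stale error */
    for (size_t i = 0; i < nwords; i++) {
        (void)sim_read(ram, i);          /* the value itself is irrelevant */
        if (sim_take_parity_error(ram))
            return (long)i;
    }
    return -1;
}
```

With bad_index set to some word, scan_for_bad_word returns that index; after
a sim_write to the same word the scan comes back clean, which is the
"works again for a while" case. On real hardware the caches would have to be
bypassed or flushed for the reads to reach the RAM at all, which is the
complication Linus mentions.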
> So when linux has so little to go on, it just prints a message ("Uhhuh.
> NMI received. Dazed and confused, but trying to continue" or something
> very similar), and hopes the user does the right thing.
If bad data has been fetched from memory, I think it's a bad idea to continue
running. The bad data may be a critical pointer, which could lead to things
like wrong data being written to the file system, or some such nasty error.
I believe the SVR4 response of panic is appropriate.
> (on some machines, a NMI might be harmless: it's possible that a
> portable would use NMI for some powersaving things, for example. I
> wouldn't know)
You are indeed correct. Some notebooks use NMI to report things like
imminent power loss, etc.