memtester result with ECC disabled
memtester version 4.0.8 (64-bit)
Copyright (C) 2007 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).
pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 7000MB (7340032000 bytes)
got 7000MB (7340032000 bytes), trying mlock ...locked.
Loop 1:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : testing 123FAILURE: 0x7b7b7b7b7b7b7b7b != 0x7a7b7b7b7b7b7b7b at offset 0x062b25e3.
Checkerboard : ok
....
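To see how many bits differ in the Block Sequential failure above, the two reported values can be XORed. This is a minimal sketch; the variable names are illustrative, and the values are copied from the FAILURE line.

```python
# The two 64-bit words memtester reported as unequal in the FAILURE line.
a = 0x7b7b7b7b7b7b7b7b
b = 0x7a7b7b7b7b7b7b7b

# XOR leaves a 1 in every bit position where the two words disagree.
diff = a ^ b
flipped_bits = [i for i in range(64) if (diff >> i) & 1]

print(f"diff mask    : {diff:#018x}")
print(f"flipped bits : {flipped_bits}")
```

Only one bit differs between the two words, which is the kind of error a single-error-correcting ECC would normally correct when enabled.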
on loading a large amount of memory
EDAC k8 MC1: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
EDAC MC0: CE page 0xfc5b, offset 0x7d0, grain 8, syndrome 0xf654, row 2, channel 1, label "": k8_edac
EDAC k8 MC1: extended error code: ECC chipkill x4 error
EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
EDAC MC0: CE page 0x1cfc16, offset 0x6c0, grain 8, syndrome 0x4472, row 0, channel 1, label "": k8_edac
EDAC k8 MC0: extended error code: ECC chipkill x4 error
EDAC k8 MC1: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
EDAC MC0: CE page 0xea59, offset 0x830, grain 8, syndrome 0xf654, row 2, channel 1, label "": k8_edac
EDAC k8 MC1: extended error code: ECC chipkill x4 error
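A first step toward narrowing down the faulty module is to group the CE lines above by memory controller, row, and channel. The sketch below does that with a regular expression written for the message format shown here. It assumes 4096-byte pages (as memtester reported) to rebuild a physical address from page and offset. The names `parse_ce` and `CE_RE` are illustrative, and how a row/channel pair maps to a physical DIMM slot depends on the motherboard.

```python
import re
from collections import Counter

PAGE_SHIFT = 12  # assumption: 4096-byte pages, as in the memtester output

# Matches the "EDAC MCn: CE page ..." lines shown in the log above.
CE_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-fA-F]+), "
    r"offset (?P<offset>0x[0-9a-fA-F]+), grain (?P<grain>\d+), "
    r"syndrome (?P<syndrome>0x[0-9a-fA-F]+), row (?P<row>\d+), "
    r"channel (?P<channel>\d+)"
)

def parse_ce(line):
    """Return the fields of a correctable-error line, or None if no match."""
    m = CE_RE.search(line)
    if m is None:
        return None
    page = int(m["page"], 16)
    offset = int(m["offset"], 16)
    return {
        "mc": int(m["mc"]),
        "row": int(m["row"]),
        "channel": int(m["channel"]),
        "syndrome": int(m["syndrome"], 16),
        # page is a page frame number; offset is the byte offset within it
        "address": (page << PAGE_SHIFT) | offset,
    }

# Sample lines copied from the log above.
LINES = [
    'EDAC MC0: CE page 0xfc5b, offset 0x7d0, grain 8, syndrome 0xf654, row 2, channel 1, label "": k8_edac',
    'EDAC k8 MC1: extended error code: ECC chipkill x4 error',
    'EDAC MC0: CE page 0x1cfc16, offset 0x6c0, grain 8, syndrome 0x4472, row 0, channel 1, label "": k8_edac',
    'EDAC MC0: CE page 0xea59, offset 0x830, grain 8, syndrome 0xf654, row 2, channel 1, label "": k8_edac',
]

events = [e for e in map(parse_ce, LINES) if e is not None]
counts = Counter((e["mc"], e["row"], e["channel"]) for e in events)

for (mc, row, channel), n in counts.most_common():
    print(f"MC{mc} row {row} channel {channel}: {n} corrected error(s)")
```

In these three sample lines every error is on channel 1, and two of them share row 2 and the same syndrome.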
identify the DIMM
Re: EDAC chipkill messages
Machine Check Exception
machine check events
what does it mean?
to show the contents
$ /usr/sbin/mcelog
[x86_64] how worried should I be about MCEs?
EDAC options in kernel and bios
EDAC Project
Chipkill Advanced ECC - Overview of How It Works
Speed vs. Precision
BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD Opteron Processors
corrected ecc error
--------------------
| An Overview of ECC |
--------------------
Introduction
------------
The scope of this discussion is limited to soft and hard errors that
occur in memory and how they are reported by Solaris. It does not
account for errors that occur while data travels through the E10000
interconnect, CPU Module, or I/O. For this discussion, soft errors
are transient or temporary errors in memory that can be corrected by
rewriting the affected memory cell. Hard errors occur when a cell
is permanently damaged and cannot hold the correct information. With
a hard error, the cell can be permanently stuck at "0" or "1".
ECC Concepts
------------
Any volatile storage medium, whether it be the Dynamic Random Access
Memory (DRAM) used on main memory DIMMs or Static Random Access Memory
(SRAM) mainly used for caches, is subject to occasional natural
incidences of data loss due to the impact of alpha particles or cosmic
rays. This data loss manifests itself in the changing of the value
stored in the memory cell affected by the collision. Typically only a
single bit is affected, but there is a small probability that multiple
cells can be upset.
When a bit flips due to this phenomenon, it is referred to as a soft
error. This is to distinguish it from a hard error resulting from a
hardware failure. These soft errors happen at a rate, called the soft
error rate (SER), that can be predicted as a function of the memory
density, the memory technology, and the altitude of the system in which
the memory resides.
ECC was invented to allow survival from these naturally occurring
losses of data. The ECC method used on the E10000 is called a Single
Error Correcting, Double Error Detecting code (SEC-DED). The concept is
that every word of data is written to memory along with a number of
extra check bits. When the word is read back from memory, a fresh set
of check bits is computed and compared with the check bits that were
stored in memory. The result of this comparison is called the syndrome.
If the syndrome is zero, the two sets of check bits match, and thus the
data is good. A non-zero syndrome means the data is in error, and the
syndrome is used to find a single bit in error and correct it. A
single bit error is called a Correctable Error (CE). The syndrome can
also detect if two bits are in error, but it does not have enough
information to identify which two bits. This type of error is called
an Uncorrectable Error (UE). UltraSPARC microprocessors use a SEC-DED
variant called S4ED that can also detect, but not correct, three- or
four-bit errors if they are clustered within a four-bit nibble.
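The SEC-DED idea described above can be illustrated with a toy code. The sketch below uses an extended Hamming (8,4) code: 4 data bits, 3 check bits, and 1 overall parity bit. This is an assumption chosen for brevity, not the code used on the E10000 or in UltraSPARC, which protect much wider words. The function names `encode` and `decode` are illustrative.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of bits, index = position)."""
    code = [0] * 8
    # Data bits go in the positions that are not powers of two.
    for i, pos in enumerate((3, 5, 6, 7)):
        code[pos] = (nibble >> i) & 1
    # The check bit at position p covers every position whose index has bit p set.
    for p in (1, 2, 4):
        for pos in (3, 5, 6, 7):
            if pos & p:
                code[p] ^= code[pos]
    # Position 0 holds an overall parity bit so the whole word has even parity.
    code[0] = sum(code[1:]) % 2
    return code

def decode(code):
    """Return (status, nibble); status is 'ok', 'CE' or 'UE'."""
    code = list(code)
    # XOR of the positions of all set bits. This equals the stored check bits
    # XORed with freshly computed ones, so it is zero for a valid codeword.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) % 2  # 1 means an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Single-bit error: the syndrome is the position of the bad bit
        # (0 means the overall parity bit itself). Flip it back.
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with even overall parity: two bits are wrong,
        # and the syndrome cannot say which two.
        return "UE", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble

word = encode(0b1011)
print(decode(word))      # clean read

one_flip = list(word)
one_flip[5] ^= 1
print(decode(one_flip))  # single-bit error, corrected

two_flips = list(one_flip)
two_flips[2] ^= 1
print(decode(two_flips)) # double-bit error, detected only
```

With three or more flipped bits this toy code can miscorrect or miss the error entirely, which is the gap that variants such as S4ED narrow for errors confined to one nibble.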