Back in the late seventies, engineers started noticing that DRAM very occasionally developed random single-bit errors. The problem was that, unless the computer crashed, there was no way to detect that this had happened, and it would often lead to bizarre results in data processing. Scientists at various RAM companies came to the conclusion that the usual cause was simply background radiation energising a single cell in the memory and flipping the bit stored in it.

How Parity Works

By the era of 30-pin (8-bit) SIMMs, it was essential to be able to detect if and when this happened. Therefore, rather than use eight 1-bit-wide chips on a module, the designers added a ninth and stored a parity bit for each byte in it. This allowed single-bit memory errors to be detected easily, as the parity bit would no longer match the data. This practice continued with 72-pin (32-bit) SIMMs, which had a parity bit added to each byte (i.e. a parity SIMM was 36 bits wide).
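As a rough illustration of the ninth bit, here is a minimal Python sketch of per-byte parity. It assumes even parity (the stored bit makes the total number of 1s even); real modules and chipsets could use either even or odd parity, and the function names are made up for this example.

```python
def parity_bit(byte):
    """Even parity: return 1 if the byte has an odd number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2


def store(byte):
    """Return the byte together with the parity bit the ninth chip would hold."""
    return byte & 0xFF, parity_bit(byte)


def check(byte, parity):
    """True if the stored parity bit still matches the data read back."""
    return parity_bit(byte) == parity


data, p = store(0b10110010)

# Data read back unchanged: the parity check passes.
intact_ok = check(data, p)

# One cell flips (bit 2 here): the parity check now fails.
corrupted = data ^ 0b00000100
corrupted_ok = check(corrupted, p)
```

Here `intact_ok` is `True` and `corrupted_ok` is `False`: the error is noticed, but nothing says which of the eight bits flipped.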

High end manufacturers like Sun and SGI had by this time found the two primary limitations of Parity on memory modules:

  1. Parity can only reliably detect a single error. Any even number of flipped bits leaves the parity bit looking correct, so a multi-bit error stands roughly a 50% chance of going undetected. This wasn’t a major problem until memory ICs more than one bit wide started appearing, but it rapidly became one.
  2. Parity can only detect errors; it can’t correct them.
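The first limitation can be checked by brute force. The sketch below (again assuming even parity, with illustrative names) flips every possible combination of one to four bits in a sample byte and records whether the parity check ever notices.

```python
from itertools import combinations


def parity_bit(byte):
    """Even parity: return 1 if the byte has an odd number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2


def detected(original, corrupted):
    """True if the parity stored for `original` no longer matches `corrupted`."""
    return parity_bit(original) != parity_bit(corrupted)


original = 0b10110010
results = {}
for n_flips in range(1, 5):
    outcomes = set()
    # Try every way of flipping n_flips distinct bits of the byte.
    for positions in combinations(range(8), n_flips):
        mask = sum(1 << pos for pos in positions)
        outcomes.add(detected(original, original ^ mask))
    results[n_flips] = outcomes
```

For this byte, `results[1]` and `results[3]` contain only `True`, while `results[2]` and `results[4]` contain only `False`: an odd number of flips is always caught and an even number never is, which is where the coin-flip odds for a random multi-bit error come from.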

Therefore these manufacturers put a second parity chip on custom modules and used it to store parity for each 16-bit combination of two adjoining bytes. This addressed the first issue but not the second.

ECC Memory Module

ECC modules have an odd number of chips - in this case 9.

By the advent of DIMMs, more computing power was available and there was the potential for a new system. Rather than using the extra one bit per byte to store a simple parity bit, the extra byte was instead used to store a Hamming code for the entire 64-bit word. This allowed single-bit errors to be corrected and double-bit errors to be reliably detected.
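To make the idea concrete, here is a minimal Python sketch of a textbook extended Hamming code over a 64-bit word: seven check bits at the power-of-two positions plus one overall parity bit, giving 72 bits in total. The bit layout and function names are assumptions for illustration; real memory controllers typically use their own (often differently arranged) codes.

```python
CHECK_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
# The 64 data bits occupy positions 1..71 that are not powers of two.
DATA_POSITIONS = [p for p in range(1, 72) if p not in CHECK_POSITIONS]


def encode(word):
    """Encode a 64-bit integer as a list of 72 bits (position 0 = overall parity)."""
    code = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (word >> i) & 1
    for c in CHECK_POSITIONS:
        # Check bit c makes parity even over all positions with bit c set.
        code[c] = sum(code[p] for p in range(1, 72) if p & c) % 2
    code[0] = sum(code) % 2  # overall parity over the other 71 bits
    return code


def decode(code):
    """Return (word, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = 0
    for p in range(1, 72):
        if code[p]:
            syndrome ^= p  # XOR of set positions is 0 for a valid codeword
    overall = sum(code) % 2
    code = list(code)  # work on a copy so the caller's list is untouched
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < 72:
        # Odd overall parity: treat as a single flipped bit at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    word = sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return word, status


word = 0xDEADBEEFCAFEF00D
stored = encode(word)

one_flip = list(stored)
one_flip[17] ^= 1          # single-bit error: corrected

two_flips = list(stored)
two_flips[17] ^= 1
two_flips[50] ^= 1         # double-bit error: detected, not corrected

single_result = decode(one_flip)
double_result = decode(two_flips)
```

In this sketch `single_result` is `(word, "corrected")` and `double_result` is `(None, "uncorrectable")`. Three or more flipped bits can fool a code like this, which is why it is usually described as single-error-correcting, double-error-detecting.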

Unless a component physically fails, it is therefore unlikely that the system will encounter an error that will cause it to crash. Indeed, the only time that most users of machines with ECC will know that it has encountered an error is if they check the BIOS log.

There still remains, however, a major issue that ECC memory doesn’t resolve. Should more than one bit in the same word be corrupted, the error can be detected but not corrected, so it’s still reset time. And a single failing chip is more likely to produce a two-bit error than two separate chips are to develop errors at the same time.

A demonstration of interleaved memory

Interleaved memory distributes the bits of each byte to multiple chips and/or modules to increase resilience.

Interleaving is one way of reducing the chances of this happening. Here, the bits are written across one or more modules in a different order from the one you would expect, so that a fault in any one chip touches as few bits of the same word as possible and is less likely to cause an irrecoverable error.
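The effect is easy to show with a toy model. The Python sketch below compares two made-up layouts for eight 72-bit ECC words stored on 72 “by 8” chips (for example, eight 9-chip modules) and counts how many bits of each word are lost when one chip dies. The layouts, chip counts and names are purely hypothetical; real memory controllers use their own mappings.

```python
CODEWORD_BITS = 72   # 64 data bits + 8 check bits
CHIP_WIDTH = 8       # "by 8" chips: 8 data pins each
CHIPS = 72           # hypothetical: eight modules of nine chips


def naive_layout(chip, pin):
    """Each word lives on 9 adjacent chips, 8 bits of the word per chip."""
    word = chip // 9
    bit = (chip % 9) * CHIP_WIDTH + pin
    return word, bit


def interleaved_layout(chip, pin):
    """The chip number picks the bit position; the pin picks the word."""
    return pin, chip


def damage(layout, failed_chip):
    """Count corrupted bits per word when every pin of one chip goes bad."""
    hits = {}
    for pin in range(CHIP_WIDTH):
        word, _bit = layout(failed_chip, pin)
        hits[word] = hits.get(word, 0) + 1
    return hits


naive = damage(naive_layout, failed_chip=13)
interleaved = damage(interleaved_layout, failed_chip=13)
```

With the naive layout, `naive` is `{1: 8}`: one word loses eight bits, far more than a single-error-correcting code can repair. With the interleaved layout, each of the eight words loses exactly one bit, which is within reach of the correction described earlier.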

Advanced ECC (HP’s term; IBM’s comparable technology is called Chipkill) is an extension of this principle which relies on using “by 8” chips (so the module is exactly 9 chips wide) and interleaving both the data and the Hamming codes. This allows an entire chip on the module to fail without the loss of any data.

Ultimately, for most users, ECC is of limited benefit as the chances of encountering recoverable memory errors are minutely small. However, when data is critical, such as in a server or high-end workstation, the extra cost is easily justified by the security it brings.
