Error Detection and Correction

The basic problem we have to resolve is that memory and communications technology isn't totally reliable; we have to expect and be ready to deal with errors in the hardware. This document will describe two very different technologies for detecting, and maybe correcting, errors that may occur in data storage and transmission.

The first approach to be described is more appropriate for environments like memory: a relatively small amount of data is fetched in parallel. This approach, called "error detecting and correcting codes," is based on defining a distance between two bit strings in terms of the number of bits that have to change to get from the first string to the second. Extra bits are added to each string, which are set so that some minimum number of bits must change to get from one valid string to another. If the received string isn't valid, it is assumed that the correct string is the one "closest" to the received string.

The second approach is more appropriate to environments in which relatively large amounts of data are to be transferred, but they are transferred serially. In this approach a "signature" is appended to the data string; the number of bits in the signature is much less than the number of bits that would be required to do an error correcting code. This approach will lead to adding checksums or cyclic redundancy checks to the string.

Error Detecting and Correcting Codes

R. W. Hamming wrote the paper that both opened and closed this field in 1950. His interest was in providing a means of self-checking in computers, which were just being developed at the time he wrote this. The paper appeared in the Bell System Technical Journal, April, 1950. It is definitely worth tracking down in the library and reading.

Bit Strings as Addresses in Binary Hypercubes

The best starting point for understanding error correcting codes is to consider bit strings as addresses in a binary hypercube. A hypercube is a generalization of a cube to various dimensions; we're probably most familiar with the notion of a four-dimensional hypercube. Here's a picture of binary hypercubes for several different dimensionalities:

Each of them was created by copying the one to the left twice, and connecting corresponding vertices.

We can assign each vertex in a hypercube a location in a coordinate space determined by the dimensionality of the hypercube.

A zero-dimensional hypercube requires no coordinates to know where you are.
A one-dimensional hypercube can use one bit to tell whether you're at the bottom or the top of the line segment.
A two-dimensional hypercube can use two bits: first bit is left vs. right, second is inherited from the line.
A three-dimensional hypercube can use a bit to tell front square from back, and inherit two bits from the square.
A four-dimensional hypercube can use a bit to tell left cube from right, and inherits three bits from cube.

This can continue through as many dimensions as you want.
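The copy-and-connect construction can be sketched in a few lines of Python. This is only an illustration: the function name hypercube and the choice to represent each vertex as a bit string are assumptions made here, not anything from Hamming's paper.

```python
def hypercube(dim):
    """Return (vertices, edges) of a binary hypercube of the given dimension.

    Each vertex is a bit string; each edge is a pair of vertices.
    """
    if dim == 0:
        return [""], []          # a single point, no coordinates needed
    verts, edges = hypercube(dim - 1)
    # Copy the lower-dimensional hypercube twice, tagging each copy with a new bit.
    new_verts = ["0" + v for v in verts] + ["1" + v for v in verts]
    new_edges = [("0" + a, "0" + b) for a, b in edges]
    new_edges += [("1" + a, "1" + b) for a, b in edges]
    # Connect corresponding vertices of the two copies.
    new_edges += [("0" + v, "1" + v) for v in verts]
    return new_verts, new_edges


if __name__ == "__main__":
    verts, edges = hypercube(3)
    print(verts)
    print(len(edges), "edges")
```

Every edge produced this way joins two addresses that differ in exactly one bit.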

The Hamming distance between two bit strings is the number of bits you have to change to convert one to the other: this is the same as the number of edges you have to traverse in a binary hypercube to get from one of the vertices to the other. The basic idea of an error correcting code is to use extra bits to increase the dimensionality of the hypercube, and make sure the Hamming distance between any two valid points is greater than one.

If the Hamming distance between valid strings is only one, a single-bit error results in another valid string. This means we can't detect an error.
If it's two, then changing one bit results in an invalid string, and can be detected as an error. Unfortunately, changing just one more bit can result in another valid string, which means we can't know which bit was wrong: so we can detect an error but not correct it.
If the Hamming distance between valid strings is three, then changing one bit leaves us only one bit away from the original string, but two bits away from any other valid string. This means if we have a one-bit error, we can figure out which bit is the error; but if we have a two-bit error, it looks like a one-bit error from the other direction. So we can have single bit correction, but that's all.
Finally, if the Hamming distance is four, then we can correct a single-bit error and detect a double-bit error. This is frequently referred to as a SECDED (Single Error Correct, Double Error Detect) scheme.
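The four cases above follow one pattern, which can be checked with a small sketch. The helper names here (hamming_distance, minimum_distance, capability) are made up for illustration, and the formula in capability is the usual textbook one: with minimum distance d we can correct t = (d-1)//2 errors and still detect up to d-1-t.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(code):
    """Smallest Hamming distance between any two distinct valid strings."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))


def capability(d):
    """Return (correctable, detectable) error counts for minimum distance d."""
    correctable = (d - 1) // 2
    detectable = d - 1 - correctable
    return correctable, detectable


if __name__ == "__main__":
    for d in range(1, 5):
        print(d, capability(d))
```

For d = 4 this gives one correctable and two detectable errors, which is the SECDED case.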

So... now we have to think about how to increase the Hamming distance between valid strings.

Parity

The simplest case is adding a parity bit. Suppose we have a three-bit word (so the bit strings define points in a cube). If we add a fourth bit, we can decree that any time we want to switch a bit in the original three-bit string, we also have to switch the parity bit. If we start with 000 in the left cube, so the full string is 0000, changing any one of the original three bits requires us to change to the other cube: 1001, 1010, and 1100. Now if we change a second bit, we have to move back to the left cube: 0011, 0101, 0110. And if we change the third bit, we move back to the right cube: 1111.

So, there is a Hamming distance of two between any two valid strings. If we get a one-bit error, we know it is an error because it's on one of the invalid vertices.

This can be computed by counting the number of 1's, and making sure it's always even (so this is called even parity). We could have selected exactly the opposite set of vertices as the valid ones, which would have given us odd parity. We picked even parity because we'll be using it in the next step.
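Here is a minimal sketch of even parity on a three-bit word, assuming (as in the example above) that the parity bit is written as the leftmost bit. The function names are illustrative only.

```python
def add_even_parity(data):
    """Prefix a parity bit so that the total number of 1's is even."""
    parity = data.count("1") % 2
    return str(parity) + data


def parity_ok(word):
    """True if the word (data plus parity bit) has an even number of 1's."""
    return word.count("1") % 2 == 0


if __name__ == "__main__":
    codewords = [add_even_parity(format(i, "03b")) for i in range(8)]
    print(codewords)
    # Flipping any single bit of a valid word makes the parity check fail.
    word = codewords[5]
    for i in range(len(word)):
        flipped = word[:i] + ("1" if word[i] == "0" else "0") + word[i + 1:]
        print(flipped, parity_ok(flipped))
```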

Error Correction

The weakness of the parity scheme is that we can tell we had an error, but we can't know which bit is wrong. If we use enough extra bits, we can tell not only that a bit is wrong, but which one it is. Since we need to have enough check bits to spot both an error in the data and in the check bits themselves (after all, they aren't going to be perfect either), we need roughly (log₂ n) + 1 check bits for n data bits (Hamming derives this result much, much more carefully in his paper). The basic idea in what follows is that we'll define one subset of the data bits for each check bit, where each subset contains roughly half of all the bits, and compute the even parity of each subset. If we have an error, we'll be able to tell which bit has the error because it will be uniquely determined by the set of subsets that turn up with bad parity.

(Note: in Hamming's paper, the following appears just as unmotivated as it does here. I really have no idea how he derived this technique; he does show that it actually does establish the needed distance between valid bit strings.) We'll put the check bits in bit positions which are powers of two, and intersperse the data bits between them. Here's what it looks like if we have eight data bits:

Bit Position	Position Number	Check Bit	Data Bit
12	1100		M8
11	1011		M7
10	1010		M6
9	1001		M5
8	1000	C8
7	0111		M4
6	0110		M3
5	0101		M2
4	0100	C4
3	0011		M1
2	0010	C2
1	0001	C1

Here's how we find the subsets: The data bit positions which contain a 1 in the bit corresponding to a check bit number are used in calculating that check bit. So, looking at the table, data bits M1, M2, M4, M5, and M7 are in rows 3, 5, 7, 9, and 11; those row numbers all contain 2⁰; those data bits are used in calculating check bit C1. We simply set C1 as having the parity of its data bits.

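Looking at all the check bits, we get:

Check Bit	Data Bits Used
C1	M1, M2, M4, M5, M7
C2	M1, M3, M4, M6, M7
C4	M2, M3, M4, M8
C8	M5, M6, M7, M8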

Now: if we get an error, the parity will be wrong for all of the sets based on that bit. The numbers of the check bits that turn up wrong add up to the bit number of the error! For instance, if C1 and C4 both turn up wrong, the error is in bit 5 (M2).
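The scheme can be sketched in Python for the eight-data-bit layout in the table. This is an illustrative sketch, not Hamming's own formulation: the word is held as a dictionary from bit position to bit value, and the names encode, syndrome and correct are chosen here for convenience.

```python
DATA_POSITIONS = [3, 5, 6, 7, 9, 10, 11, 12]   # where M1..M8 live
CHECK_POSITIONS = [1, 2, 4, 8]                 # C1, C2, C4, C8


def encode(data_bits):
    """Build a 12-bit word (dict: position -> bit) from [M1, ..., M8]."""
    word = dict(zip(DATA_POSITIONS, data_bits))
    for c in CHECK_POSITIONS:
        # Even parity over the data positions whose number contains bit c.
        word[c] = sum(word[p] for p in DATA_POSITIONS if p & c) % 2
    return word


def syndrome(word):
    """Add up the numbers of the check bits whose group has bad parity."""
    total = 0
    for c in CHECK_POSITIONS:
        if sum(bit for p, bit in word.items() if p & c) % 2:
            total += c
    return total


def correct(word):
    """Flip the bit named by the syndrome (assumes at most one bit is wrong)."""
    fixed = dict(word)
    s = syndrome(word)
    if s in fixed:
        fixed[s] ^= 1
    return fixed


if __name__ == "__main__":
    word = encode([1, 0, 1, 1, 0, 0, 1, 0])
    damaged = dict(word)
    damaged[6] ^= 1                      # corrupt bit position 6 (M3)
    print(syndrome(damaged))             # the position of the bad bit
    print(correct(damaged) == word)
```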

We can combine ECC with parity: we take the parity over all the bits in the word (including the check bits). In our bit numbering scheme, we consider the parity bit to be bit 0000.

So, when we look at the parity and check bits, we get the following results:

If the parity is correct and the check bits are correct, our data is correct.
If the parity is incorrect, the check bits indicate which bit is wrong. If the check bits indicate that the error is in bit 0000, it's the parity bit itself that is incorrect.
If the parity is correct but the check bits indicate an error, there is a two-bit error. This can't be corrected.
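The three cases can be sketched as a decoder. As before this is only a sketch under assumptions: the word is a 13-element list indexed by bit position, with index 0 holding the overall parity bit, and the status strings are invented for the example.

```python
CHECKS = (1, 2, 4, 8)
DATA_POS = (3, 5, 6, 7, 9, 10, 11, 12)


def encode_secded(data_bits):
    """Return a 13-bit list: index 0 is overall parity, 1..12 as in the table."""
    w = [0] * 13
    for pos, bit in zip(DATA_POS, data_bits):
        w[pos] = bit
    for c in CHECKS:
        w[c] = sum(w[p] for p in DATA_POS if p & c) % 2
    w[0] = sum(w[1:]) % 2                 # even parity over the whole word
    return w


def decode_secded(w):
    """Return (status, word) following the three cases in the text."""
    syn = sum(c for c in CHECKS
              if sum(w[p] for p in range(1, 13) if p & c) % 2)
    parity_bad = sum(w) % 2 == 1
    if not parity_bad and syn == 0:
        return "ok", list(w)
    if parity_bad:
        if syn > 12:                      # points outside the word: 3+ errors
            return "uncorrectable", list(w)
        fixed = list(w)
        fixed[syn] ^= 1                   # syn == 0 means the parity bit itself
        return "corrected", fixed
    return "double error", list(w)


if __name__ == "__main__":
    word = encode_secded([1, 0, 1, 1, 0, 0, 1, 0])
    one = list(word); one[9] ^= 1
    two = list(word); two[9] ^= 1; two[2] ^= 1
    print(decode_secded(one)[0], decode_secded(two)[0])
```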

Checksums and Cyclic Redundancy Checks

This technique has seen a lot more development, by a lot more authors, than error correcting codes have. The basic technique as described here appeared in a paper by Peterson and Brown, which appeared in the January, 1961 issue of the Proceedings of the Institute of Radio Engineers (the IRE was, of course, a predecessor organization to the IEEE). Much has been done since on selecting good CRCs.

Once again, we'll start by defining a simple technique, and then define a more complex one that works better.

Checksums

Suppose we have a fairly long message, which can reasonably be divided into shorter words (a 128 byte message, for instance). We can introduce an accumulator with the same width as a word (one byte, for instance), and as each word comes in, add it to the accumulator. When the last word has been added, the contents of the accumulator are appended to the message (as a 129th byte, in this case). The added word is called a checksum.

Now, the receiver performs the same operation, and checks the checksum. If the checksums agree, we assume the message was sent without error.
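A minimal sketch of the sender and receiver sides, for one-byte words. One assumption is made explicit here because the description above doesn't say: when the one-byte accumulator overflows, the carry is simply thrown away.

```python
def checksum8(message):
    """Add every byte into a one-byte accumulator (carries out are dropped)."""
    acc = 0
    for byte in message:
        acc = (acc + byte) & 0xFF
    return acc


def append_checksum(message):
    """Sender: return the message with its checksum appended as a final byte."""
    return message + bytes([checksum8(message)])


def verify(frame):
    """Receiver: recompute the checksum and compare it with the last byte."""
    return checksum8(frame[:-1]) == frame[-1]


if __name__ == "__main__":
    frame = append_checksum(bytes(range(128)))   # 128 bytes in, 129 out
    print(len(frame), verify(frame))
    damaged = bytes([frame[0] ^ 0x04]) + frame[1:]
    print(verify(damaged))
```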

A related approach: instead of performing an actual addition, we can just do a bitwise exclusive-or of the new word with the accumulator. If we do this, we calculate a vertical parity on the data. Notice that in the special case of a one-bit word, this is equivalent to calculating the parity of the buffer!

Performing a vertical parity has two advantages over a real checksum: it can be performed with less hardware if the data is serial, and it will lead us into performing a CRC.

To see how a vertical parity can be performed with less hardware than a checksum, take a look at the next figure:

This figure shows an eight bit shift register and an exclusive-or gate. Initially, the shift register is filled with 0's. As each bit is put into it, the new bit is exclusive-ored with the contents of the eighth cell in the register. When the entire message has been passed through the shift register, it contains the vertical parity.

Here's an example of passing a 32 bit message through the unit:

00000000 11010110101010010100011101101010

(as we start, the shift register is empty)

00000001 1010110101010010100011101101010
00000011 010110101010010100011101101010   
00000110 10110101010010100011101101010   
00001101 0110101010010100011101101010   
00011010 110101010010100011101101010   
00110101 10101010010100011101101010   
01101011 0101010010100011101101010   
11010110 101010010100011101101010

(at this point, the shift register contains the first byte of the message)

10101100 01010010100011101101010   
01011001 1010010100011101101010   
10110011 010010100011101101010   
01100111 10010100011101101010   
11001111 0010100011101101010   
10011111 010100011101101010   
00111111 10100011101101010   
01111111 0100011101101010

(it now contains the vertical parity of the first two bytes of the message)

11111110 100011101101010
11111100 00011101101010   
11111001 0011101101010   
11110011 011101101010   
11100111 11101101010   
11001110 1101101010   
10011100 101101010   
00111000 01101010

(first three bytes)

01110000 1101010   
11100001 101010   
11000010 01010   
10000101 1010   
00001010 010   
00010100 10   
00101001 0   
01010010

(and the vertical parity of the whole 32 bit message)
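The trace above can be reproduced with a short simulation. This is a sketch: the register is modeled as a Python list with the eighth (leftmost) cell at index 0, and a second function computes the same result byte by byte as a cross-check. Both function names are invented here.

```python
def vertical_parity_shift_register(bits, width=8):
    """Feed a bit string through the shift register, one bit at a time."""
    reg = [0] * width                  # reg[0] is the leftmost (eighth) cell
    for ch in bits:
        new_bit = int(ch) ^ reg[0]     # incoming bit XOR the bit shifting out
        reg = reg[1:] + [new_bit]
    return "".join(str(b) for b in reg)


def vertical_parity_words(bits, width=8):
    """Same result computed word by word with a bitwise exclusive-or."""
    acc = 0
    for i in range(0, len(bits), width):
        acc ^= int(bits[i:i + width], 2)
    return format(acc, "0{}b".format(width))


if __name__ == "__main__":
    message = "11010110101010010100011101101010"
    print(vertical_parity_shift_register(message))
    print(vertical_parity_words(message))
```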

Notice that a checksum or a vertical parity is much more efficient than ECC (in the sense that it doesn't need as many added bits), but it isn't capable of correcting errors.

The problem with checksums is that a 1-bit error in the data changes only one bit of a vertical parity (and often only one bit of a checksum). If you have a burst of noise, the odds are far too good that a second error cancels the first and you'll end up with something that still looks correct, even though it isn't. The next approach, cyclic redundancy checks, "smears" the results of the parity calculations through the signature, reducing the likelihood of that happening.

Mathematical Digression: Modulo-2 Arithmetic

Taking a bitwise exclusive-or in place of performing an addition is an example of "Modulo-2 Arithmetic," which is one form of "polynomial arithmetic." I've seen one author call it "CRC arithmetic."

Modulo-2 arithmetic is an arithmetic scheme; like most of the oddities that mathematicians like to study it seems completely useless to a non-mathematician at first glance but turns out to have some very practical applications. In this case, the practical application is in developing CRC checks.

The basic idea of modulo-2 arithmetic is just that we are working in binary, but we don't have a carry in addition or a borrow in subtraction. This means:

Addition and subtraction become the same operation: just a bit-wise exclusive-or. Because of this, the total ordering we expect of integers is replaced by a partial ordering: one number is greater than another iff its left-most 1 is farther left than the other's. This will have an impact on division, in a moment.
Multiplication is just like multiplication in ordinary arithmetic, except that the adds are performed using exclusive-ors instead of additions.
Division is like long division in ordinary arithmetic, except for two differences: the subtractions are replaced by exclusive-ors, and you can subtract any time the leftmost bits line up correctly (since, by the partial ordering described above, they are regarded as equal in this case).

It'll probably help to show examples of modulo-2 multiplication and division:

Multiplication

Division

        1101
     -------
0110)0101110
     0110
     ----
      0111
      0110
      ----
       0011
       0000
       ----
        0110
        0110
        ----
        0000

Notice that the first subtraction is possible in modulo-2 arithmetic, while it wouldn't be possible in normal arithmetic.
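Both operations are easy to sketch with Python integers standing in for bit strings. The function names are illustrative, and the numbers in the usage lines are the ones from the division example above (0110 into 0101110 gives 1101 with no remainder), run in both directions.

```python
def mod2_mul(a, b):
    """Multiply two bit strings (held as ints), adding with exclusive-or."""
    result = 0
    while b:
        if b & 1:
            result ^= a        # "add" the shifted multiplicand, no carries
        a <<= 1
        b >>= 1
    return result


def mod2_divmod(dividend, divisor):
    """Long division with exclusive-or in place of subtraction.

    Returns (quotient, remainder).
    """
    if divisor == 0:
        raise ZeroDivisionError("modulo-2 division by zero")
    quotient = 0
    width = divisor.bit_length()
    # We can "subtract" whenever the leftmost 1 bits line up.
    while dividend.bit_length() >= width:
        shift = dividend.bit_length() - width
        quotient |= 1 << shift
        dividend ^= divisor << shift
    return quotient, dividend


if __name__ == "__main__":
    q, r = mod2_divmod(0b0101110, 0b0110)
    print(format(q, "04b"), format(r, "04b"))
    print(format(mod2_mul(0b0110, 0b1101), "07b"))
```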

One last thing to say here: most of the time, "modulo-2 addition" of two numbers means adding them and keeping the remainder after dividing by 2, so the answer is 0 or 1. Here, instead, each bit is treated as a coefficient of a polynomial, and the arithmetic is performed on each coefficient modulo 2. Easy to get confused....

Cyclic Redundancy Checks

I'm going to be following some of Peterson & Brown's notation here...

k is the length of the message we want to send, i.e. the number of information bits.
n is the total length of the message we will end up sending: the information bits followed by the check bits. Peterson and Brown call this a code polynomial.
n-k is the number of check bits. It is also the degree of the generating polynomial. The basic (mathematical) idea is that we're going to pick the n-k check digits in such a way that the code polynomial is divisible by the generating polynomial. Then we send the data, and at the other end we look to see whether it's still divisible by the generating polynomial; if it's not then we know we have an error, if it is we hope there was no error.

The way we calculate a CRC is we establish some predefined n-k+1 bit number P (called the Polynomial, for reasons relating to the fact that modulo-2 arithmetic is a special case of polynomial arithmetic). Now we append n-k 0's to our message, and divide the result by P using modulo-2 arithmetic. The remainder is called the Frame Check Sequence. Now we ship off the message with the remainder appended in place of the 0's. The receiver can either recompute the FCS and see if it gets the same answer, or it can just divide the whole message (including the FCS) by P and see if it gets a remainder of 0!

As an example, let's set a 5-bit polynomial of 11001, and compute the CRC of a 16 bit message:

---------------------
11001)10011101010101100000
      11001
      -----
       1010101010101100000
       11001
        ----
        110001010101100000
        11001
         ----
         00011010101100000
            11001
             ----
             0011101100000
               11001
                ----
                 100100000
                 11001
                 -----
                  10110000
                  11001
                  -----
                   1111000
                   11001
                   -----
                     11100
                     11001
                     -----
                      0101

Notice that when I did the division, I didn't bother to keep track of the quotient; we don't care about the quotient. Our only goal here is to get the remainder (0101), which is the FCS.
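The same calculation can be sketched in Python, again using integers as bit strings. The function names are invented for the example; the message and polynomial are the ones used in the long division above.

```python
def mod2_remainder(dividend, divisor):
    """Remainder of modulo-2 long division (the quotient is never kept)."""
    width = divisor.bit_length()
    while dividend.bit_length() >= width:
        dividend ^= divisor << (dividend.bit_length() - width)
    return dividend


def crc_fcs(message, poly):
    """Append n-k zeros to the message, divide by P, return the remainder."""
    check_bits = len(poly) - 1                     # n-k
    padded = int(message, 2) << check_bits
    remainder = mod2_remainder(padded, int(poly, 2))
    return format(remainder, "0{}b".format(check_bits))


def crc_ok(frame, poly):
    """Receiver side: the whole frame must divide evenly by P."""
    return mod2_remainder(int(frame, 2), int(poly, 2)) == 0


if __name__ == "__main__":
    message, poly = "1001110101010110", "11001"
    fcs = crc_fcs(message, poly)
    print(fcs)
    print(crc_ok(message + fcs, poly))
```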

CRCs can actually be computed in hardware using a shift register and some number of exclusive-or gates (sounds a bit like the vertical parity calculation, doesn't it?).

The key insight is that we can perform a subtraction any time there is a 1 in the bit that lines up with the most significant bit of the polynomial, and we can perform that subtraction by performing an exclusive-or of the bits corresponding to 1's in all the other places of the polynomial. This lets us implement the CRC calculation by using a shift register similar to the one for vertical parity.

You can see how it's done by comparing the division we performed above to the circuit in the next figure. The figure shows a shift register; the string to be checked is inserted from the right. Whenever a "1" exits the left side of the shift register, it means there is a 1 in the most significant bit of the part of the dividend we're working with; since we're working in modulo-2 arithmetic, this means we can do a subtraction. What this works out to is:

The most significant bit will be xored away, so it falls off to the left.
For every other bit with a "1" in the divisor, perform an exclusive-or with the corresponding bit in the number being checked.
For bits with a "0" in the divisor, do nothing.
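These three rules can be sketched as a software model of the shift register. This is an assumption-laden illustration rather than a description of any particular circuit: the register is an integer n-k bits wide, bits enter on the right, and the message is followed by n-k zeros as in the hand calculation.

```python
def crc_shift_register(bits, poly):
    """Compute the CRC remainder one bit at a time, as the hardware would."""
    width = len(poly) - 1              # n-k cells in the register
    taps = int(poly[1:], 2)            # the divisor without its leading 1
    mask = (1 << width) - 1
    reg = 0
    for ch in bits + "0" * width:      # message followed by n-k zeros
        out = (reg >> (width - 1)) & 1             # bit exiting on the left
        reg = ((reg << 1) & mask) | int(ch)        # shift, new bit on the right
        if out:
            reg ^= taps                # XOR wherever the divisor has a 1
    return format(reg, "0{}b".format(width))


if __name__ == "__main__":
    print(crc_shift_register("1001110101010110", "11001"))
```

Because the register is only n-k bits wide, the most significant bit of the divisor never has to be stored; it is the bit that falls off the left end.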

The figure below attempts to show this for the example CRC polynomial. Each of the square boxes is a position in the shift register, where a value can be stored. Every round box is a position where we may or may not perform an exclusive-or, depending on the polynomial we're using. You can see the value of the CRC polynomial written above the round boxes.

I keep calling this a polynomial, and writing it as a binary number. Frequently, you'll find a CRC polynomial written in polynomial form; the one we've been using would be written as x⁴ + x³ + x⁰.

So, just a little bit more: there is quite a bit of theory behind choosing a "good" CRC polynomial; the choice of polynomial can be tuned to make sure that any burst of some given length can be caught.

Properties of Cyclic Redundancy Checks

The paper lists a few properties of CRCs, which deserve mention:

If the rightmost place of the generating polynomial were 0, the generating polynomial would be divisible by X. That being the case, any polynomial divisible by P would also be divisible by X, and so the last bit of the check bits would always be 0. That would be useless, so we always have a 1 in the least significant bit of the generating polynomial.
That's a roundabout way of saying that if you're going to have an n-k+1 bit polynomial, the two outlying bits should be 1's; otherwise you've effectively got a shorter polynomial than that.
Any error checking code that can always detect a two-bit error can always correct any one-bit error. In the most ridiculous case, we can just check by flipping every bit of the received message; whenever we flip the wrong bit we get a two-bit error, when we flip the right one we get a 0-bit error. Of course, Hamming's scheme is a lot more clever than this!
Any cyclic code whose generating polynomial is of degree n-k will always detect any burst error of length n-k or less.
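The burst property can be checked by brute force for the small example polynomial 11001 (degree 4). The sketch below is illustrative and the function names are invented: it tries every burst of a given maximum length at every position in the 20-bit frame from the earlier example and reports the ones that slip through.

```python
def mod2_remainder(dividend, divisor):
    """Remainder of modulo-2 long division on ints used as bit strings."""
    width = divisor.bit_length()
    while dividend.bit_length() >= width:
        dividend ^= divisor << (dividend.bit_length() - width)
    return dividend


def undetected_bursts(frame, poly, max_len):
    """List (shift, pattern) for every burst up to max_len that goes unnoticed.

    A burst of length L flips its first and last bit and any of the bits
    in between, so its pattern is an odd number exactly L bits long.
    """
    value, p, n = int(frame, 2), int(poly, 2), len(frame)
    missed = []
    for length in range(1, max_len + 1):
        for pattern in range(1 << (length - 1), 1 << length):
            if not pattern & 1:
                continue
            for shift in range(n - length + 1):
                if mod2_remainder(value ^ (pattern << shift), p) == 0:
                    missed.append((shift, pattern))
    return missed


if __name__ == "__main__":
    frame, poly = "10011101010101100101", "11001"
    print(len(undetected_bursts(frame, poly, 4)))
    print(len(undetected_bursts(frame, poly, 5)))
```

With this polynomial, no burst of four bits or fewer gets through, while a five-bit burst that happens to equal the polynomial itself does.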

There are a few "classic" CRC polynomials of given lengths which are so well established that they've been given names.

Name	As Polynomial	As Number
CRC12	X¹² + X¹¹ + X³ + X² + X + 1	1100000001111
CRC16	X¹⁶ + X¹⁵ + X² + 1	11000000000000101
CRC-CCITT	X¹⁶ + X¹² + X⁵ + 1	10001000000100001
CRC32	X³² + X²⁶ + X²³ + X²² + X¹⁶ + X¹² + X¹¹ + X¹⁰ + X⁸ + X⁷ + X⁵ + X⁴ + X² + X + 1	100000100110000010001110110110111
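Going from the polynomial form to the binary number is mechanical: there is a 1 in bit position e for every term Xᵉ. A small sketch (the function name poly_to_bits is made up for this example):

```python
def poly_to_bits(exponents):
    """Turn a list of exponents into the polynomial's binary number."""
    value = 0
    for e in exponents:
        value |= 1 << e            # a 1 in bit position e for the term X^e
    return format(value, "b")


if __name__ == "__main__":
    print(poly_to_bits([4, 3, 0]))                 # the example polynomial
    print(poly_to_bits([16, 15, 2, 0]))            # CRC16
    print(poly_to_bits([16, 12, 5, 0]))            # CRC-CCITT
    print(poly_to_bits([32, 26, 23, 22, 16, 12, 11, 10,
                        8, 7, 5, 4, 2, 1, 0]))     # CRC32
```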
