Error Detection and Correction
The basic problem we have to resolve is that memory and communications technology isn't totally reliable; we have to expect and be ready to deal with errors in the hardware. This document will describe two very different technologies for detecting, and maybe correcting, errors that may occur in data storage and transmission.
The first approach to be described is more appropriate for environments like memory: a relatively small amount of data is fetched in parallel. This approach, called "error detecting and correcting codes," is based on defining a distance between two bit strings in terms of the number of bits that have to change to get from the first string to the second. Extra bits are added to each string, which are set so that some minimum number of bits must change to get from one valid string to another. If the received string isn't valid, it is assumed that the correct string is the one "closest" to the received string.
The second approach is more appropriate to environments in which relatively large amounts of data are to be transferred, but they are transferred serially. In this approach a "signature" is appended to the data string; the number of bits in the signature is much less than the number of bits that would be required to do an error correcting code. This approach will lead to adding checksums or cyclic redundancy checks to the string.
Error Detecting and Correcting Codes
R. W. Hamming wrote the paper that both opened and closed this field in 1950. His interest was in providing a means of self-checking in computers, which were just being developed at the time he wrote this. The paper appeared in the Bell System Technical Journal, April, 1950. It is definitely worth tracking down in the library and reading.
Bit Strings as Addresses in Binary Hypercubes
The best starting point for understanding ECC codes is to consider bit strings as addresses in a binary hypercube. A hypercube is a generalization of a cube to various dimensions; we're probably most familiar with the notion of a four-dimensional hypercube. Here's a picture of binary hypercubes for several different dimensionalities:
Each of them was created by making two copies of the one to its left, and connecting corresponding vertices.
We can assign each vertex in a hypercube a location in a coordinate space determined by the dimensionality of the hypercube.
- A zero-dimensional hypercube requires no coordinates to know where you are
- A one-dimensional hypercube can use one bit to tell whether you're at the bottom or the top of the line segment.
- A two-dimensional hypercube can use two bits: first bit is left vs. right, second is inherited from the line.
- A three-dimensional hypercube can use a bit to tell front square from back, and inherit two bits from the square.
- A four-dimensional hypercube can use a bit to tell left cube from right, and inherit three bits from the cube.
This can continue through as many dimensions as you want.
The Hamming distance between two bit strings is the number of bits you have to change to convert one to the other: this is the same as the number of edges you have to traverse in a binary hypercube to get from one of the vertices to the other. The basic idea of an error correcting code is to use extra bits to increase the dimensionality of the hypercube, and make sure the Hamming distance between any two valid points is greater than one.
- If the Hamming distance between valid strings is only one, a single-bit error results in another valid string. This means we can't detect an error.
- If it's two, then changing one bit results in an invalid string, and can be detected as an error. Unfortunately, changing just one more bit can result in another valid string, which means we can't know which bit was wrong: so we can detect an error but not correct it.
- If the Hamming distance between valid strings is three, then changing one bit leaves us only one bit away from the original string, but two bits away from any other valid string. This means if we have a one-bit error, we can figure out which bit is in error; but if we have a two-bit error, it looks like a one-bit error from the other direction. So we can have single-bit correction, but that's all.
- Finally, if the Hamming distance is four, then we can correct a single-bit error and detect a double-bit error. This is frequently referred to as a SECDED (Single Error Correct, Double Error Detect) scheme.
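The distance argument above can be checked mechanically. The following is a minimal Python sketch; the function names and the two example codes are chosen here for illustration and are not from Hamming's paper.

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(a, b))

def minimum_distance(code):
    """Smallest Hamming distance between any two distinct valid strings."""
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))

# Illustrative code 1: three data bits plus an even parity bit.
even_parity_code = ["0000", "1001", "1010", "0011",
                    "1100", "0101", "0110", "1111"]
# Illustrative code 2: one data bit repeated three times.
repetition_code = ["000", "111"]

print(minimum_distance(even_parity_code))  # distance 2: detect one error
print(minimum_distance(repetition_code))   # distance 3: correct one error
```

A minimum distance of d lets us detect up to d-1 flipped bits, which is why the parity code detects single errors while the repetition code can also correct them.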
So... now we have to think about how to increase the Hamming distance between valid strings.
The simplest case is adding a parity bit. Suppose we have a three-bit word (so the bit strings define points in a cube). If we add a fourth bit, we can decree that any time we want to switch a bit in the original three-bit string, we also have to switch the parity bit. If we start with 000 in the left cube, so the full string is 0000, changing any one of the original three bits requires us to change to the other cube: 1001, 1010, and 1100. Now if we change a second bit, we have to move back to the left cube: 0011, 0101, 0110. And if we change the third bit, we move back to the right cube: 1111.
So, there is a Hamming distance of two between any two valid strings. If we get a one-bit error, we know it is an error because it's on one of the invalid vertices.
This can be computed by counting the number of 1's, and making sure it's always even (so this is called even parity). We could have selected exactly the opposite set of vertices as the valid ones, which would have given us odd parity. We picked even parity because we'll be using it in the next step.
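As a small sketch of the counting rule, assuming (as in the example strings above) that the parity bit is written in front of the data bits; the function names are made up for this illustration:

```python
def add_even_parity(data):
    """Prefix a parity bit so that the total number of 1's is even."""
    parity = data.count("1") % 2
    return str(parity) + data

def parity_ok(word):
    """A received word is valid only if it holds an even number of 1's."""
    return word.count("1") % 2 == 0

word = add_even_parity("001")
print(word, parity_ok(word))

# Flip one bit: the word lands on an invalid vertex and is detected.
corrupted = word[:2] + ("1" if word[2] == "0" else "0") + word[3:]
print(corrupted, parity_ok(corrupted))
```

Flipping any two bits would bring the count of 1's back to an even number, which is exactly the undetectable case described above.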
The weakness of the parity scheme is that we can tell we had an error, but we can't know which bit is wrong. If we use enough extra bits, we can tell not only that a bit is wrong, but which one it is. Since we need to have enough check bits to spot both an error in the data and in the check bits themselves (after all, they aren't going to be perfect either), we need roughly (log2 n) + 1 bits (Hamming derives this result much, much more carefully in his paper). The basic idea in what follows is that we'll divide the data bits into about log2 n subsets where each subset contains roughly half of all the bits, and compute the even parity of each subset. If we have an error, we'll be able to tell which bit has the error because it will be uniquely determined by the set of subsets that turn up with bad parity.
(Note: in Hamming's paper, the following appears just as unmotivated as it does here. I really have no idea how he derived this technique; he does show that it actually does establish the needed distance between valid bit strings.) We'll put the check bits in bit positions which are powers of two, and intersperse the data bits between them. Here's what it looks like if we have eight data bits:
Here's how we find the subsets: The data bit positions which contain a 1 in the bit corresponding to a check bit number are used in calculating that check bit. So, looking at the table, data bits M1, M2, M4, M5, and M7 are in rows 3, 5, 7, 9, and 11; those row numbers all contain 2^0; those data bits are used in calculating check bit C1. We simply set C1 as having the parity of its data bits.
Looking at all the check bits, we get:
Now: if we get an error, the parity will be wrong for all of the sets based on that bit. The check bits that turn up wrong will be the bit number of the error!
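The placement and subset rules can be sketched in Python as follows. This is an illustration under stated assumptions, not Hamming's own presentation: the function names are invented here, and the labels C1, C2, C4, C8 (check bits named after their positions) are an assumed naming scheme.

```python
def is_power_of_two(pos):
    return pos & (pos - 1) == 0

def layout(n_data):
    """Label positions 1, 2, 3, ...: check bits sit at the powers of two."""
    labels, pos, m = [], 1, 0
    while m < n_data:
        if is_power_of_two(pos):
            labels.append("C%d" % pos)   # assumed label: C + position
        else:
            m += 1
            labels.append("M%d" % m)
        pos += 1
    return labels

def hamming_encode(data_bits):
    n = len(layout(len(data_bits)))
    word = [0] * (n + 1)          # word[0] unused; indices are bit positions
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if not is_power_of_two(pos):
            word[pos] = next(data)
    check = 1
    while check <= n:
        # Even parity over every other position whose number contains `check`.
        covered = [p for p in range(1, n + 1) if p & check and p != check]
        word[check] = sum(word[p] for p in covered) % 2
        check <<= 1
    return word[1:]

def error_position(word):
    """Recompute each check; the failing checks add up to the bad bit number."""
    position, check = 0, 1
    while check <= len(word):
        covered = [p for p in range(1, len(word) + 1) if p & check]
        if sum(word[p - 1] for p in covered) % 2:
            position += check
        check <<= 1
    return position

print(layout(8))
code = hamming_encode([1, 0, 1, 1, 0, 0, 1, 0])
code[6] ^= 1                      # corrupt bit position 7
print(error_position(code))
```

A result of 0 from `error_position` means every check passed; any other value is the number of the bit to flip back.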
We can combine ECC with parity. To do this, we take the parity over all the bits in the word (including the check bits). In our bit numbering scheme, we consider the parity bit to be bit 0000.
So, when we look at the parity and check bits, we get the following results:
- If the parity is correct and the check bits are correct, our data is correct.
- If the parity is incorrect, the check bits indicate which bit is wrong. If the check bits indicate that the error is in bit 0000, it's the parity bit itself that is incorrect.
- If the parity is correct but the check bits indicate an error, there is a two-bit error. This can't be corrected.
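The three cases can be written out as a small decision function. This is a sketch only: the function name, the return values, and the 13-bit example word (overall parity in position 0, check bits in positions 1, 2, 4, and 8) are assumptions made for illustration.

```python
def secded_check(word):
    """word[0] is the overall parity bit; word[1:] holds check and data bits."""
    overall_ok = sum(word) % 2 == 0
    # XOR of the positions of all 1 bits: zero when every check bit agrees,
    # otherwise the number of the single bit that would explain the failure.
    indicated = 0
    for pos in range(1, len(word)):
        if word[pos]:
            indicated ^= pos
    if overall_ok and indicated == 0:
        return ("ok", None)
    if not overall_ok:
        return ("single-bit error", indicated)   # 0 means the parity bit itself
    return ("double-bit error", None)

# A valid word chosen for illustration.
valid = [0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0]
print(secded_check(valid))

one_error = list(valid)
one_error[5] ^= 1
print(secded_check(one_error))

two_errors = list(valid)
two_errors[3] ^= 1
two_errors[9] ^= 1
print(secded_check(two_errors))
```

In the single-error case the caller can simply flip the indicated bit; in the double-error case the best it can do is report the failure.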
Checksums and Cyclic Redundancy Checks
This technique has seen a lot more development, by a lot more authors, than error correcting codes have. The basic technique as described here comes from a paper by Peterson and Brown in the January, 1961 issue of the Proceedings of the Institute of Radio Engineers (the IRE was, of course, a predecessor organization to the IEEE). Much has been done since on selecting good CRCs.
Once again, we'll start by defining a simple technique, and then define a more complex one that works better.
Suppose we have a fairly long message, which can reasonably be divided into shorter words (a 128 byte message, for instance). We can introduce an accumulator with the same width as a word (one byte, for instance), and as each word comes in, add it to the accumulator. When the last word has been added, the contents of the accumulator are appended to the message (as a 129th byte, in this case). The added word is called a checksum.
Now, the receiver performs the same operation, and checks the checksum. If the checksums agree, we assume the message was sent without error.
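A minimal sketch of the sender and receiver sides, assuming one-byte words and assuming that any carry out of the accumulator is simply discarded (the text above does not say what happens on overflow). The function names are invented for this example.

```python
def checksum8(message):
    """Add each byte into a one-byte accumulator, discarding carries."""
    acc = 0
    for byte in message:
        acc = (acc + byte) & 0xFF
    return acc

def append_checksum(message):
    """Sender: append the accumulator as one extra byte."""
    return message + bytes([checksum8(message)])

def verify(frame):
    """Receiver: recompute over the data and compare with the last byte."""
    return checksum8(frame[:-1]) == frame[-1]

frame = append_checksum(bytes(range(128)))   # 128 data bytes + 1 checksum byte
print(len(frame), verify(frame))

damaged = bytearray(frame)
damaged[10] ^= 0x04                          # flip one bit in transit
print(verify(bytes(damaged)))
```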
A related approach would be, instead of performing an actual addition, we can just do a bitwise exclusive-or of the new word with the accumulator. If we do this, we calculate a vertical parity on the data. Notice that in the special case of a one-bit word, this is equivalent to calculating the parity of the buffer!
Performing a vertical parity has two advantages over a real checksum: it can be performed with less hardware if the data is serial, and it will lead us into performing a CRC.
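In software the vertical parity is a one-line fold over the words. The sketch below uses four arbitrary example bytes; `vertical_parity` is a name chosen here.

```python
from functools import reduce
from operator import xor

def vertical_parity(words):
    """XOR every word into an accumulator that starts at zero."""
    return reduce(xor, words, 0)

message = bytes([0b11010110, 0b10101001, 0b01000111, 0b01101010])
print(format(vertical_parity(message), "08b"))   # prints 01010010

# With one-bit words the result is just the parity of the whole buffer.
print(vertical_parity([1, 0, 1, 1]))             # prints 1
```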
To see how a vertical parity can be performed with less hardware than a checksum, take a look at the next figure:
This figure shows an eight bit shift register and an exclusive-or gate. Initially, the shift register is filled with 0's. As each bit is put into it, the new bit is exclusive-ored with the contents of the eighth cell in the register. When the entire message has been passed through the shift register, it contains the vertical parity.
Here's an example of passing a 32 bit message through the unit:
00000000 11010110101010010100011101101010  (as we start, the shift register is empty)
00000001 1010110101010010100011101101010
00000011 010110101010010100011101101010
00000110 10110101010010100011101101010
00001101 0110101010010100011101101010
00011010 110101010010100011101101010
00110101 10101010010100011101101010
01101011 0101010010100011101101010
11010110 101010010100011101101010  (at this point, the shift register contains the first byte of the message)
10101100 01010010100011101101010
01011001 1010010100011101101010
10110011 010010100011101101010
01100111 10010100011101101010
11001111 0010100011101101010
10011111 010100011101101010
00111111 10100011101101010
01111111 0100011101101010  (it now contains the vertical parity of the first two bytes of the message)
11111110 100011101101010
11111100 00011101101010
11111001 0011101101010
11110011 011101101010
11100111 11101101010
11001110 1101101010
10011100 101101010
00111000 01101010  (first three bytes)
01110000 1101010
11100001 101010
11000010 01010
10000101 1010
00001010 010
00010100 10
00101001 0
01010010  (and the vertical parity of the whole 32 bit message)
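The trace can be reproduced with a few lines of Python that model the register as a list of cells. This is a software sketch of the circuit as described, with names invented for the example.

```python
def shift_register_parity(bits, width=8):
    """Simulate the shift register: each incoming bit is XORed with the
    bit leaving the leftmost cell before being shifted in on the right."""
    reg = [0] * width                 # reg[0] is the leftmost (eighth) cell
    for ch in bits:
        new_bit = int(ch) ^ reg[0]
        reg = reg[1:] + [new_bit]     # shift left by one cell
    return "".join(str(b) for b in reg)

message = "11010110101010010100011101101010"
print(shift_register_parity(message[:8]))    # first byte: 11010110
print(shift_register_parity(message[:16]))   # two bytes:  01111111
print(shift_register_parity(message))        # whole message: 01010010
```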
Notice that a checksum or a vertical parity is much more efficient than ECC (in the sense that it doesn't need as many added bits), but it isn't capable of correcting errors.
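The problem with checksums is that a 1-bit error in the data turns into a 1-bit error in the code: with vertical parity, a flipped data bit flips exactly one bit of the signature, and a second error in the same bit position of another word flips it back. If you have a burst of noise, the odds are far too good that you'll end up with something that still looks correct, even though it isn't. The next approach, CRC checks, "smears" the results of the parity calculations through the signature, reducing the likelihood of that happening.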
Mathematical Digression: Modulo-2 Arithmetic
Taking a bitwise exclusive-or in place of performing an addition is an example of "Modulo-2 Arithmetic," which is one form of "polynomial arithmetic." I've seen one author call it "CRC arithmetic."
Modulo-2 arithmetic is an arithmetic scheme; like most of the oddities that mathematicians like to study it seems completely useless to a non-mathematician at first glance but turns out to have some very practical applications. In this case, the practical application is in developing CRC checks.
The basic idea of modulo-2 arithmetic is just that we are working in binary, but we don't have a carry in addition or a borrow in subtraction. This means:
- Addition and subtraction become the same operation: just a bit-wise exclusive-or. Because of this, the total ordering we expect of integers is replaced by a partial ordering: one number is greater than another iff its left-most 1 is farther left than the other's. This will have an impact on division, in a moment.
- Multiplication is just like multiplication in ordinary arithmetic, except that the adds are performed using exclusive-ors instead of additions.
- Division is like long division in ordinary arithmetic, except for two differences: the subtractions are replaced by exclusive-ors, and you can subtract any time the leftmost bits line up correctly (since, by the partial ordering described above, they are regarded as equal in this case).
It'll probably help to show examples of modulo-2 multiplication and division:
   1101
   0110
   ----
   0000
  11010
 110100
0000000
-------
0101110
        1101
     -------
0110)0101110
     0110
     ----
      0111
      0110
      ----
       0011
       0000
       ----
        0110
        0110
        ----
        0000
Notice that the first subtraction is possible in modulo-2 arithmetic, while it wouldn't be possible in normal arithmetic.
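Both operations are short to write in Python when the bit strings are held as integers. The function names below are assumptions for this sketch; the numbers are the ones from the worked examples.

```python
def mod2_mul(a, b):
    """Shift-and-add multiplication with XOR in place of addition."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def mod2_divmod(dividend, divisor):
    """Long division with XOR in place of subtraction.

    We may subtract whenever the leftmost 1 of the divisor lines up with
    the leftmost 1 of what is left of the dividend.
    """
    if divisor == 0:
        raise ZeroDivisionError("modulo-2 division by zero")
    quotient = 0
    width = divisor.bit_length()
    while dividend.bit_length() >= width:
        shift = dividend.bit_length() - width
        quotient ^= 1 << shift
        dividend ^= divisor << shift
    return quotient, dividend         # what is left over is the remainder

product = mod2_mul(0b1101, 0b0110)
print(format(product, "07b"))                      # prints 0101110
quotient, remainder = mod2_divmod(product, 0b0110)
print(format(quotient, "04b"), remainder)          # prints 1101 0
```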
One last thing to say here is that most of the time, "modulo-2 addition" of two numbers means adding them and keeping the remainder after dividing by 2, so the answer is 0 or 1. Here, we're performing that arithmetic separately on each coefficient of the polynomial (that is, on each bit). Easy to get confused....
Cyclic Redundancy Checks
I'm going to be following some of Peterson & Brown's notation here...
- k is the length of the message we want to send, i.e., the number of information bits.
- n is the total length of the message we will end up sending: the information bits followed by the check bits. Peterson and Brown call this a code polynomial.
- n-k is the number of check bits. It is also the degree of the generating polynomial. The basic (mathematical) idea is that we're going to pick the n-k check digits in such a way that the code polynomial is divisible by the generating polynomial. Then we send the data, and at the other end we look to see whether it's still divisible by the generating polynomial; if it's not then we know we have an error, if it is we hope there was no error.
The way we calculate a CRC is we establish some predefined n-k+1 bit number P (called the Polynomial, for reasons relating to the fact that modulo-2 arithmetic is a special case of polynomial arithmetic). Now we append n-k 0's to our message, and divide the result by P using modulo-2 arithmetic. The remainder is called the Frame Check Sequence. Now we ship off the message with the remainder appended in place of the 0's. The receiver can either recompute the FCS and see if it gets the same answer, or it can just divide the whole message (including the FCS) by P and see if it gets a remainder of 0!
As an example, let's set a 5-bit polynomial of 11001, and compute the CRC of a 16 bit message:
      ---------------------
11001)10011101010101100000
      11001
      -----
       1010101010101100000
       11001
       -----
        110001010101100000
        11001
        -----
         00011010101100000
            11001
            -----
             0011101100000
               11001
               -----
                 100100000
                 11001
                 -----
                  10110000
                  11001
                  -----
                   1111000
                   11001
                   -----
                     11100
                     11001
                     -----
                      0101
Notice that when I did the division, I didn't bother to keep track of the quotient; we don't care about the quotient. Our only goal here is to get the remainder (0101), which is the FCS.
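The same calculation as a Python sketch working directly on bit strings. The helper names are invented here; the polynomial and message are the ones from the example.

```python
def mod2_remainder(bits, poly):
    """Remainder of the modulo-2 long division of bits by poly."""
    work = [int(b) for b in bits]
    p = [int(b) for b in poly]
    for i in range(len(work) - len(p) + 1):
        if work[i]:                      # leftmost bits line up: subtract
            for j, pb in enumerate(p):
                work[i + j] ^= pb
    return "".join(str(b) for b in work[-(len(p) - 1):])

def crc_append(message, poly):
    """Append n-k zeros, divide, and put the remainder in their place."""
    fcs = mod2_remainder(message + "0" * (len(poly) - 1), poly)
    return message + fcs

poly = "11001"
message = "1001110101010110"
frame = crc_append(message, poly)
print(frame[-4:])                        # the FCS: 0101
print(mod2_remainder(frame, poly))       # receiver's check: 0000
```

The second print shows the receiver's shortcut: dividing the whole frame, FCS included, leaves a remainder of zero when nothing was damaged.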
CRCs can actually be computed in hardware using a shift register and some number of exclusive-or gates (sounds a bit like the vertical parity calculation, doesn't it?).
The key insight is that we can perform a subtraction any time there is a 1 in the bit that lines up with the most significant bit of the polynomial, and we can perform that subtraction by performing an exclusive-or of the bits corresponding to 1's in all the other places of the polynomial. This lets us implement the CRC calculation by using a shift register similar to the one for vertical parity.
You can see how it's done by comparing the division we performed above to the circuit in the next figure. The figure shows a shift register; the string to be checked is inserted from the right. Whenever a "1" exits the left side of the shift register, it means there is a 1 in the most significant bit of the part of the dividend we're working with; since we're working in modulo-2 arithmetic, this means we can do a subtraction. What this works out to is:
- The most significant bit will be xored away, so it falls off to the left.
- For every other bit with a "1" in the divisor, perform an exclusive-or with the corresponding bit in the number being checked.
- For bits with a "0" in the divisor, do nothing.
The figure below attempts to show this for the example CRC polynomial. Each of the square boxes is a position in the shift register, where a value can be stored. Every round box is a position where we may or may not perform an exclusive-or, depending on the polynomial we're using. You can see the value of the CRC polynomial written above the round boxes.
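A software model of that circuit, as a rough sketch: the register holds n-k bits, the input (message followed by n-k zeros) enters from the right, and the register is XORed with the low bits of the polynomial whenever a 1 falls off the left. The function name and the integer representation of the register are choices made for this example.

```python
def crc_shift_register(bits, poly):
    """Feed bits through an (n-k)-cell shift register with XOR taps."""
    width = len(poly) - 1
    taps = int(poly[1:], 2)          # the polynomial minus its leading 1
    mask = (1 << width) - 1
    reg = 0
    for ch in bits:
        out = (reg >> (width - 1)) & 1        # bit exiting on the left
        reg = ((reg << 1) & mask) | int(ch)   # shift, new bit enters on right
        if out:
            reg ^= taps              # XOR at every 1 in the divisor
    return format(reg, "0{}b".format(width))

# Message from the worked example, followed by four 0's.
print(crc_shift_register("1001110101010110" + "0000", "11001"))  # prints 0101
```

The most significant bit of the polynomial never needs its own gate: it always cancels the 1 that just left the register.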
I keep calling this a polynomial, and writing it as a binary number. Frequently, you'll find a CRC polynomial written in polynomial form; the one we've been using would be written as x^4 + x^3 + x^0.
So, just a little bit more: there is quite a bit of theory behind choosing a "good" CRC polynomial; the choice of polynomial can be tuned to make sure that any burst of some given length can be caught.
Properties of Cyclic Redundancy Checks
The paper lists a few properties of CRCs, which deserve mention:
- If the rightmost place of the generating polynomial were 0, the generating polynomial would be divisible by X. That being the case, any polynomial divisible by P would also be divisible by X, and so the last bit of the check bits would always be 0. That would be useless, so we always have a 1 in the least significant bit of the generating polynomial.
That's a roundabout way of saying that if you're going to have an n-k+1 bit polynomial, the two outlying bits should be 1's, otherwise you've effectively got a shorter polynomial than that.
- Any error checking code that can always detect a two-bit error can always correct any one-bit error. In the most ridiculous case, we can just check by flipping every bit of the received message; whenever we flip the wrong bit we get a two-bit error, when we flip the right one we get a 0-bit error. Of course, Hamming's scheme is a lot more clever than this!
- Any cyclic code whose generating polynomial is of degree n-k will always detect any burst error of length n-k or less.
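The "ridiculous" correction procedure from the second property can be sketched in a few lines. This is an illustration under an explicit assumption: the frame must be short enough that the code really does detect every two-bit error. For the polynomial 11001 used earlier, a total length of 15 bits satisfies this; the 11-bit message and the function names are made up for the example.

```python
def mod2_remainder(bits, poly):
    """Remainder of the modulo-2 long division of bits by poly."""
    work = [int(b) for b in bits]
    p = [int(b) for b in poly]
    for i in range(len(work) - len(p) + 1):
        if work[i]:
            for j, pb in enumerate(p):
                work[i + j] ^= pb
    return "".join(str(b) for b in work[-(len(p) - 1):])

def flip(bits, i):
    return bits[:i] + ("1" if bits[i] == "0" else "0") + bits[i + 1:]

def correct_single_bit(frame, poly):
    """Try every one-bit flip; keep the one that divides evenly."""
    if "1" not in mod2_remainder(frame, poly):
        return frame                      # nothing to correct
    for i in range(len(frame)):
        candidate = flip(frame, i)
        if "1" not in mod2_remainder(candidate, poly):
            return candidate
    return None                           # more than one bit is wrong

poly = "11001"
message = "10110011101"                   # 11 bits, so the frame is 15 bits
frame = message + mod2_remainder(message + "0000", poly)
received = flip(frame, 6)                 # one bit damaged in transit
print(correct_single_bit(received, poly) == frame)
```

Flipping the wrong bit leaves a two-bit error, which the code is guaranteed to detect at this length, so only the right flip yields a zero remainder.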
There are a few "classic" CRC polynomials of given lengths which are so well established that they've been given names.
|Name|As Polynomial|As Number|
|CRC12|X^12 + X^11 + X^3 + X^2 + X + 1|1100000001111|
|CRC16|X^16 + X^15 + X^2 + 1|11000000000000101|
|CRC-CCITT|X^16 + X^12 + X^5 + 1|10001000000100001|
|CRC32|X^32 + X^26 + X^23 + X^22 + X^16 + X^12 + X^11 + X^10 + X^8 + X^7 + X^5 + X^4 + X^2 + X + 1|100000100110000010001110110110111|
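Converting between the two columns is mechanical: each exponent in the polynomial sets one bit of the number. A small sketch, with `poly_to_bits` as a name chosen for this example:

```python
def poly_to_bits(exponents):
    """Turn a list of exponents (X^e terms) into the binary-number form."""
    value = 0
    for e in exponents:
        value |= 1 << e
    return format(value, "b")

print(poly_to_bits([4, 3, 0]))            # the example polynomial: 11001
print(poly_to_bits([16, 15, 2, 0]))       # CRC16
print(poly_to_bits([16, 12, 5, 0]))       # CRC-CCITT
print(poly_to_bits([32, 26, 23, 22, 16, 12, 11, 10, 8, 7, 5, 4, 2, 1, 0]))  # CRC32
```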