Error Correcting Codes

Most of us have used error-correcting codes, perhaps without knowing it. The International Standard Book Number (ISBN) system identifies every book with a ten-digit number, such as 0-226-53420-0. The first nine digits are the actual number but the tenth is added according to a mathematical formula based on the first nine. If a single one of the digits is changed, as in a misprint when ordering a book, a simple check verifies that something is wrong.

Other common numbers with check digits include the Chemical Abstracts Service (CAS) method of identifying chemical compounds and the United States Postal Service (USPS) nine-digit "zip+four" code, which when it is bar-coded includes a tenth check digit hidden to the human eye but visible to the bar code reader.

Some high-end computer memory chips, "ECC RAM," use extended nine-bit bytes. The ninth bit, or "check bit," is always set so that the total number of ones in the extended byte is even. (This rule is called a "checksum." Specifically, if the sum of the nine bits is not even, an error has been detected.) This ninth bit is not used or even seen by your software, but the computer hardware itself will signal an error if it ever reads an extended byte that contains an odd number of ones.

All these processes can detect a single error within a very short message: nine decimal digits for the ISBN and USPS codes, a varying number of decimal digits for the CAS numbers, eight binary digits for ECC RAM. However, they cannot correct any error that is detected. Furthermore, some combinations of two or more errors occurring within the message will not be detected.

Coding Theory

Coding theory deals with ways of extending the check-bit and checksum ideas to enable detection and even correction of large numbers of errors within messages. It builds on some beautiful mathematical ideas that we can hint at here. Let's focus on binary messages, the stuff of which computer communications is built.

The first thing is to break up the message into "words" of fixed size. We will focus on how to correct errors occurring within individual words.

If the words are n bits long, then there are 2ⁿ possible messages each word can carry. The set of all such words has an interesting geometry based on the Hamming distance. The distance between two words is just the number of places in which they differ.

For example, let's consider the case n=7. The distance between 0100111 and 1100101 is two, because these words differ in the first and sixth places.

Words can also be added to each other. This is done by the usual odd-even rules of addition, performed bit by bit. Specifically, 0+0=0, 0+1 = 1, 1+0 = 1, and 1+1 = 0. Thus 0100111 + 1100101 = 1000010.

The Geometry of Error-Correcting Codes

With a little imagination you might recognize these words as vectors. They have seven components rather than the usual two or three you might be used to, and their components are limited to just zeros and ones, but otherwise they behave in most respects like any vector. Just like we can think of vectors representing points in 2D or 3D space, we can think of words as "points" in some "7D" space. This analogy is helpful because it lets us picture properties of codes. No, you don't need to visualize things in 7D: you just use your 2D and 3D intuition, and then generalize.

In particular, consider the effect of altering bits within a word. Altering one bit produces a word which, by definition, is exactly one unit distant from the original. Altering two bits produces a word two units away, and so on. This gives us two radically different but equivalent descriptions of certain sets of words, depending on whether we focus on alteration or on distance. I will illustrate.

Pick any word you like. Consider the set of words that differ from this word by a change of at most a single bit. If you put the original word into a channel, these are the words that might come out the other end if the channel makes at most one error. That's the first description.

In terms of distance, think of the original word as a point. We are talking about the set of all points within one unit of the first. In 2D, that's the inside of a circle. In 3D, that's the inside of a sphere. In 7D, it's--well, let's still call it a sphere.

So here's the idea: a noisy channel typically displaces input words ("points"). If it is most likely to displace them into spheres of limited radius about the original words, then we can think of each word that comes out as being some random point within a sphere. We just have to find the center of that sphere.

That's hard to do, unless we restrict the words going into the channel. Let's make them as far apart (in Hamming distance) as possible. For instance, if we use words that are no closer than three units to each other, then their unit spheres will not overlap.

If the channel randomly displaces one of them by one unit, it still lies in the correct unit sphere. In principle, we can find the center of that sphere. That tells us exactly what the original word was and lets us correct the error.

Multiple errors can be corrected by placing all our input words at distances of five, seven, ..., Hamming units from each other. Mathematically, it's the same problem as packing marbles efficiently in a box.

Wait, you're going to say, but don't we have to use all the words for our messages? The answer is no. It works like any other substitution code. For example, it is possible to find 16 words in the 7D space that all lie at least three units from each other. So what we will do is take our original message, break it into blocks of four bits--of which there are exactly 16 possibilities--and assign each one of those 16 blocks to one of the 16 words in 7D space. In effect, we have added three bits of redundancy to our message by encoding every four bits into words that are seven bits long. But we can now actually correct any single error that might occur in any four-bit block of the original message.

(Fitting 16 unit spheres into 7D space without overlap is quite a feat because it is perfect: you can see from the figures that each unit sphere contains exactly eight points, and it is evident that 16 * 8 = 128, the total number of points in 7D space. Every possible point is accounted for and none is left out.)

The task of coding theory is to find good sphere packings in these spaces and to work out efficient algorithms for encoding and decoding the messages. Claude Shannon, in his seminal work in this subject a half century ago, proved that finding good sphere packings is always possible. The idea of his proof is startling: he showed that if you use a space of sufficiently high dimension, then a set of words (sphere centers) taken at random is likely to work! Unfortunately, Shannon's Theorem does not provide efficient encoding or decoding methods.

The Hamming Codes

Clever people have found many ways to produce efficient error-correcting codes. Usually they have exploited mathematical properties of these multi-dimensional spaces arising from the fact we can add vectors and, in many cases, multiply them. (The multiplication is not what you might think it is, and is not necessarily simple: it's related to deeper properties of these spaces that enable us to think of them as a generalization of numbers, or finite fields.)

To illustrate what an error-correcting code (ECC) can do, I decided to implement a family of codes, the Hamming codes, which can correct one error per word, because these are exceedingly simple to encode and decode and they correspond to perfect sphere packings.

The smallest interesting Hamming code is a (7, 4) linear code. This means:

	The encoded words are seven bits long.
	The original message words are four bits long.
	Any two encoded words add up to a third encoded word.

This business about addition is a seemingly minor detail, and you can safely ignore it, but it is a key property to making the encoding and decoding very simple.

The following table shows the entire (7, 4) Hamming code.

Message word	Encoded word
0000	0000 000
0001	0001 111
0010	0010 110
0011	0011 001
0100	0100 101
0101	0101 010
0110	0110 011
0111	0111 100
1000	1000 011
1001	1001 100
1010	1010 101
1011	1011 010
1100	1100 110
1101	1101 001
1110	1110 000
1111	1111 111

I have visually broken each code word into a sub-block of four bits followed by three more bits. Evidently, each sub-block is just the original message word. We can think of the additional three bits as being check bits. They are generated by simple formulas:

The first check bit (in the fifth place) is the sum of the message bits in the second, third, and fourth places.
The second check bit (in the sixth place) is the sum of the message bits in the first, third, and fourth places.
The third check bit (in the seventh place) is the sum of the message bits in the first, second, and fourth places.

(Remember, when we sum bits, we are thinking of 1's as "odd" and 0's as "even." So, for example, 1+1+1 = 1 because odd+odd+odd is always odd.)

Formulas like this make it simple and quick to encode each block of four message bits. Technically, such formulas are maintained in data structures call "generator matrices."

For decoding purposes, instead of one checksum, there are three. In any valid encoded word,

The sum of bits two through five must be zero. That is, there must be an even number of ones in places two through five.
The sum of bits one, three, four, and six must be zero.
The sum of bits one, two, four, and seven must be zero.

Technically, such checksums are maintained in data structures called "parity-check matrices."

Now suppose one or more of these checksums is one. Surely an error has occurred. We know it is possible to fix it (assuming only one bit of the seven was altered in the channel). But how do we figure it out?

For the (7,4) Hamming code, we have three checksums, each of which is either zero or one. At least one of them is a one when an error has occurred. If we write the three checksums down in order, this gives seven possible sequences: 001, 010, 011, 100, 101, 110, and 111. Each of these can be read as a binary number, giving values ranging from 1 through 7. The binary value tells us the location of the bit that changed.

For example, suppose we want to transmit a 0110 block through the channel. It will be encoded as 0110001. Imagine that the fourth bit is changed during transmission from 0 to 1, so that 0111001 is received. The checksums are 1, 0, 0. In binary that's a four, telling us the fourth bit is incorrect. So we flip it back to a 0, drop the three check bits (they have done their job), and record a 0110 as having been received. Mission accomplished!

As I indicated earlier, there is more than one Hamming code. For every integer m >= 3, there is a (2^m-1, 2^m-1-m) Hamming code. The first three combinations are (7, 4), (15, 11), and (31, 26). The latter, for instance, adds five check bits to each block of 26 message bits, enabling detection of a single error in each sequence of 26 message bits. That's less protection against errors than the one-out-of-four for the (7, 4) Hamming code, but it's much more efficient, since messages expand in length only by 31/26 instead of by 7/4. As always, there is a trade-off between ability to correct errors and level of redundancy.

Constructing the Hamming codes is simple but difficult to describe without introducing specialized terminology from coding theory. If you are curious how it's done, check out the Avenue script Hamming.Make for creating generator and parity-check matrices for any Hamming code.

Interleaving

The theory presented so far is capable of creating codes that can detect a limited number of errors occurring randomly within fixed-size blocks of the message. However, the kinds of errors we might encounter in vector steganography will rarely be of this type. Jittering errors can occur, for instance, when a vertex is moved. That will change about half the bits that were jittered into both coordinates, creating a burst of errors. That's not random enough. Embedding errors can occur if vertices are moved, too. In that case, the lengths on each side of the vertex can be altered, thereby destroying any part of the message coded into those two lengths. Again, a burst of errors occurs.

A simple trick lets us continue to use the familiar error-correcting codes to protect against short bursts. It is called interleaving. It works by spreading the bits to be transmitted throughout the entire message.

You can picture this by imagining the encoded message being written in a circle.

The letters "a", "b", "c", and so on, stand for the zeros and ones of a 26-bit message. They show the proper sequence of the bits.

Suppose we would like to correct any burst of three consecutive errors. To be concrete, assume we are using a code capable of correcting a single error in any block of eight bits; more than one error in any block of eight bits will destroy the block. To defend against this, we will skip every eight positions in laying the message around the circle.

The message bits will be transmitted in the resulting order, proceeding clockwise around the circle. Now, any burst of three errors must occur in eight distinct blocks. Each block can correct its error, provided no additional errors occur in the block.

The ArcView steganography extension offers this form of interleaving