| |||||||||
Arithmetic coding in data compression is a process that encodes an entire message -- i.e., an ordered sequence of symbols containing data -- into a single number, a fraction n that is greater than or equal to 0 and less than 1 (<math>0 \le n < 1<math>). There is great similarity between arithmetic coding and Huffman coding -- in fact, it has been shown that Huffman is just a specialized case of arithmetic coding -- but because arithmetic coding translates the entire message into one number represented in base b, rather than translating each symbol of the message into a series of digits in base b, it will often approach optimal entropy encoding much more closely than Huffman can.
Arithmetic coding starts by determining a model of the data -- basically a prediction of what patterns will be found in the symbols of the message. The more accurate this prediction is, the closer to optimality the arithmetic coder will be able to achieve. For example, a simple, static model for describing the output of a particular monitoring instrument over time might be "60% chance of symbol NEUTRAL; 20% chance of symbol POSITIVE; 10% chance of symbol NEGATIVE; 10% chance of symbol END-OF-DATA". (Models can handle other alphabets than the simple four-symbol set chosen for this example, of course. More sophisticated models are also possible: higher-order modelling changes its estimation of the current probability of a symbol based on the symbols that precede it, so that in a model for English text, for example, the percentage chance of "u" would be much higher when it followed a "Q" or a "q". Models can even be adaptive, so that they continuously change their prediction of the data based on what the stream actually contains. Whatever model the encoder uses, however, the decoder must also have.)
At any point in time in coding the input stream of data,
The number of bits used to encode a character depends on how much of the range (0, 1) the character's probability is.
The larger the range, the less bits it takes to code the character. The smaller the range, the more bits it takes to code the character.
An arithmetic encoder
As each symbol is processed, the encoder will restrict the output to a smaller interval.
Let N be the number of distinct symbols in the input; let x1, x2 ... xN represent the symbols, and let P1, P2 ... PN represent the probability of each symbol appearing. At each step in the process, the output is restricted to the current interval [y, y+R). Partition this interval into N disjoint subintervals:
\begin{matrix}
\end{matrix} <math>
Therefore the size of Ii is PiR. If the next symbol is xi, then restrict the output to the new interval Ii.
Note that at each stage, all the possible intervals are pairwise disjoint. Therefore a specific sequence of symbols produces exactly one unique output range, and the process can be reversed.
Since arithmetic encoders are typically implemented on binary computers, the actual output of the encoder is generally the shortest sequence of bits representing the fractional part of a rational number in the final interval.
Suppose our entire input string contains M symbols: then xi appears exactly PiM times in the input. Therefore, the size of the final interval will be
\begin{matrix}
\\
\end{matrix} <math>
By Shannon's theorem, this is the total entropy of the original message. Therefore arithmetic encoding is near-optimal entropy encoding.
However, IBM and other companies own patents in the United States and other countries on algorithms essential for implementing an arithmetic encoder.
An earlier (open content) version of the above article was posted on .