Andrew Kensler

Sep 14, 2020

A Tour of the Tiny and Obfuscated Image Decoder

I was mystified when I first came across the “Most Inflationary” winning entry in the 2018 IOCCC. Written by the incredibly prolific Fabrice Bellard, this tiny program with just 4KB of source code emitted a 128×128 resolution version of the (in)famous Lena test image. Voodoo! The IOCCC judges wrote “We could understand some of the arithmetic but none of the magic.” I was determined to figure out the magic, and was inspired by Fabien Sanglard to try writing up how it works.

Using the program is straightforward enough, and both the source and the details of running it can be found on its IOCCC page. I used GCC to compile it and piped the output directly to ImageMagick to view it:

gcc -O3 -o prog prog.c
./prog | display -                        # Show embedded image
./prog d < vintage_cars.bin | display -   # Show an external image

Running it with no arguments like the first case will decompress the embedded data and produce the following image (zoomed for detail):

image.png

Admittedly, this has a fair number of compression artifacts, but it’s still pretty darned impressive for 2439 bytes of image data and 1544 bytes of source code for the decompressor.

So let’s see how it works. This will be a deep dive and by far my longest post yet (I’ve erred on the side of too much detail), so make yourself comfortable! From here on in, I’m going to show the pieces of the original program juxtaposed with the corresponding pieces from my deobfuscated version. When I mention line numbers, I’ll be referring to the latter unless noted otherwise. I’ll typically use italicized numbers in parentheses to denote these.

For my version I have expanded the lone macro (used to compact for loops), reformatted the source, and renamed the variables. The only structural changes that I have made are updating to modern-style function headers and pushing declarations down to be as local in scope as possible (C99 style). In one case, I gave a variable different names when pushing it down into two blocks where it had been reused in different ways.

Input Header

We’ll start at the end of the program with the main() function and then work our way backwards to the beginning; we’ll essentially go from top to bottom in terms of control flow. Here’s the main() function:

112: main(D){
113: int a,l,L,M,g,N;
114: s=D>1?256:1968;
115: Q();
116: g=n(5);
117: f=1<<g;
118: N=f-n(5);
119: O=n(5);
120: P=n(5);
121: W(0,0,g);
122: printf("P6 %d %d 255 ",f,N);
123: e(a,N*f){
124: l=y[0][a];
125: L=y[1][a];
126: M=y[2][a];
127: D=l-L;
128: K(D+M);
129: K(l+L);
130: K(D-M);
131: }
132: return 0;
133: }
228: int main( int argc )
229: {
230:     radix = argc > 1 ? 256 : 1968;
231:     renormalize_and_read();
232:     int level = decode_unsigned( 5 );
233:     width = 1 << level;
234:     int height = width - decode_unsigned( 5 );
235:     luma = decode_unsigned( 5 );
236:     chroma = decode_unsigned( 5 );
237:     extract_block( 0, 0, level );
238:     printf( "P6 %d %d 255 ", width, height );
239:     for ( int pixel = 0; pixel < height * width; pixel++ )
240:     {
241:         int Y  = image[ 0 ][ pixel ];
242:         int Cg = image[ 1 ][ pixel ];
243:         int Co = image[ 2 ][ pixel ];
244:         int temporary = Y - Cg;
245:         write_byte_clamped( temporary + Co );
246:         write_byte_clamped( Y         + Cg );
247:         write_byte_clamped( temporary - Co );
248:     }
249:     return 0;
250: }

The image is basically encoded in two distinct layers. The lower level uses a lossless entropy encoding scheme that stores only single bits and unsigned integers. We’ll see this in detail later. For now, it’s enough to know that the first line (230) checks for an argument to determine whether to expect the image on standard input or to decompress the image from an internal string. The data is stored slightly differently in the two modes and the related value here doubles as a flag. The second line (231) then primes the pump for decoding this bitstream. (However, this turns out to be superfluous as it would have been called later anyway.)

At the higher level, the first thing to be decoded (232–236) is a small header comprised of four integers. The first of these stores the image width as a power of two. Due to a fixed-size array used to store the decompressed image, the width is limited to 1024. The second integer stores how many lines shorter the image height is compared to the width. This will be zero for square images but means that only square or landscape oriented images are possible. The next two lines decode a pair of parameters related to the image quality. Higher numbers here give greater compression at the expense of image quality. The lowest legal values for each are 1 and should produce near lossless results. For the built-in Lena image, the four values in this header are 7, 0, 41, and 82, respectively.

PPM Output

The decoded image is written to standard output in the form of a PPM file. This is a wonderfully simple format for writing that consists of a small header indicating the image type, resolution, and maximum color value. A single printf() suffices for this (238). Following this, the image data itself is simply a series of unsigned bytes in interleaved RGB order for each pixel, with the pixels in left-to-right, top-to-bottom order (245–247). Because of the lossy format, the decompressed RGB values may fall outside the 0 to 255 range, so a small helper function is used to clamp and then write them:

109: K(b){
110: putchar(b<0?0:b>255?255:b);
111: }
223: int write_byte_clamped( int byte )
224: {
225:     putchar( byte < 0 ? 0 : byte > 255 ? 255 : byte );
226: }
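
As an illustration of just how simple PPM is, here is a minimal sketch (my own, not part of the program) of a complete P6 writer; it emits a single red pixel:

#include <stdio.h>

/* A complete P6 file: the whitespace-separated header, then raw
   interleaved RGB bytes. This writes a single red pixel. */
int main( void )
{
    printf( "P6 1 1 255 " );
    putchar( 255 );   /* R */
    putchar( 0 );     /* G */
    putchar( 0 );     /* B */
    return 0;
}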

YCgCo Color Space

The compressed image data is not actually encoded using the usual RGB color space, though of course it outputs this at the end. Instead, it uses the YCgCo color space. This is easily transformed back to RGB (244–247) but has some nice properties where image compression is concerned. Notably, it decorrelates the luma from the chroma data.
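
For reference, here is a sketch (mine, not code from the program) of the forward RGB to YCgCo transform that an encoder would pair with the inverse on lines 244–247; it follows directly from solving those three equations:

/* Solving R = (Y - Cg) + Co, G = Y + Cg, B = (Y - Cg) - Co for the
   transformed values. Integer division truncates here, so an exactly
   reversible integer encoder would use a lifting scheme instead. */
void rgb_to_ycgco( int r, int g, int b, int *y, int *cg, int *co )
{
    *y  = ( 2 * g + r + b ) / 4;   /* luma */
    *cg = ( 2 * g - r - b ) / 4;   /* green-magenta axis */
    *co = ( r - b ) / 2;           /* orange-blue axis */
}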

To see this, let’s look at just the Y (luma) color channel of the built-in image. We can do this by zeroing out the Cg (green–magenta) and Co (orange–blue) chroma channels before transforming back to RGB:

y.png

Note how the result is simply the image in greyscale.

We can also look at the Cg channel in isolation by setting the Y values in all pixels to 128 and the Co values in all pixels to zero. This will show us how the Cg data looks added to a neutral grey image:

cg.png

And we can do likewise to view the Co channel in isolation by setting the Y values to 128 and the Cg values to zero:

co.png

Hopefully it is clear now why this transform is useful: the majority of the actual image detail is contained in the Y channel with only a little bit in the Cg and Co channels. Moreover, the human eye is more sensitive to luminance than chrominance so by separating these out the encoder can compress them with different degrees of lossiness. This is why there are not one but two image quality factors (235–236).

Many lossy image and video codecs, such as JPEG, offer an option called chroma subsampling. When used, the luminance channel is typically stored at full resolution while the chrominance channels are downscaled and stored as a lower resolution image by the encoder, then upscaled by the decoder. The tiny image decoder doesn’t bother with this. Instead, the encoded images just crank up the lossiness factor for the chrominance channels (which JPEG images usually do too).

It’s also worth noting (241–243) that the decoder decompresses the Y, Cg, and Co channels to a planar format in memory before writing the transformed RGB values in the PPM-required interleaved form.

Quadtree Traversal

The actual extraction of the image to these planes in memory occurs completely within the function called at line 237. This is the main routine for the higher level decoding of the image and is by far the largest function in the program:

 61: W(z,l,g){
 62: int v,a,j,I,d,k,A,*o,c,B,q,C,h,w,J;
 63: if(g>5||g>2&&t(g-3)){
 64: c=1<<--g;
 65: e(a,4)W(z+a%2*c,l+a/2*c,g);
 66: }
 67: else{
 68: c=1<<g;
 69: d=c*c;
 70: q=n(73);
 71: e(A,3){
 72: o=y[A]+l*f+z;
 73: B=A>0;
 74: e(a,d)i[a]=0;
 75: e(a,d){
 76: if(t(61+g*2+B))break;
 77: a+=n(5+B*10);
 78: k=1-2*t(3);
 79: i[a]=k*(n(25+(B+(a<d/8)*2)*10)+1)*(A?P:O);
 80: }
 81: if(!q){
 82: v=0;
 83: e(a,c){
 84: v+=l?o[-f+a]:0;
 85: v+=z?o[a*f-1]:0;
 86: }
 87: *i+=z&&l?v/2:v;
 88: }
 89: R(i+d,1,i,1,c,c,10);
 90: R(o,f,i+d,c,1,c,10+g);
 91: if(q){
 92: C=q<17;
 93: w=C?9-q:q-25;
 94: e(a,c)e(j,c){
 95: e(I,2){
 96: h=a*w+w;
 97: J=h&7;
 98: h=(h>>3)+j+I;
 99: if(k=h<0)h=(h*8+w/2)/w-2;
100: h=h<c?h:c-1;
101: i[I]=k^C?o[h*f-1]:o[-f+h];
102: }
103: o[C?j*f+a:a*f+j]+=*i*(8-J)+i[1]*J+4>>3;
104: }
105: }
106: }
107: }
108: }
143: void extract_block( int left, int top, int level )
144: {
145:     if ( level > 5 || level > 2 && decode_bit( level - 3 ) )
146:     {
147:         int size = 1 << --level;
148:         for ( int corner = 0; corner < 4; corner++ )
149:             extract_block( left + corner % 2 * size,
150:                            top  + corner / 2 * size,
151:                            level );
152:     }
153:     else
154:     {
155:         int size = 1 << level;
156:         int area = size * size;
157:         int predictor = decode_unsigned( 73 );
158:         for ( int channel = 0; channel < 3; channel++ )
159:         {
160:             int *output = image[ channel ] + top * width + left;
161:             int is_chroma = channel > 0;
162:             for ( int coefficient = 0; coefficient < area; coefficient++ )
163:                 buffer[ coefficient ] = 0;
164:             for ( int coefficient = 0; coefficient < area; coefficient++ )
165:             {
166:                 if ( decode_bit( 61 + level * 2 + is_chroma ) )
167:                     break;
168:                 coefficient += decode_unsigned( 5 + is_chroma * 10 );
169:                 int sign = 1 - 2 * decode_bit( 3 );
170:                 buffer[ coefficient ] =
171:                     sign *
172:                     ( decode_unsigned( 25 + ( is_chroma + (
173:                         coefficient < area / 8 ) * 2 ) * 10 ) + 1 ) *
174:                     ( channel ? chroma : luma );
175:             }
176:             if ( !predictor )
177:             {
178:                 int dc = 0;
179:                 for ( int pixel = 0; pixel < size; pixel++ )
180:                 {
181:                     dc += top ? output[ -width + pixel ] : 0;
182:                     dc += left ? output[ pixel * width - 1 ] : 0;
183:                 }
184:                 *buffer += left && top ? dc / 2 : dc;
185:             }
186:             compute_idct( buffer + area, 1, buffer, 1, size,
187:                           size, 10 );
188:             compute_idct( output, width, buffer + area, size, 1,
189:                           size, 10 + level );
190:             if ( predictor )
191:             {
192:                 int is_leftward = predictor < 17;
193:                 int slope = is_leftward ? 9 - predictor : predictor - 25;
194:                 for ( int y = 0; y < size; y++ )
195:                     for ( int x = 0; x < size; x++ )
196:                     {
197:                         int weight;
198:                         for ( int neighbor = 0; neighbor < 2; neighbor++ )
199:                         {
200:                             int reference = y * slope + slope;
201:                             weight = reference & 7;
202:                             reference = ( reference >> 3 ) + x + neighbor;
203:                             int is_clipped = reference < 0;
204:                             if ( is_clipped )
205:                                 reference = (
206:                                     reference * 8 + slope / 2 ) / slope - 2;
207:                             reference = reference < size ?
208:                                 reference : size - 1;
209:                             buffer[ neighbor ] = is_clipped ^ is_leftward ?
210:                                 output[ reference * width - 1 ] :
211:                                 output[ -width + reference ];
212:                         }
213:                         output[ is_leftward ? x * width + y :
214:                                               y * width + x ] +=
215:                             *buffer * ( 8 - weight ) +
216:                             buffer[ 1 ] * weight + 4 >> 3;
217:                     }
218:             }
219:         }
220:     }
221: }

Let’s start at the top with the function header itself. The first two parameters provide the pixel coordinates of the upper-left corner of a square block to extract and the third parameter provides its level, used as a power of two to determine its size.

Extracting the image begins with a block at pixel (0,0) and a level large enough to cover the whole image (237). Note that this is why the image width must always be a power of two. While the height doesn’t need to be, the decoder will still extract a complete square’s worth of image data; it will just discard the lower rows when it writes the PPM (238–239).

The first thing this function does is decide whether to recursively split the block. If the block is larger than 32×32 (i.e., level 5) then it will always split it. If it is down to 4×4 (i.e., level 2) then it will never split it. For all the levels in between, it decodes a single bit to determine whether to split it or not (for now, ignore the parameter to decode_bit()). Thus the recursion always terminates with block sizes between 4×4 and 32×32.

When it does recurse, it splits the block in half in each dimension and then decodes the quarters in order from upper-left to upper-right to lower-left to lower-right (147–151). The data for each of the leaf blocks is laid out consecutively in the entropy encoded bitstream and so this gives rise to an implicit quadtree organization for the image data. Here is the recursion process for the internal image and the order in which it is decoded (note that there are not actually any 32×32 leaf blocks in this image):

quadtree.gif

If we were to draw a line between each decoded block in this order, we get something closely related to a Z-order curve except with non-uniform levels across the image:

zorder.png

So why is this ordering useful? First of all, it lets the encoder choose different block sizes with only a single bit per node of the quadtree where splitting is an option. Regions with less detail may be encoded efficiently with larger blocks while regions with finer detail can go down to smaller blocks. Contrast this with JPEG images where the blocks are always 8×8 regardless of the amount of detail. For this particular image, the breakdown by block size is:

Size Count
4×4 360
8×8 122
16×16 11
32×32 0

For a second reason, consider the locality of the ordering. Each successive block in the encoding tends to be fairly close spatially to the one before it. This means that each block has a better chance of being similar to the previous block and is therefore likely to compress better since compression is all about prediction. We can see this a bit better if we rearrange the blocks side-by-side in the order that they are decoded. Note how the adjacent blocks tend to be of similar size, color, and texture:

coherence.png

DC Prediction

In any event, when we reach a leaf node of the quadtree (153) it is time to actually extract a block to pixels in the image buffer. The first real step here is to read the prediction mode to use for this block (157). This is an integer between 0 and 33 and is reused for all three color channels comprising the block (158).

The job of the predictor is to use the already decoded pixels above and to the left of the current block to guess as much of the current block as possible. The more the decoder can do this on its own, the less the encoder has to encode into the bitstream.

Of the 34 prediction modes possible, 0 is special and represents the “DC prediction” mode (176–185). This simply takes the average of the group of pixels immediately above and to the left of the current block (just across the border from it). If the block is already at the top of the image then it uses only the pixels to the left and vice versa for blocks on the left of the image. This average is then written as a constant value across the whole block. Note that there’s no obvious loop over the pixels to write the average into them. Instead, it uses a clever trick (184) that we’ll come back to later.

Of the 493 blocks in the built-in image, nearly half of them (240) use the DC prediction mode. Here they are with the predicted pixels shown and the other 253 blocks blacked out (note that the far upper-left block does use the DC mode as well; it just predicts black):

dc.png

Directional Prediction

So what about the 33 prediction modes that the other 253 blocks use? These modes, numbered 1 through 33, indicate that the block should use directional prediction, and the number indicates the direction itself. Here is how the mode numbers map to directions:

slopes.png

For example, mode 9 indicates that all the pixels to the immediate left of the block should be copied horizontally rightward across the block. Similarly, mode 25 will copy the pixels from just above the block downward across the block in vertical lines. Mode 17, on the other hand, will copy the pixels from both above and to the left downwards and to the right along diagonal lines. Modes 1 and 33 copy pixels along the other diagonal in opposite directions to each other, with 1 copying pixels up and to the right while 33 copies them down and to the left.

These correspond quite closely to the 33 directional intraprediction modes used in HEVC. HEVC has an additional “planar” prediction mode at 1 before the directional modes begin at 2, but otherwise the cardinal and diagonal directions match up (e.g., mode 1 here is equivalent to mode 2 in HEVC, mode 25 here matches mode 26 in HEVC, and so forth).

One notable difference is that in HEVC the directions are denser near horizontal and vertical and sparser along the diagonals. For simplicity the tiny decoder treats them more uniformly (192–193): each direction just steps in slope by 1/8th of a pixel with symmetry around the diagonal at mode 17.
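
Isolated as a sketch (the function wrapper is mine), the mapping on lines 192–193 with a few worked modes looks like this:

/* Slopes are measured in eighths of a pixel per step. For example:
     mode  9 -> leftward, slope  0  (copy the left column horizontally)
     mode 25 -> slope  0            (copy the top row vertically)
     mode 17 -> slope -8            (the 45-degree diagonal)
     mode  1 -> leftward, slope  8  (the opposite diagonal) */
void mode_to_slope( int mode, int *is_leftward, int *slope )
{
    *is_leftward = mode < 17;
    *slope = *is_leftward ? 9 - mode : mode - 25;
}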

Here are the 253 blocks of the built-in image that use the directional prediction modes and the pixels as predicted. The overlaid arrows mark the directions that the previously decoded pixels were propagated along:

directions.png

Notice how some of the blocks with slightly off-angle directions show a nice antialiasing effect along the predicted edges. Let’s take a closer look at one of the blocks from the upper-right side of the image:

lerp.png

The directional prediction code loops over each pixel (194–195) in the block and then extrapolates along the direction vector back to the row of previously decoded pixels above it or to the column of pixels to the left of it (200–211).

Typically this will result in a position in between two pixels (e.g., the upper-left line in the above diagram). It handles this by looping (198) to fetch both of the neighboring pixels to either side and calculates a weight based on how close it is to each of them (201). Then it predicts the pixel as the linearly weighted blend of the two (213–216). Since the slopes of the direction vectors are in steps of 1/8th of a pixel, an integer weight between 0 and 7 with a divisor of 8 suffices. The blending gives an effect like anti-aliasing to the prediction.
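
Pulled out as a sketch, the fixed-point blend on lines 213–216 amounts to this:

/* Blend two neighboring reference pixels with a weight in 0..7.
   Adding 4 (half the divisor of 8) rounds to nearest before the
   shift divides by 8. */
int blend( int first, int second, int weight )
{
    return ( first * ( 8 - weight ) + second * weight + 4 ) >> 3;
}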

If we take both the blocks predicted by the constant DC predictor and the blocks predicted by the directional prediction modes, here is what the predicted pixels contribute to the final image:

predictions.png

Residual Decoding

That actually makes for a surprisingly good approximation. But predicted pixels alone are insufficient. Otherwise we’d just be predicting from the black block in the upper-left corner and the entire image would always be predicted to be black. We need additional data both to prime the prediction and to correct it when it predicts wrongly. The corrected pixels also allow the predictions in future blocks to do a better job (as seen above).

These corrections are called the residuals, and they are stored here not as direct pixel values but as coefficients computed by the discrete cosine transform (DCT), a near relative of the discrete Fourier transform (DFT). The former uses only real numbers and can be computed using the latter, which uses complex numbers.

The DCT is at the heart of the JPEG format and most video codecs. We’ll get into the details of how it works soon, but for now the key things to know are that it is reversible, it is relatively robust against quantization, and that after quantization the data for natural images can be fairly sparse.

Because they can be so sparse, the format uses a form of run-length encoding to store groups of coefficients that have been quantized to zero. Decoding of the coefficients (162–175) starts by zeroing out space in a working buffer, then loops over the non-zero coefficients. For each one, it first decodes a single bit that flags whether the last non-zero coefficient in the block has already been read. Then it reads the number of zero coefficients to skip. Then it reads the quantized level (i.e., the quotient after dividing the original coefficient by the quantization step size). Since the entropy encoding scheme only stores unsigned integers, part of this involves reading a sign bit. Finally, since this is a simple mid-tread uniform quantization scheme with no dead-zone, the coefficient is reconstructed by simply multiplying the level with the quantization step size and then the decoded coefficient is stored in the working buffer.
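
To make that concrete, here is a sketch (mine, not the encoder’s actual code) of the mid-tread quantizer that the decoder implies:

#include <stdlib.h>

/* The encoder rounds each coefficient to the nearest multiple of the
   step size and stores the signed quotient; the decoder multiplies
   back, as on lines 170-174. */
int quantize( int coefficient, int step )
{
    int level = ( abs( coefficient ) + step / 2 ) / step;
    return coefficient < 0 ? -level : level;
}

int reconstruct( int level, int step )
{
    return level * step;   /* e.g. level 14 * step 41 = 574 */
}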

This scheme bears some resemblance to H.263 which encodes the coefficients as (LAST, RUN, LEVEL) triplets. However, the LAST flag indicates whether the current coefficient is the last non-zero entry whereas the equivalent bit in the tiny decoder indicates whether it has already read the last entry. JPEG images also store their DCT coefficients using run–level pairs to indicate the number of zeroes to skip. One key difference is that JPEG and many video codecs serialize the coefficients of the block in a zig-zag order since the non-zero coefficients tend to be in the upper-left (as we’ll see shortly). For simplicity, the tiny decoder just uses a simple row-major ordering.

Note that there are separate quantization step sizes for luma and chroma (174) and recall that these are the image quality factors that were read out of the small header before the block data. Higher numbers simply divide the coefficients more coarsely resulting in smaller levels and more zeroes and therefore a smaller compressed image. Also, for simplicity these quantization step sizes are constant. More complex codecs such as JPEG use the image quality factor more indirectly; each coefficient within a block uses its own divisor looked up from a quantization matrix, and the image quality factor chooses the quantization matrix. Video codecs will often go farther and dynamically adjust the quantization matrices according to the bit rates apportioned to each frame.

Here is a heat-map showing the blocks and the absolute values of the decoded coefficients from the luma plane of the built-in image. It is normalized to the second-largest coefficient (near the middle) in the image which has a level of 14. This produces a coefficient of 574 after multiplying by the luma quantization step size of 41. For comparison, the first-largest coefficient is in the very upper-left and has a level of 29:

dct.png

To see the structure of the coefficients in each block a little more clearly, we can normalize each block by its own largest value. This makes it more evident just how sparse the non-zero coefficients are as well as how they tend to cluster in the upper-left of each block:

dctnorm.png

Inverse DCT

From here, the next step is to take the coefficients and invert the DCT to get the residuals back. But first, we’ll take a look at what the DCT actually does. You can find the math in detail elsewhere; here I’ll simply focus on giving the intuition for it.

Briefly, the two-dimensional DCT converts the block into the weights for a weighted sum of a set of sinusoidal patterns. The coefficients correspond to how strongly each pattern contributes and whether it is inverted. Here is a visualization showing the patterns for the DCT of an 8×8 block; it uses blue for positive values and red for negative values, with dark grey for zero. Note that there are, in fact, 8×8 or 64 of them (one for each entry in the block – larger or smaller blocks have correspondingly more or fewer patterns):

frequencies.png

There’s a clear structure to these patterns (or basis images as they are often called). In each of the patterns in the left-most column, there’s no horizontal variation at all. In the second column from the left, there is a single vertical line through the middle of each where the pattern crosses zero once. In the third column from the left, there are two vertical lines through them where they cross zero and divide the pattern into thirds. Each column adds one more zero-crossing, until the right-most column has a zero-crossing between each pixel. A similar structure holds as we examine the rows of patterns from top to bottom. In other words, for the 2D DCT, every basis pattern is simply the product of a horizontal and a vertical sinusoid.
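
As a sketch (using the same cosine convention as the decoder’s IDCT below), the basis pattern for a given horizontal frequency u and vertical frequency v is just such a product:

#include <math.h>

/* Generate one 8x8 DCT basis pattern: the product of a horizontal
   and a vertical cosine at frequencies u and v. */
void basis_pattern( double pattern[ 8 ][ 8 ], int u, int v )
{
    for ( int y = 0; y < 8; y++ )
        for ( int x = 0; x < 8; x++ )
            pattern[ y ][ x ] =
                cos( acos( -1 ) / 8 * ( x + 0.5 ) * u ) *
                cos( acos( -1 ) / 8 * ( y + 0.5 ) * v );
}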

Each column of patterns corresponds to a particular horizontal frequency and each row to a particular vertical frequency. In each case, they range from the zero frequency on the left and at the top, up to the Nyquist frequency on the right and bottom. This is sometimes called the Nyquist limit and arises because we can’t meaningfully have more zero-crossings than pixels.

Laid out this way, the pattern in the upper-left has no variation at all and is constant throughout. The first coefficient out of the DCT is therefore sometimes called the DC coefficient. Meanwhile, all the other coefficients correspond to patterns with zero-crossings and so are sometimes called AC coefficients.

This is also where the clever trick for DC prediction alluded to earlier comes into play. Rather than looping over the pixels in the block to add the predicted average to them, it is simply added to the DC coefficient (184).

So in short, decoding the pixels from the coefficients is a matter of taking each coefficient, computing the corresponding pattern and weighting it by that coefficient via multiplication, then summing them all together. This is the job of the inverse DCT, or IDCT:

idct.gif

Hopefully, this gives some intuition for why the non-zero coefficients tend to cluster in the upper-left of the block. The low-frequency patterns in the upper-left form a blurry base image while the high-frequency patterns to the right and bottom fill in detail. Most gradients arising in a typical image are fairly smooth and so bear a closer resemblance to the patterns in the upper left than they do to the patterns in the lower-right.

Likewise, this should also give some intuition for why the transform is somewhat robust to quantization. A small change to one coefficient just makes the approximation to the original image a bit less accurate. It can even be tolerant to simply zeroing some coefficients: if we were to zero out a pixel in the regular image, there’d be a noticeable speck where it was missing. But zeroing a coefficient distributes the quantization error throughout the block and affects all pixels alike. This makes it harder for the eye to notice.

There are several variants of the DCT and it turns out that they come in pairs where one is the inverse of the other. The coefficients here were computed using the DCT-II variant, and so the tiny decoder implements the matching DCT-III variant to invert it:

52: R(S,Y,T,Z,U,d,V)int*S,*T;
53: {
54: int G,a,j,H;
55: e(H,d)e(a,d){
56: G=0;
57: e(j,d)G+=T[j*Z+H*U]*lrint(cos(acos(-1)/d*(a+0.5)*j)*sqrt(2-!j)*1024);
58: S[a*Y+H*U]=G+(1<<V-1)>>V;
59: }
60: }
123: void compute_idct( int *samples, int point_stride,
124:                    int *coefficients, int frequency_stride,
125:                    int sequence_stride, int size, int scale )
126: {
127:     for ( int sequence = 0; sequence < size; sequence++ )
128:         for ( int point = 0; point < size; point++ )
129:         {
130:             int sum = 0;
131:             for ( int frequency = 0; frequency < size; frequency++ )
132:                 sum += coefficients[ frequency * frequency_stride +
133:                                      sequence * sequence_stride ] *
134:                     lrint( cos( acos( -1 ) / size *
135:                                 ( point + 0.5 ) * frequency ) *
136:                            sqrt( 2 - !frequency ) * 1024 );
137:             samples[ point * point_stride +
138:                      sequence * sequence_stride ] =
139:                 sum + ( 1 << scale - 1 ) >> scale;
140:         }
141: }
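
For comparison, here is a minimal sketch (mine, not part of the program) of the matching one-dimensional forward DCT-II that an encoder would apply to produce these coefficients; it mirrors the loop above but sums over sample points instead of frequencies, and uses plain floating point in place of the program’s fixed-point scaling:

#include <math.h>

/* One-dimensional forward DCT-II. The same sqrt( 2 - !frequency )
   factor as in the decoder, together with the division by size,
   makes this pair of transforms invert each other exactly. */
void forward_dct( const double *samples, double *coefficients, int size )
{
    for ( int frequency = 0; frequency < size; frequency++ )
    {
        double sum = 0;
        for ( int point = 0; point < size; point++ )
            sum += samples[ point ] *
                cos( acos( -1 ) / size * ( point + 0.5 ) * frequency );
        coefficients[ frequency ] = sum * sqrt( 2 - !frequency ) / size;
    }
}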

Because of the need for cosines, this is the only place in the tiny decoder where floating-point math is used. Everything else uses either integer arithmetic or, as in the directional predictor, fixed-point arithmetic. Only a handful of unique cosines are actually ever computed, however, so everything inside the lrint() subexpression could be replaced with either a 33-element table of integers with symmetry or a 128-element table without symmetry (IDCTs on blocks smaller than 32 could have used the same table by looking up every second, fourth, or eighth entry). If this had been done, the decoder would have needed no floating-point math at all and would run a bit faster. However, storing the table of cosines in the code would have added significantly to its size.

This is also the one place in the tiny decoder’s format where the compression is actually lossy. There is a minor bit of error introduced by rounding the floating point values to integers and the use of fixed-point arithmetic. The DCT can be expressed as a matrix-vector multiplication where the integer cosines form the matrix (and the IDCT just uses the transpose). In fact, HEVC specifies its DCT in terms of this matrix (which lets implementations be bit-exact to the specification), albeit using 6-bit fixed-point that is cheaper in hardware rather than the more precise 10-bit fixed point of the tiny encoder. Looked at this way, the matrices for both are close to orthogonal but not quite (though the tiny encoder with more fixed-point bits has the edge here) meaning that not everything will round-trip through the DCT and IDCT perfectly. But this is dwarfed by the quantization error which is determined by the quantization step sizes.

This function on its own does not actually implement a full two-dimensional IDCT. Instead, it takes a set of strides for indexing and loops (127) to perform a sequence of one-dimensional IDCTs. The first call to it (186–187) does one-dimensional IDCTs on each of the rows in the block, reading the decoded coefficients from the lower part of the working buffer and writing the results into an upper part of the working buffer. The second call (188–189) then reads these back in, performs one-dimensional IDCTs on the columns, and writes the results directly into the correct place into the array that holds the decoded image.

This decomposition of the multi-dimensional DCT or IDCT into a sequence of one-dimensional transforms is a fairly common way to implement the multi-dimensional transform due to its simplicity. Interestingly, it does not matter whether the rows are done first and then the columns (as is done here) or vice-versa. This property is called separability.

As an aside, it is possible to losslessly flip and do 90-degree rotations on square blocks that have been transformed by the DCT without re-transforming. To flip the block horizontally, one just has to negate the coefficients in the odd-numbered columns (i.e., not the column with the DC coefficient). Likewise, to flip the block vertically, one negates the coefficients in the odd-numbered rows. Note that these correspond to the patterns in the image above that have odd symmetry about the vertical or horizontal axis, respectively. Furthermore, transposing the coefficients will transpose the image block. These can be combined to create 90-degree rotations.
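
A sketch of the horizontal flip just described (my own illustration):

/* Flip an n x n block horizontally in the DCT domain by negating the
   coefficients in the odd-numbered columns (column 0, which holds the
   DC coefficient, is untouched). */
void flip_horizontal( int *coefficients, int n )
{
    for ( int row = 0; row < n; row++ )
        for ( int column = 1; column < n; column += 2 )
            coefficients[ row * n + column ] =
                -coefficients[ row * n + column ];
}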

Putting this all together, here is what the recovered residuals from the built-in image look like after the IDCTs. Because they have both positive and negative values, they have been added to a mid-grey image:

residuals.png

If they are instead added to the predicted image from before, the result is the final image as shown at the top of this post. The residuals provide data for the predictors to predict from and the predictors lessen the amount of data that the residuals need to encode. Note that the upper-left block in the residual image here is quite bright because it has no prediction data yet to add to.

Exponential-Golomb Coding

So that completes the higher level part that decodes the lossily compressed pixels and outputs the image. Back in the beginning of this post, I’d mentioned that there was also a lower level entropy encoding scheme under that for the lossless compression of single bits and unsigned integers. We’ll take a look at that now, beginning with the decoding of the integers:

44: n(E){
45: int a,b;
46: a=0;
47: while(!t(E+a))a++;
48: b=1;
49: while(a--)b=b<<1|t(4);
50: return b-1;
51: }
112: int decode_unsigned( int context )
113: {
114:     int length = 0;
115:     while ( !decode_bit( context + length ) )
116:         length++;
117:     int value = 1;
118:     while ( length-- )
119:         value = value << 1 | decode_bit( 4 );
120:     return value - 1;
121: }

First, it’s worth noting that the lossless compression only directly deals with single bits. The code that reads an unsigned integer just sits on top of this, iteratively reading a bit at a time until it has extracted a full integer.

The integers use a form of variable length code called exponential-Golomb coding introduced by Teuhola. Here is how the first sixteen are represented:

Value Encoding
0 1
1 010
2 011
3 00100
4 00101
5 00110
6 00111
7 0001000
8 0001001
9 0001010
10 0001011
11 0001100
12 0001101
13 0001110
14 0001111
15 000010000

In short, each encoding is formed by incrementing the value and then prefixing the result with as many zeroes as there are digits after the leading 1 bit. Consequently, each encoded number always has an odd number of bits with a 1 bit in the center and no 1 bits before that.

This allows smaller integers to be encoded with fewer bits, while also allowing the decoder to always know where the encoding for an integer will end, simply by counting the initial 0 bits.
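
Going the other way, here is a minimal sketch (mine, not part of the program) of the matching encoder; write_bit() is a stand-in output routine that just prints each bit as a character:

#include <stdio.h>

/* Hypothetical bit sink for demonstration. */
static void write_bit( int bit ) { putchar( '0' + bit ); }

/* Exponential-Golomb encode: increment the value, then emit as many
   zeroes as there are bits after the leading 1, followed by the
   incremented value itself, most significant bit first. */
void encode_unsigned( unsigned value )
{
    value++;
    int length = 0;
    while ( ( value >> length ) > 1 )
        length++;
    for ( int i = 0; i < length; i++ )
        write_bit( 0 );
    for ( int i = length; i >= 0; i-- )
        write_bit( value >> i & 1 );
}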

Exponential-Golomb encoding seems to be a fairly popular method in video codecs. It is used in H.264, H.265, Dirac, and even the new AV1.

In the tiny image decoder, it is used to decode integers in four places: the header, the block predictor number, the length of runs of zero coefficients, and the quantized levels of the coefficients themselves.

Despite theoretically allowing arbitrarily large integers, the integers encoded in the built-in image data are not particularly big. The largest two both belong to the header where they store the luma and chroma quantization step sizes, 41 and 82, respectively. After this, the third largest is 37, from a run of 37 zero coefficients. Everything else is between 0 and 33. As a matter of curiosity, here is what that distribution looks like:

integers.gif

The grey line here shows the implied distribution from the bit length of the Exponential-Golomb encoded value. For example, since a 0 value encodes to a single bit, one might expect zeroes to comprise half the integers encoded. Likewise, the encodings for 1 and 2 take three bits each so one might expect them each to comprise 1/8th of the integers and so forth. Where the actual distribution matches this, the encoding is optimal.

Use Integers
Header 4
Predictors 493
Runs 1502
Levels 1502
Total 3501

Collectively, the 3501 integers stored in the built-in image consume 7393 bits from the entropy encoded bitstream for an average of 2.11 bits each. Of these, 2420 of the integers (69% of them) are zeroes requiring only a single bit.

Binary Arithmetic Coding

The next piece is the code responsible for decompressing and returning the individual entropy encoded bits. These are stored using a form of binary arithmetic coding.

The general idea of arithmetic coding is to recursively divide up the 0 to 1 interval into pieces whose size is proportional to the probability of each value that might be encoded. For binary arithmetic coding, there are only two possible values and so the interval is first divided into two (possibly uneven) pieces. Depending on whether we want to encode a 0 or 1 bit, we select either the first or second of the two pieces and then subdivide it again into two pieces whose size depends on the probabilities of the second bit being a 0 or 1. One of these is then chosen and then the process repeats for the third and subsequent bits. In the end, we’re left with an interval with very precise fractions for the lower and upper bounds and we can emit any fraction in between to represent all of the encoded bits.

To decode we just take the initial 0 to 1 interval, and divide it into the same pieces as we would for encoding. But then we compare the fraction that we have to the pieces and see which one it falls into. The value associated with that piece gets emitted as the decoded value. Then we subdivide that piece in turn and see which part the fraction falls into. We keep dividing whichever pieces the fraction falls into until we’ve decoded the whole message.

That’s the theory. In practice, this all leads to very exact numbers which would require arbitrary precision arithmetic to handle. What’s often done instead is to keep only the tail ends of the fractions (represented with simple integer arithmetic) and have the encoder periodically write out the most significant digits as soon as they become settled. Then those digits can be shifted out in order to make room for new least significant digits. This is called renormalization. Correspondingly, the decoder matches this by shifting in new least significant digits that are read from the input.

Let’s look at how the tiny image decoder’s version of an arithmetic decoder works. Its implementation is elegantly simple:

26: t(E){
27: int u,k,F,*m;
28: if(r<s)Q();
29: m=X+E*2;
30: F=*m+m[1]+2;
31: u=r*(*m+1)/F;
32: if(k=p>=u){
33: p-=u;
34: r-=u;
35: }
36: else r=u;
37: m[k]++;
38: if(F>63){
39: *m/=2;
40: m[1]/=2;
41: }
42: return k;
43: }
 88: int decode_bit( int context )
 89: {
 90:     if ( range < radix )
 91:         renormalize_and_read();
 92:     int *counts = models + context * 2;
 93:     int observations = *counts + counts[ 1 ] + 2;
 94:     int split = range * ( *counts + 1 ) / observations;
 95:     int bit = fraction >= split;
 96:     if ( bit )
 97:     {
 98:         fraction -= split;
 99:         range -= split;
100:     }
101:     else
102:         range = split;
103:     counts[ bit ]++;
104:     if ( observations > 63 )
105:     {
106:         *counts /= 2;
107:         counts[ 1 ] /= 2;
108:     }
109:     return bit;
110: }

The range represents the size of the last interval, and the fraction represents the current position within that interval. In other words, the fraction and range represent the numerator and denominator, respectively, for the fraction — relative to the last interval! At each step, the current expected probability that the next bit is a 0 is used to split the current range into the two pieces (92–94). Which piece the fraction falls into determines the actual bit to return (95), and then the other piece is trimmed away from the fraction and range (96–102). This means that at each step, the range gets smaller and smaller. When it gets too small for the fraction and range to have sufficient accuracy, i.e., the range is down to a single digit in our radix (90), then we renormalize by multiplying both the fraction and range by the radix and then read the next digit of input and add it to the fraction (91). The fraction and range are initialized to 0 and 1, respectively, and recall that one of the first things the program does is to renormalize and read the first digit of input (231).

As an example, suppose we have the decimal fraction 0.631 which means that we’re reading in decimal digits, 0 to 9, and so our radix is 10. We’ll also suppose that the expected probability of a 0 bit is constant at 80%, leaving the probability of a 1 bit at 20%. Here’s a diagram showing how the fraction 0.631 decodes to the bit string 001101. Each row shows the state each time through before (96) trimming the empty piece. Renormalization and the reading of each of the three digits occurs before the first, second, and fifth bits:

arithmetic.png
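
Since the state is so small, the trace above is easy to reproduce. Here is a self-contained sketch (mine, not the program’s code) that decodes 0.631 with the same split, trim, and renormalize steps and prints 001101:

#include <stdio.h>

int main( void )
{
    int digits[] = { 6, 3, 1, 0, 0, 0 };   /* the fraction 0.631 */
    int next = 0, fraction = 0, range = 1, radix = 10;
    for ( int i = 0; i < 6; i++ )
    {
        if ( range < radix )   /* renormalize and read a digit */
        {
            range *= radix;
            fraction = fraction * radix + digits[ next++ ];
        }
        int split = range * 4 / 5;   /* fixed 80% chance of a 0 bit */
        int bit = fraction >= split;
        if ( bit ) { fraction -= split; range -= split; }
        else range = split;
        putchar( '0' + bit );
    }
    putchar( '\n' );   /* prints 001101 */
    return 0;
}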

Hopefully it’s clear that if we strongly expect either a 0 bit or a 1 bit and we get it, then the range only decreases by a little bit and so we can go a good while before needing to renormalize and read more input. On the other hand, if we were strongly expecting a bit and we were wrong then the range will be reduced significantly and we’ll very quickly need to renormalize and read again; we’ll effectively have consumed several bits of input to decode a single unexpected bit. For the case where we’re uncertain what the next bit will be and expect either one with about a 50% probability, the range will be roughly halved either way and so we’ll have consumed about one bit of input to decode one bit of output. The number of bits of input consumed to decode a single bit is approximately –log2(P), where P is the probability with which we were expecting that bit. In other words, arithmetic coding can encode a bit at a rate pretty close to its theoretical information content.

Regarding the built-in image, the actual input is in base 1968 for the internal data (230). There are a total of 893 such digits (or symbols) encoded in the string data, so together these hold approximately 9772 bits’ worth of compressed information. These decompress to 12152 bits read by the image layer, so the arithmetic compression saves about 19.59%. Of the decompressed bits, 4759 are decoded as single bits while the remaining 7393 go to exponential-Golomb coded integers.

Context Modeling

Since getting good lossless compression out of the binary arithmetic coder relies on having an accurate estimate of the probability of a 0 or 1 for each bit decoded, the next thing we’ll look at is how it does this.

Each call to decode a bit passes along a context (88). This number corresponds to a particular context “bin” in which it keeps a pair of approximate counts (92), the first for the number of 0 bits it has seen recently for that particular context, and the second for the number of 1 bits.

The naive approach would be to estimate the probability of a 0 bit as the ratio of the number of 0 bits seen to the total number of observations (i.e., the number of both 0 and 1 bits seen). However, this isn’t what it actually does. Instead it adds one to the number of 0 bits seen and adds two to the total number of observations before taking their ratio (93–94)!

This is estimating the probability using Laplace’s rule of succession. Effectively, it starts the counts off by pretending that it’s seen one of each bit. When there are no real counts (and therefore no history) this leads to it estimating a balanced 50% probability for a 0 bit. And even if it has seen all 0 bits or all 1 bits, it will still estimate a small probability that the next bit will be different. This latter property is crucial to arithmetic encoders because if the predicted probability of a value is 0% then it becomes impossible to encode that value.
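
Written out as a sketch, the estimate on lines 93–94 is:

/* Laplace's rule of succession: with n0 zero bits and n1 one bits
   observed, estimate the probability that the next bit is a 0.
   With no history this is 1/2, and it can never reach 0 or 1. */
double probability_of_zero( int n0, int n1 )
{
    return ( n0 + 1.0 ) / ( n0 + n1 + 2.0 );
}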

Once the bit has been decoded, the count for that bit is incremented (103).

Periodically, when the total of the two counts gets large enough, the current counts are halved (104–108). This is periodic scaling, and it has the nice benefit of making recently decoded bits have a greater effect on the probability estimate than older bits, since the effect of the previous bits is halved with each scaling. With this scheme, the probability acts a bit like an exponentially weighted average of the decoded bits. More importantly, it allows the estimate to adapt to local changes in the distribution. Imagine that we’d seen a thousand 0 bits followed by fifty 1 bits. It would likely be better to predict another 1 bit next rather than a 0 bit, despite the fact that we’ve seen far more of the former overall.

For the context modeling of the exponential-Golomb coded integers, a scheme called “binarization” is used where the code is broken into its individual bits and different ones may be assigned to different bins. In the tiny image decoder, an initial context bin is passed to decode_unsigned() and that bin plus the subsequent ones are used for each bit of the prefix that determines the bit length of the integer (115). For the actual bits of the integer values, however, bin 4 is always used (119).

The tiny decoder’s context modeling has space for 166 counts to store 83 context bins. Here is how those context bins are assigned and used within the built-in image and the three accompanying image files:

Context Purpose Line
0 Split 8x8 blocks? 145
1 Split 16x16 blocks? 145
2 Split 32x32 blocks? 145
3 Negative DCT coefficient? 169
4 Unsigned integer values 119
5–8 Image size 232
5–10 Luma quantization 235
5–11 Chroma quantization 236
5–13 Image height 234
5–13 Luma DCT zero run 168
15–23 Chroma DCT zero run 168
25–29 High-freq. luma DCT coefficient 172
35–38 High-freq. chroma DCT coefficient 172
45–52 Low-freq. luma DCT coefficient 172
55–60 Low-freq. chroma DCT coefficient 172
65 End of 4x4 block luma DCT? 166
66 End of 4x4 block chroma DCT? 166
67 End of 8x8 block luma DCT? 166
68 End of 8x8 block chroma DCT? 166
69 End of 16x16 block luma DCT? 166
70 End of 16x16 block chroma DCT? 166
71 End of 32x32 block luma DCT? 166
72 End of 32x32 block chroma DCT? 166
73–78 Prediction mode 157

Note that there are gaps and not all of the context bins are actually used by these four images in practice. The ones in the 5–65 range potentially could be used by other images encoded for this program. However, since there are only 34 prediction modes, it doesn’t seem like context bins 79 through 82 could ever be used. I suspect that this was an oversight.

So how effective is all this at predicting the bits to be decoded? If we define a correct prediction as estimating the next bit to decode as having a 50% or greater probability (and so consuming less than a bit of input to decode it), and a misprediction as estimating the next bit as having less than a 50% probability (and therefore costing more than a bit of input to decode), then on the 12152 bits of the built-in image it correctly predicts a 0 bit 3465 out of 5364 times (64.6%) and a 1 bit 5138 out of 6788 times (75.7%).

We can break this down further into deciles of the estimated probability. The worst possible probability that this implementation can estimate a bit at is 1/64, or 1.56%, which will cost about –log2(1/64) = 6 bits of input to decode one bit of output. On the other end, the best possible probability is 63/64 or 98.44%, which will cost only about –log2(63/64) ≈ 0.023 bits of input:

modelling.png

Input Symbols

Finally, we come to the last function. This is the function that the binary arithmetic coder calls to renormalize by multiplying both the fraction and range by the radix and then read the next digit (or symbol) of input and add it to the fraction:

 9: Q(){
10: int b;
11: r*=s;
12: p*=s;
13: if(s>>9){
14: if(b=*x){
15: p+=b-1-(b>10)-(b>13)-(b>34)-(b>92)<<4;
16: b=x[1];
17: x+=2;
18: p+=b<33?(b^8)*2%5:(b^6)%3*4+(*x++^8)*2%5+4;
19: }
20: }
21: else{
22: b=getchar();
23: p+=b<0?0:b;
24: }
25: }
63: void renormalize_and_read()
64: {
65:     range *= radix;
66:     fraction *= radix;
67:     if ( radix >> 9 )
68:     {
69:         int byte;
70:         if ( byte = *lena )
71:         {
72:             fraction += byte - 1 -
73:                 ( byte > 10 ) - ( byte > 13 ) -
74:                 ( byte > 34 ) - ( byte > 92 ) << 4;
75:             byte = lena[ 1 ];
76:             lena += 2;
77:             fraction += byte < 33 ? ( byte ^ 8 ) * 2 % 5 :
78:                 ( byte ^ 6 ) % 3 * 4 + ( *lena++ ^ 8 ) * 2 % 5 + 4;
79:         }
80:     }
81:     else
82:     {
83:         int byte = getchar();
84:         fraction += byte < 0 ? 0 : byte;
85:     }
86: }

The renormalization (65–66) is straightforward, as is the code to read the next digit of input from an external image. Recall that for an external file (230), the radix is 256 so it simply reads in and adds a full byte’s worth of data (83–84). Calls to read past the end of the file are treated as reading zeroes.

The handling of the data for the internal image is rather more complex. The symbols here are stored in a string using a base 1968 system (230). Each symbol is read using either two or three characters from this string.

The first character of the symbol encodes a value between 0 and 122 using the entire 7-bit character set except for '\0', '\n', '\r', '"', and '\\' (72–74). This is shifted into the upper bits of the symbol’s value.

The second character of the symbol dictates whether there will be a third character read and together these determine the lower nybble of the symbol’s value (75–78). While this code allows for many possible character sequences that could be used to represent a given nybble, here are the sequences actually used in the string data for each nybble:

Sequence   Nybble
  ' ' (32) 0
  '\v' (11) 1
  '\t' (9) 2
  '\f' (12) 3
'}' (125) ' ' (32) 4
'}' (125) '\v' (11) 5
'}' (125) '\t' (9) 6
'}' (125) '\f' (12) 7
';' (59) ' ' (32) 8
';' (59) '\v' (11) 9
';' (59) '\t' (9) 10
';' (59) '\f' (12) 11
'{' (123) ' ' (32) 12
'{' (123) '\v' (11) 13
'{' (123) '\t' (9) 14
'{' (123) '\f' (12) 15

Together, the upper and lower bits allow for 123×16=1968 possible values, hence the unusual radix for the internal data.

But why such a peculiar and verbose encoding for the lower nybble? The answer to that likely has to do with the rules of the IOCCC at the time:

Your entry must be <= 4096 bytes in length. The number of octets excluding whitespace (tab, space, newline, form-feed, return), and excluding any ’;’, ’{’ or ’}’ followed by whitespace or end of file, and where C reserved words, including a subset of preprocessor words, count as 1 byte, must be <= 2053 (first prime after 2048).

This entry occupies a total of 3984 bytes, while its total size as measured by the secondary count is 2051 – just under the limit of 2053. The encoding of the lower nybbles makes them mostly “free” under this second count; of the 893 symbols stored in the 2439 bytes of string data, the IOCCC size tool measures the size of the string as 860 by this secondary count.

If we consider the number of raw bytes that must be read from the string when decoding each block of the internal image, then the cheapest block consumes an average of 0 bits from the string per pixel while the most expensive block clocks in at 7.5 bits per pixel. The average over the whole image is 1.191. If instead we look past the funky string encoding of the symbols and measure each symbol as being worth log2(1968) bits and count the current range value as a partially consumed symbol worth log2(1968² / range) bits, then the cheapest block requires just 0.0365 bits per pixel or so of storage, while the most expensive block requires about 3.4882 bits per pixel. The average over the whole image by this measure is 0.596 bits per pixel. Here is another heat-map showing proportionally how those bits are allocated:

density.png

Globals

The very last bit of code to see here is mainly just the global variables, including the string with the encoded image data. Nothing here should be surprising at this point. You can clearly see all of the braces, semicolons, and whitespace encoding the lower nybbles of each symbol in the string data:

1: #include <math.h>
2: #define e(a, d) for(a=0;a<d;a++)
3: char*x="    {k/;  y{ q ; }    c {    @;  ={     S}  c}  W;;    {4}   k |; w{   +9;{; 8; 9{   S;  /}  y{ K}   {;}   l{    { ~{ ;    V}k}g< t{    E   v;M{ B}y}   <{7;/;  Y} t}kp; Y} $Ha{e} w};} R} /{>}a    ;} ;    `   $W-}  D}B; e;f;*;    ~;A;s O{  o;>{1; m{ `} R}]{ T} v}={ I} ; }a?&; A}$;W;R{u} `; j}W;s{e}    A;[    R;  X  P; 4 ,F;({<8{#;%}@J{)} }o^*{u/{'}]{    *}    }  ;{ r}    f   /;}e} }w{ ${{;,; @ d  $}];>(}  I{ d}   &;  U}  {  y;Y}   { P{   R} T}_{ }R } l  { T}"
4: "'; |; ${=}  H} (}}8{cp{ s} #}+}   3}kF}<H     .{ }G}x;    r   D c{; W; { b;6; k{}B;*};    ]} ~    { ;;} !}}  x}v}n;^; 6V}Y{ h; ~    %*}! H; G{ r{ f;Y{ i}z} N  %}.{;    (  v} _}   h; 7;<}    ^;Z;0; ; <;<; M; N{    }   _{O} !{f{]{M{;A{}    0;S}${    @;x}y}@  L;1    t{ 3{c{s{_{  `{  D{ ]}!;    ${  _J;v+ } 3{B; ]{  }  E6 .x{?+; {x; }v{$};6}T; O; ; (}X7} j; @} :}#  c{ !{ }x  KXt} >; ?{ c; ;  W;  ; l;} h}p}  i{ %    }P}   /{  *}    %L; ; !{ S{ n} "
5: "x;  { 1  J;v{   U}({   @ X{ k} H;4;e J   6;;v; G{{]    &{A d{ lM{;K;;    4-{}} p h{; {  rW;    v{;   f}  }1{^&{9{{ ;~;n;q{    9 R   6{  { u;a; ;  U;  ;Y}   +}}2sk; 8  {  JK;'i;  ;$;    W{  P!{{{P    } [;   (;Q; Un;+}g{C;{{ ; <{   vS} b;6`} ?{+   %;  }n;q{ r}k; ;{c{ S} 2}~{    4;RW v} R;  kI}|; d; [ O}5; ;;}Z d   { {&;h    o{ V    v ;    _{{/}  F{f{r{4{{?{ 4;S}  :;];E}  ;  &} #e !{>{H; {O{ 0;} H;  p; w}>{1}{  -} 4;"
6: "S}}  u L{ y} %;2  |{(}    /;,{ )}Y;g}    G}v;T}    };}i {{};[{ E{q} g;T{ ={}R;   k{ j;_;h}gPc;({   F;6}   }} 3    ,}<; 0  P;{'t}u};      }U}s{8{ E} >{}E {G{H :{  Yog}   }F  D{ R{     -;M?;= q}_ U  { ;    I   { |{{}   1{,}{ x{{ U{ s;J}}    6{>7;,{ D{  {{ ;]}    ;M; &}{ V}  n{&  T~;({ }[;   r{#    u{X 9;L; Uf})}   {T}     p{  N;  >{  >   }}D} m{1{  {}X; o}   w}$}   ^v} K  f  ,}  ^3; { @{_} _{    o;  4}  h}H;#.{  {}  ;  <{ {G{ $;{ "
7: "z {a{{D;  ?|}{{ ;   `} } Q}j;4}   3{Q}   {  * ;}r{a}   } R{p @;  N{ {f; A;8}L   $}{ }}J{ }    k{r} { [; -;p{ I{ { &}J;   T}  ?{Z{>;    5>; ];  wz ^}    u;);   H} ; L &;  V   E{1{g;C} V} ~;U; ^{   J; { /}    {;(}y} aK /}   .};K;N{w{ `{   }T{l`; #;N{lX; ?; +}{     w{  ;  q;  z;_;y} 8}    &{X}   V{ WG} ,; [}U{  v{  Q;  w{    [ Y}N    Yu i{ {!A{}{ b0;   X~} ;-; 8{   E }  ;F{   y{}{ ";
8: int y[3][1<<20],i[2048],X[166],p,r=1,f,s,O,P;
 1: #include <math.h>
 2: #include <stdio.h>
 3: 
 4: char *lena =
 5:     "   {k/;  y{ q ; }    c {    @;  ={     S}  c}  W;"
 6:     ";  {4}   k |; w{   +9;{; 8; 9{   S;  /}  y{ K}   {"
 7:     ";} l{    { ~{ ;    V}k}g< t{    E   v;M{ B}y}   <{7;"
 8:     "/;   Y} t}kp; Y} $Ha{e} w};} R} /{>}a    ;} "
 9:     "; `   $W-}  D}B; e;f;*;    ~;A;s O{  o;>{1; m{ `} "
10:     "R}]{ T} v}={ I} ; }a?&; A}$;W;R{u} `; j}W;s{e}  A;"
11:     "[  R;  X  P; 4 ,F;({<8{#;%}@J{)} }o^*{u/{'}"
12:     "]{ *}    }  ;{ r}    f   /;}e} }w{ ${{;,; @ "
13:     "d  $}];>(}  I{ d}   &;  U}  {  y;Y}   { P{   R} T}"
14:     "_{ }R } l   { T}"
15:     "'; |; ${=}  H} (}}8{cp{ s} #}+}   3}kF}<H     .{ }"
16:     "G}x;  r   D c{; W; { b;6; k{}B;*};    ]} ~    { ;"
17:     ";} !}}   x}v}n;^; 6V}Y{ h; ~    %*}! H; G{ r{ f;Y{ i}z} N "
18:     " %}.{; (  v} _}   h; 7;<}    ^;Z;0; ; <;<; M; N{"
19:     "   }   _{O} !{f{]{M{;A{}    0;S}${    @;x}y}"
20:     "@  L;1    t{ 3{c{s{_{  `{  D{ ]}!;    ${  _J;v+ "
21:     "} 3{B; ]{  }  E6 .x{?+; {x; }v{$};6}T; O; "
22:     "; (}X7} j; @} :}#  c{ !{ }x  KXt} >; ?{ c; ;  "
23:     "W; ; l;} h}p}  i{ %    }P}   /{  *}    %L; ; !{ S{ n} "
24:     "x;  { 1  J;v{   U}({   @ X{ k} H;4;e J   6;;v; G{{"
25:     "]  &{A d{ lM{;K;;    4-{}} p h{; {  rW;    v{;   "
26:     "f} }1{^&{9{{ ;~;n;q{    9 R   6{  { u;a; ;  "
27:     "U; ;Y}   +}}2sk; 8  {  JK;'i;  ;$;    W{  P!{"
28:     "{{P    } [;   (;Q; Un;+}g{C;{{ ; <{   vS} b;6`"
29:     "} ?{+    %;  }n;q{ r}k; ;{c{ S} 2}~{    4;RW v} "
30:     "R; kI}|; d; [ O}5; ;;}Z d   { {&;h    o{ V    v "
31:     "; _{{/}  F{f{r{4{{?{ 4;S}  :;];E}  ;  &} #e "
32:     "!{>{H; {O{ 0;} H;    p; w}>{1}{  -} 4;"
33:     "S}}  u L{ y} %;2  |{(}    /;,{ )}Y;g}    G}v;"
34:     "T} };}i {{};[{ E{q} g;T{ ={}R;   k{ j;_;h}gPc;({   F;"
35:     "6} }} 3    ,}<; 0  P;{'t}u};      }U}s{8{ E} >{}"
36:     "E  {G{H :{  Yog}   }F  D{ R{     -;M?;= q}_ U  { "
37:     ";   I   { |{{}   1{,}{ x{{ U{ s;J}}    6{>7"
38:     ";,{ D{    {{ ;]}    ;M; &}{ V}  n{&  T~;({ }[;   r{"
39:     "#  u{X 9;L; Uf})}   {T}     p{  N;  >{  "
40:     ">  }}D} m{1{  {}X; o}   w}$}   ^v} K  f  "
41:     ",} ^3; { @{_} _{    o;  4}  h}H;#.{  {}  "
42:     "; <{ {G{ $;{ "
43:     "z {a{{D;  ?|}{{ ;   `} } Q}j;4}   3{Q}   "
44:     "{ * ;}r{a}   } R{p @;  N{ {f; A;8}L   $}{ }}J{ "
45:     "} k{r} { [; -;p{ I{ { &}J;   T}  ?{Z{>;    5>; "
46:     "];  wz ^}    u;);   H} ; L &;  V   E{1{g;C} "
47:     "V} ~;U; ^{    J; { /}    {;(}y} aK /}   .};K;N{w{ "
48:     "`{ }T{l`; #;N{lX; ?; +}{     w{  ;  q;  z;_;"
49:     "y} 8}     &{X}   V{ WG} ,; [}U{  v{  Q;  w{    ["
50:     "  Y}N    Yu i{ {!A{}{ b0;   X~} ;-; 8{   E }  ;F{   y{}"
51:     "{ ";
52: 
53: int image[ 3 ][ 1 << 20 ];
54: int buffer[ 2048 ];
55: int models[ 166 ];
56: int fraction;
57: int range = 1;
58: int width;
59: int radix;
60: int luma;
61: int chroma;

The static allocation for the image planes limits the maximum image size to 1024×1024, which is 2²⁰. The working buffer for the DCT input and output is sized to twice the maximum block size of 32×32. And of course, the models hold space for 83 pairs of counts though the last four are unused.

Next

Whew! That was a lot to cover!

If you’d like to download a complete copy of my deobfuscated and expanded version of the tiny image decoder to examine or play with, you can find it here.

Now that we’ve seen how all the different layers of decoding work, it should be just a small matter of programming to apply them in reverse to encode an image, right? I’ve done just that, and in a future post I will present my encoder and explain the things that I learned while writing it. But for now, here’s a teaser file with an image containing some archways. The image is 512×326 pixels and I’ve compressed it down to under 4KB of data (for a mere 0.192 bits per pixel). See if you can recognize it! Try decoding it to a PPM image with:

./prog d < archways.bin > archways.ppm

Many thanks to Fabrice Bellard for reviewing an earlier draft of this post and reminding me about fully integer implementations of the DCT and IDCT.
