Converting Color Depth

Recently I wrote a program to perform image processing on a PNG image. Given that the input represented color in each channel with an unsigned byte and I wanted to process it as a floating point value in the 0.0 to 1.0 range, I wrote the code to divide by 255.0 on input and then multiply by 255.0 and truncate on output. This mostly works and seems to be the usual approach that people take. I’ve done it many times before but it’s always left me a bit uneasy. Is there a better way? I think there is.

Expressed in code, the essence of the conversions was this:

unsigned char u8 = read_input();
float f32 = u8 / 255.0f;
// ...
// Image processing on f32
// ...
u8 = (int) ( 255.0f * f32 );
write_output( u8 );

Note that I’m not considering gamma, color spaces, color systems, gamuts, or any such deep topics here, just converting between 8-bit integer and 32-bit floating point color depths. I’ve also omitted clamping the final integer to the 0 to 255 range here since I didn’t require it.

The Problem

Why do I dislike this? There are quite a few reasons. First, the divisor and multiplier are both 255. An unsigned byte can express 256 different steps, not 255, so it feels like we should be using 256 instead. The fraction 1/255 can’t be exactly represented as a floating point number, which means that any input value except 0 or 255 maps to an inexact floating point number. Granted, any byte of input will normally round-trip back to the same byte for output thanks to the magic of floating point rounding, but that can be a finicky thing and I’d rather not have to rely on it.

Secondly, there’s an asymmetry here. The image processing step could add small values, up to but not including the fraction 1/255, without affecting the output. However, even the smallest decrease will shift the output byte down a step. This approach has a bias towards black, and repeated image processing steps on a file are likely to lose energy and darken over time.
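To make the asymmetry concrete, here is a minimal sketch of the conventional conversions. The function names, including the helper nudge_255, are my own and purely illustrative. It takes a byte, converts it to float, applies a small offset standing in for an image processing step, and converts back:

```c
/* Conventional conversions from above: divide by 255 on input,
   multiply by 255 and truncate on output. */
static float u8_to_f32_255(unsigned char u8)
{
    return u8 / 255.0f;
}

static unsigned char f32_to_u8_255(float f32)
{
    /* The cast to int truncates toward zero. */
    return (unsigned char)(int)(255.0f * f32);
}

/* Hypothetical helper: convert a byte to float, add a small offset
   (a stand-in for an image processing step), and convert back. */
static unsigned char nudge_255(unsigned char u8, float delta)
{
    return f32_to_u8_255(u8_to_f32_255(u8) + delta);
}
```

With these definitions, nudge_255(100, 0.003f) still yields 100, while nudge_255(100, -0.001f) already drops to 99, even though the negative offset is the smaller of the two.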

Thirdly, while this works for floats, it doesn’t generalize particularly well to converting to other fixed-point depths. Consider converting between 8 bits and 10 bits per channel. The closest analogues to the byte and floating point conversions would look like this:

u10 = u8 * 1023 / 255;
u8 = u10 * 255 / 1023;

This maps 0 to 0 and back, and it maps 255 to 1023 and back. But most of the intermediate values do not round-trip properly.
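A quick way to see how bad this is would be to count the failures. The sketch below (the function name is my own) runs every 8-bit value through the two integer conversions above and counts how many come back changed:

```c
/* Counts how many 8-bit values fail to survive the
   8 -> 10 -> 8 bit conversion shown above. */
static int count_roundtrip_failures_255_1023(void)
{
    int failures = 0;
    for (int u8 = 0; u8 <= 255; ++u8) {
        int u10  = u8 * 1023 / 255;   /* integer division truncates */
        int back = u10 * 255 / 1023;  /* and truncates again here   */
        if (back != u8)
            ++failures;
    }
    return failures;
}
```

Working through the arithmetic, this should report 252 failures: only 0, 85, 170 and 255 scale to an exact 10-bit value, and every other byte comes back one step darker.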

What are some alternatives? We could use 256 as the divisor and multiplier instead. This solves the problem of inexact floating point values and generalizes for other fixed-point depths, but the asymmetry will remain. Another option is to round instead of truncate. This is better and avoids some of the concerns about the asymmetry but it seems a little weird that any floating point value between the fractions -1/510 and 511/510 will map to a correct byte without clamping.

What Are the Coordinates of a Color?

In terms of spatial coordinates, Heckbert argued for a dualist view where a pixelated image is created by point sampling a continuous function at the pixel centers. This requires a mapping between discrete coordinates, such as the pixel coordinates in a 2D array, and continuous coordinates. Of the various possible mappings, the one that he recommended (where c was the continuous coordinate and d was the discrete coordinate) was:

d = floor(c)

c = d + 0.5
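In C, that pair of mappings might be sketched as follows. The function names are my own, and the floor is written out by hand so that negative coordinates also round down rather than toward zero:

```c
/* Discrete index -> continuous coordinate at the cell center. */
static double discrete_to_continuous(int d)
{
    return d + 0.5;
}

/* Continuous coordinate -> discrete index, i.e. floor(c). */
static int continuous_to_discrete(double c)
{
    int d = (int)c;              /* the cast truncates toward zero */
    return (c < d) ? d - 1 : d;  /* step down for negative fractions */
}
```

Every discrete index maps to the center of its unit-wide cell, and any continuous coordinate inside that cell maps back to the same index.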

What if we apply this idea to colors? After all, most color spaces are continuous and an 8-bit per channel RGB color is just a discrete point within a continuous color cube. By analogy to the pixel coordinates mapping, our color coordinate mapping would be:

f32 = ( u8 + 0.5f ) / 256.0f;
u8 = (int) ( 256.0f * f32 );

This has some nice properties. For one, we now use 256 as the divisor, so the bytes map to exactly representable fractional values in floating point. The asymmetry is also gone now: for each floating point value converted from an input byte, we can add any number between the fractions -1/512 and 1/512 (exclusive) while still mapping back to the same byte. Also, the range of floating point values that maps to 0 to 255 without clamping is now the unit interval, from 0.0 up to but not including 1.0. Lastly, it also works cleanly for mapping between different fixed-point depths. For example, for 8 bits and 10 bits the analogue is:

u10 = 4 * u8 + 2;
u8 = u10 / 4;
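The same pattern should extend to any pair of fixed-point depths. The sketch below is my own generalization, not something from a library: widen_depth and narrow_depth are hypothetical names, and it assumes the wider depth has at least one more bit than the narrower one and that both fit in an unsigned int.

```c
/* Widen a value from from_bits to to_bits (to_bits > from_bits) by
   placing it at the center of its bucket in the wider range.
   For 8 -> 10 bits this is 4 * value + 2. */
static unsigned widen_depth(unsigned value, int from_bits, int to_bits)
{
    int shift = to_bits - from_bits;   /* assumed to be >= 1 */
    return (value << shift) + (1u << (shift - 1));
}

/* Narrow a value from from_bits to to_bits (from_bits > to_bits) by
   dropping the low bits. For 10 -> 8 bits this is value / 4. */
static unsigned narrow_depth(unsigned value, int from_bits, int to_bits)
{
    return value >> (from_bits - to_bits);
}
```

For example, widen_depth(255, 8, 10) gives 1022 and narrow_depth(1022, 10, 8) gives 255 again, and the same holds for every other byte.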

The only real downside is that the darkest value in the lower precision representation no longer maps to the darkest value in the higher precision representation, and similarly for the lightest values. This makes some sense, however. The original value may not have been the actual darkest or lightest either, but merely got quantized to it. The slightly higher and lower values, respectively, represent this inherent uncertainty. They are still the best guess for what the original value might have been before quantization.
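For the float case, a small sketch of the center-based conversions (function names are mine) makes the endpoints easy to inspect. It assumes the float handed to f32_to_u8 lies in the range 0.0 up to but not including 1.0, so no clamping is done:

```c
/* Byte -> float at the center of the byte's 1/256-wide bucket. */
static float u8_to_f32(unsigned char u8)
{
    return (u8 + 0.5f) / 256.0f;
}

/* Float -> byte by truncation. Assumes 0.0 <= f32 < 1.0. */
static unsigned char f32_to_u8(float f32)
{
    return (unsigned char)(int)(256.0f * f32);
}
```

Here u8_to_f32(0) is 1/512 rather than 0.0 and u8_to_f32(255) is 511/512 rather than 1.0, which is the downside described above. In exchange, every byte round-trips exactly and tolerates offsets of just under 1/512 in either direction.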

Another great benefit to this mapping is that it works very nicely with dithering. We can add a small random value before output (note that this now requires clamping):

u8 = (int) ( 256.0f * f32 + drand48() - 0.5f );
write_output( u8 );

This bit of dithering can help break up banding where our image processing step changed things, but has no effect where it didn’t. If there was no image processing at all, we’ll simply get the original image back byte-for-byte! Another possibility is to use the same dither for all channels, e.g.:

float dither = drand48() - 0.5f;
r_u8 = (int) ( 256.0f * r_f32 + dither );
g_u8 = (int) ( 256.0f * g_f32 + dither );
b_u8 = (int) ( 256.0f * b_f32 + dither );
write_output( r_u8, g_u8, b_u8 );

This gives dithering where the image changed and the original colors where it didn’t, but now the dither avoids color shifts. Greys will be dithered but will remain pure grey.
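Putting the dithering and the clamping together might look something like the sketch below. All of the names are my own. I've used rand() as a portable stand-in for drand48(), and kept the dither in double precision so that an offset just below 0.5 doesn't get rounded up to exactly 0.5.

```c
#include <stdlib.h>

/* Dither offset in [-0.5, 0.5). rand() is a stand-in for drand48(). */
static double dither_offset(void)
{
    return rand() / ((double)RAND_MAX + 1.0) - 0.5;
}

/* Float -> byte with a dither offset, clamped to 0..255. */
static unsigned char quantize_dithered(float f32, double dither)
{
    double v = 256.0 * f32 + dither;
    if (!(v >= 0.0))
        return 0;                 /* below range (also catches NaN) */
    if (v >= 256.0)
        return 255;               /* above range */
    return (unsigned char)v;      /* truncates toward zero */
}

typedef struct { unsigned char r, g, b; } rgb8;

/* Quantize all three channels with the same dither offset so that
   equal channel values stay equal. */
static rgb8 quantize_rgb_shared_dither(float r, float g, float b,
                                       double dither)
{
    rgb8 out = {
        quantize_dithered(r, dither),
        quantize_dithered(g, dither),
        quantize_dithered(b, dither)
    };
    return out;
}
```

A caller would draw one offset per pixel, e.g. quantize_rgb_shared_dither(r, g, b, dither_offset()). Floats that came straight from an input byte land back on that byte for any offset in the range, while out-of-range results are clamped rather than wrapped.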

So that’s how I plan to convert between different color depths from now on. I hope that I’ve convinced you to do the same.