Suppose we want to send a message to a friend using as few bits as possible… And let’s also suppose that this message consists of unit-normally-distributed real numbers. Now, it turns out that it’s really really expensive to send arbitrary precision real numbers: You could find yourself sending some not-previously-known-to-man transcendental number, and just have to keep listing off digits forever. So, in reality, we allow some amount of noise tolerance in the encoding.
This is a good problem to sink some effort into: Lots of data is roughly normally distributed, and if you’re sending some machine learned representation, you can introduce some regularization to ensure that the data is close to unit-normal. (This is, for example, the whole idea behind Variational Auto Encoders.)
I’ll start with a quick outline of a standard solution, and then we’ll go into a half-baked idea involving Fibonacci codes and space filling curves.
The basic idea comes in two parts:
- First, Fibonacci coding lets you encode an arbitrarily large integer, without having to send the length of the integer before hand. (There’s a bunch of similar schemes you can use here, but the Fibs are fun and interesting.)
- Second, map the integers into the reals in a way that covers the normal distribution in a ‘progressive’ way, so that we can use small numbers if we don’t need high precision, and only use bigger numbers as we need more precision.
At the end we’ll run a few empirical tests and see how well the scheme works, and point to where there’s room for improvement.
A Boring Solution.
A standard solution looks something like this: We know we’re roughly normally distributed, so let’s define some endpoints (beyond which we’ll clip), say, from -4 to 4, and then evenly divide that range into equally-sized cells with width (tol / 2). Then we just associate any number with the cell that it sits in. This changes the continuous data transmission problem into a discrete problem: We just need to send an identifier for the cell.
Bit encoding: An easy approach, then, is to just associate each cell with an integer, and send over the integers… If there’s 256 cells, we can use 8-bit integers, so we use 8 bits per value we want to send. (Note: this means our cell width is 8/256 = 1/32. And thus the tolerance is 1/64 ~= 0.016.)
We can do better though: Entropy coding (for which there are many schemes) lets us use less than 8 bits for the more common cells, and more than 8 bits for the least common cells. Since we said the data is normally distributed, we can save some bits this way.
According to Shannon’s coding theorem, an optimal code will use -log2(p) bits for a cell appearing with probability p. The expected cost of sending a number, then, is the sum over all cells of the probability of the cell multiplied by the negative log probability of the cell: -p * log2(p). Which, of course, is the entropy of the probability distribution on the cells.
Since we said the data is normally distributed, a cell’s probability is easy to compute using the CDF of the normal distribution: it’s just cdf(b) – cdf(a). For our example with 256 cells, we wind up needing a bit more than 7 bits per number sent, on average. (In practice, we need to actually turn the cells into /bits/, however, and will use more than the raw entropy score would indicate.)
Is that the end of the story?
Ok, all well and good… But it seems there’s maybe room for improvement.
First, there’s that nasty business of clipping at +/-4… We don’t /actually/ fulfill the tolerance contract, because there’s a very small probability of pulling an 8 from our normal distribution. It’s unlikely, but we’ll be in violation when it finally happens.
Second, all of our cells are the same size and we have a specific number of cells… Those aren’t requirements, though.
Third, it’s also often the case that we can let the tolerance change over time. In MP3 coding, the main trick is breaking the audio up into different frequency bands, and then making somewhat-complicated decisions about the tolerance for each band, mostly based on how loud the band is compared to all the other bands. So, in such a scheme, you need a way to dynamically change precision over time. This generally involves sending some extra information.
Well, here’s an alternative scheme that tries to address a bunch of these at once, using a universal code and a space filling curve.
The first tool we’ll use is Fibonacci Coding. (Actually, any universal code will do. I just happen to be partial to the Fibs.) The Fibonacci Code is a way to write down any positive integer as a bit sequence with no adjacent 1’s, except at the very end.
For some intuition, you can think of decimal numbers like 65 as 6 * 101 + 5 * 100. Likewise, in boolean 65 = 1000001, which is 1 * 64 + 1. You can write any number as a sum of powers of two: the boolean representation just flips ‘on’ the bits for all the powers of two which appear in the sum.
Fibonacci encoding is just like the boolean encoding, but using Fibonacci numbers instead of powers of two. This works because as we get bigger Fibonacci’s, they grow exponentially according to the golden ratio. So the Fibonacci encoding is a bit using like “Base Golden-Ratio” instead of Base 2.
And the really nice property of this encoding of integers is that I don’t need to tell you how long the integer is before I start sending it! All integers end with a special ’11’ symbol, but otherwise never contain ’11’ symbols.
Space Filling Curve, or (really) Line Filling Dots.
As long-time readers of this blog will know, a space-filling curve is a fractal that fills some space. The Hilbert Curve is an old favorite. In my case, I’m really going to want something a bit lower-dimensional: A way to map all of the integers into the real numbers in a ‘dense’ way, so that for any real number and tolerance I choose, i can find some integer close enough to the chosen real number. Then, to send the real number at some particular tolerance, I just find a close-enough integer, convert it to the Fibonacci code, and send it across the wire.
Those of you who know about dense sets will know that for any tolerance you choose, there will end up being infinitely many integers within the chosen tolerance. So really, we want to find the cheapest (ie, smallest) integer within the chosen tolerance.
But we still need a smart way to map all of the integers into the real line. A really good choice will have the property that we usually find a small integer close by… The smallest integers should then be spread out over a fairly wide range, so that when the tolerance is high, we can easily find a small integer nearby, and (thus) a cheap encoding. And then bigger integers should progressively fill in the gaps, but similarly be ‘readily available’ for medium tolerance encodings.
But how to place these integers concretely? Well, we’re looking to minimize mean distortion, so we can proceed greedily… Place the first point at 0. Now we have two infinite regions where we would like to place the next points.
But where? We can simply try to find the point in each region that minimizes the mean distortion, were we to code with all of the points chosen so far. This process doesn’t seem to have a nice analytical solution, but it’s not toooo hard to code up a numerical approximation. Here’s what it looks like:
The pattern is pretty, but it has the drawback of being pretty slow to compute where these points actually are. This would be a problem for actual use: one could use a look-up table for the first few thousand points, and then, if you ever need something further, need to expend a lot of effort finding additional points. A closed form approximation would be desirable.
To code a number x with tolerance tol, we find the index of the ‘cheapest’ (ie, earliest) point within the target tolerance. We then find the Fibonacci code for the index, and send those bits across the wire. The decoder receives the Fibonacci number, and then finds the point corresponding to it in their own big list of points.
It’s pretty easy to code this up and run some small experiments to poke at the quality of the coding.
Let’s set a tolerance of 1/2k and sample a few thousand numbers from a normal distribution, then turn each into one of our Fibonacci numbers, and then count the mean number of bits per coded number. For tolerance 1/64, we end up needing around 9.4 bits per number sent. Each halving of the tolerance seems to cost an additional 1.2-1.3 bits: with tolerance 1/128, we use 10.7 bits per number, and at tolerance 1/256, we use 12.08 bits per number sent.
Is this any good? Well, it’s not /great/. The boring solution for tolerance 1/64 has entropy indicating about 7 bits per number. So, the Fibonacci encoding scheme looks really suboptimal. But in computing the entropy, we didn’t /actually/ code the bins into binary. There’s always some loss due to the need to pack the probability distribution into bits. If we code the bins with Fibonacci coding, we end up needing just over 8 bits. Going through and trying many more tolerances, it seems to cost around an extra 1.2 bits per halving of the tolerance, which is actually quite close to the growth when using the adaptive-rate encoding: This growth rate probably has more to do with the choice of the Fibonacci coding scheme than the choice of where we place our quantization cells.
So, backing up, there’s two parts to the system:
- A distribution scheme, which is an ordered choice of points on the real line, each corresponding to a quantization cell, and
- An encoding scheme, where we map an arbitrary positive integer to bits.
We’ve looked at two distribution schemes (evenly spaced, and greedily minimizing distortion). The greedy distortion approach seems to eventually cost around 1 bit more per value encoded at a given tolerance than the evenly spaced strategy in practice. On the other hand, the greedy approach gives us arbitrary precision. We expect that flexibility to come at a cost: In this case, it seems the cost is only 1 bit, which is pretty cool! Normally we would expect to send some extra info indicating the intended tolerance to the decoder, which will usually take more than 1 bit.
In the main construction, we have a space-filling strategy which gets progressively better resolution the ‘deeper’ you go. We paired this with a Fibonacci encoding, which allows us to indicate an arbitrarily ‘deep’ value in the space-filling points without needing to communicate up front what tolerance we’re planning to use.
On the whole, it seems like a pretty cheap way to get arbitrary precision, with no extra side information to indicate the ‘intended’ tolerance.
Of course, there are better coding schemes than the Fibonacci coding for single values with a fixed, finite alphabet… A better comparison would use Huffman codes for both the cells and the tolerance. But this is just the tiniest of side quests, so I’ll leave off from here.
If you want to play with this further, here’s some room for expansion:
- Create a better scheme for finding the distortion-minimizing points than (slow) numerical approximation.
- Do a proper comparison of Fibonacci coding against a Huffman coding (or similar).
- Make a better comparison against a ‘traditional’ adaptive tolerance scheme, where we would (say) code the tolerance and the quantization cell separately.
- Try out a Tribonacci encoding (which avoids the pattern 111, instead of 11) instead of the Fibonacci encoding. The Tribonacci ratio is about 1.8, rather than 1.6 for the Golden Ratio, so it should be a liiiiittle bit more bit-efficient than the Fibonacci encoding, at the cost of one extra bit per value sent.
- Or try some other universal code… Like Golomb.
And, since this is just a fun blog post, I’ll leave off from here. Cheers!