notes-computer-variableLengthSelfDescribingEncodingsOfArbitraryIntegers

Let's say that we are transmitting information via a bitstream. We can choose the encoding format. The first item to be encoded into this bitstream is going to be an encoding of an integer. The integer may be arbitrarily large. The encoding needs to have a way of letting the reader know when we are done with the integer and the next item is coming.

See also http://en.wikipedia.org/wiki/Variable-length_code .

Unary

One way to do this is unary encoding with a sentinel of 0:

0 10 110 1110 11110

the problem with this is that it takes many bits to code large numbers.

Other constant sentinel-based codings

One way is to have a sentinel which will act as a terminator, and to escape this value if it occurs elsewhere. For example, we could make '00' our sentinel, and use a (0,1) run-length-limited code (see http://en.wikipedia.org/wiki/Run_length_limited ) to somehow 'escape' the rest of the bits of the number so as to avoid 00. Clearly (i think), the sentinel must be longer than 1 bit or else we are back to unary.

Now we can use the bits before the sentinel (after de-escaping) similarly to bits in an ordinary binary encoding, except we can also use the lone sentinel and possibly corner cases in the escaping system to carry information.

e.g.: with 00 as the sentinel, put 'clock bits' of 1 in between each actual bit:

0: 00 1: 100 2: 0100 3: 1100 4: 010100 5: 011100 6: 110100 7: 111100

Conditional termination codings

Instead of having a constant sentinel value, you could have any sort of complicated rule, and then come up with some 'escaping' system to prevent the encoding of a binary representation from ever meeting this rule.

Composite encodings

We can take either of the previous systems and use them to encode single numbers, and then combine a sequence of these numbers (possibly with different encoding systems) to make another number.

For example, http://en.wikipedia.org/wiki/Golomb_coding has two parts, a quotient and a remainder. The quotient is unary-coded, and the remainder in truncated binary.

Geometric distribution condition

Now what if the larger numbers are much less probable than the smaller ones, in the sense that we rarely will want to encode a large integer but frequently will want to encode a small one? Now we care very much about a short encoding length for small integers, even at the expense of a longer encoding for large integers.

At the extreme, this pushes us to unary encoding. Clearly (i think), if you need to represent arbitrarily large numbers, you can't do better than 1 bit for 0, and then if you have a scheme that does 1 bit for 0 and your next priority is having the shortest representation for 1, you can't do better than 2 bits for 1, and then if you have a scheme that does 1 bit for 0 and 2 bits for 1 and your next priority is having the shortest representation for 2, you can't do better than 3 bits for 2, etc.

But this does seem rather wasteful for large numbers.

One special case of this criterion is if we have a geometric distribution for the size of the number that we want to encode. http://en.wikipedia.org/wiki/Golomb_coding seems to say that Golomb coding is optimal for this.

Another scheme

Have the code in two parts. The first part is unary, the second part is binary. The total length is always a power of two. The length of the binary part is whatever is needed to 'fill out' to the next power of two.

0 10 1100 1101 4 1110 11110000 11110001 11110010 11110011 11110100 11110101 11110111 11111000 11111001 11111010 11111011 11111100 11111101 18 1111111100000000 1111111100000001 20 ...

this is better than the 00-sentinel-based encoding above for 0 and 1, and better than unary for 4 and any number above 8, but it's worse than unary for 2.

One attribute of these scheme is that the encoding lengths are always powers of two.

Another scheme #2

Two parts. A unary part followed by a binary part. The unary part encodes the length of the binary part.

0 100 101 11000 4 11001 11010 11011 1110000 8 1110001 1110010 1110011 1110100 1110101 1110110 1110111 111100000 16

this appear to be similar or identical to https://en.wikipedia.org/wiki/Elias_gamma_coding

Schemes with special cases for small numbers

One could just adopt unary for small number and then switch schemes at some preset value.

0 10 110 -- switch to 'another scheme', above 1110 11110000 etc

Fixed-length length header with maximal value indicating concatenation

One can have a fixed-length 'header' which encodes the length of the rest. If the header encodes the maximum length, that is a signal that the length is actually longer than can be encoded in this, and after you get to the maximum length you will find another header field that gives the length of the next segment.

This has a little endian and big endian version.

This method is directly suited for encoding binary strings, not integers (there is a different code for 00 and 000) but one can imagine variants that remove this redundancy for integer encoding.

E.g. if the header length is 2, and we are using binary and big endian, then some examples are:

encoded header fields broken out binary string 00 00: 010 01:0 0 011 01:1 1 1000 10:00 00 1001 10:01 01 1010 10:10 10 1011 10:11 11 1100000 11:000 00: 000 1100100 11:001 00: 001 1111100 11:111 00: 111 11000010 11:000 01:0 0000 11000011 11:000 01:1 0001 110001000 11:000 10:00 00000 110001001 11:000 10:01 00001 110101010 11:010 10:10 01010 110001100000 11:000 11:000 00: 000000 1100011000010 11:000 11:000 01:0 0000000 1100011000011 11:000 11:000 01:1 0000001 11100111001001 11:100 11:100 10:01 10010001

Fixed-length length header with maximal value as unary count, indicating indicating header length in the manner of unary ellias gamma encoding

One can have a fixed-length 'header' which encodes the length of the rest. If the header encodes the maximum length, that is a signal that the fixed length is not enough space for the header; the number of maximal length headers forms an unary count which is multiplied by the fixed length added to the original fixed length to give the length of the actual header, which follows.

E.g. if the header length is 2, then some examples are:

encoded header fields broken out binary string 00 00: 010 01:0 0 011 01:1 1 1000 10:00 00 1001 10:01 01 1010 10:10 10 1011 10:11 11 110011010 11:0011:010 010 1101001010 11:0100:1010 1010 11110100001111000011110000 1111:010000:0000111100001111 0000111100001111

Fixed-length length header with maximal value flagging beginning of null-terminated, RLL-encoded binary number, indicating header length in the manner of unary ellias gamma encoding

E.g. if we use the 'every-other-bit-is-a-one-clock-bit, then two zeros terminate' RLL, we get:

encoded header fields broken out binary string 00 00: 010 01:0 0 011 01:1 1 1000 10:00 00 1001 10:01 01 1010 10:10 10 1011 10:11 11 1101011110010 11:0011:010 010 11011101001010 11:0100:1010 1010 110111010101001111000011110000 1111:010000:0000111100001111 0000111100001111

Fibonacchi encoding

https://en.wikipedia.org/wiki/Fibonacci_coding

This is sort of a special RLL (run-length limited) code in which 11 doesnt appear until the end.

Elias omega coding

https://en.wikipedia.org/wiki/Elias_omega_coding

This is the obvious recursive generalization of Elias gamma coding.

Ternary coding

A little irrelevent to this page but interesting because it has the greatest radix economy (https://en.wikipedia.org/wiki/Radix_economy).

Variable radix counting systems

just some notes

0 1 prime #1 is 2 prime #2 is 3 2^2 = 4 prime #3 is 5 2*3 = 6 prime #4 is 7 2^3 = 8 3^3 = 9 2*5 = 10 prime #5 is 11 3*4 = 12

what i am trying to get at is, if you had a positional number system with a variable radix, where the radixes were the primes, then what would the symbols be for the exponents of those primes? you'd have to use the same system, but how would you start?

0 1 10 100 or 11 (10)0 or 101 1000 or 110 1001 or 111 10000 or 1010 10001 or 1011 or (100)0 10010 or (100)00 10100 or (10)000

seems like you may be able to get by with just 1s and 0s, is this a theorem?

Trinary RLL over binary

Each pair of binary digits encodes one of four possibilities:

00: trinary 0 01: trinary 1 10: trinary 2 11: STOP

now encode the value you want in trinary, and use the 11 to terminate it at the end.

Generalized variable-length integers

https://en.wikipedia.org/wiki/Numeral_system#Generalized_variable-length_integers

i dont understand this but it looks interesting.

Generalized variable-length integers