Table of Contents for Programming Languages: a survey

Chapter: encoding

Instruction encoding

Fixed or variable length

The various syntactic constructs of a target language program can be of fixed or variable length.

An advantage of fixed-length encodings is that the instructions can be decoded in parallel.

Variable length encodings

For constructs with variable-length encodings, a way of finding the end of each construct must be provided. The possibilities and tradeoffs are similar to string length/termination schemes, and particularly to the tradeoffs between C-style strings and Pascal-style strings (see , below), except:

the language can have a hierarchical/recursive syntactic structure, meaning that it may be possible for variable-length length syntactic constructs to be composed of smaller fixed-length parts, or vice versa.
- this leads to interesting implications on the other advantages and disadvantages; for example, if one had terminator sentinel values to mark the end of the higher-level, containing syntactic contstruct, but not for the smaller constructs contained within it, this would only inhibit parallel decoding within each higher-level construct, not between them; and the chain of serial dependencies would likely be short, allowing semi-parallel decoding even within it
in many cases the length of variable length syntactic constructs can be inferred without spending space on a length header (e.g. in most variable-length assembly languages, the length of an instruction is inferred from its opcode)
since we are talking about program and not data, the language designer can specify various sentinel values and make sure no other instructions are equal to or possible even that no other instructions contain those values. Compare to the case with strings, where C-terminated strings cannot store binary data without escaping, which is a real handicap; by contrast, if an assembly language designer wanted to reserve opcode 0 for some sort of sentinel or terminator, it would be easy for them to simply not use opcode 0 for any other instruction.

Data encoding

String length/termination schemes

The two most popular schemes are 'Pascal-style' strings (each string has a fixed-length header (two bytes is a popular choice) that specifies the string length) and 'C-style strings' (each string is termination by a zero byte, or 'NULL').

Pascal-style strings have the disadvantages that:

they have a maximum length
if you are just looking at the bitstream without knowing which bytes are the string header, you don't know where the string ends
- this means that if you have a one-off error and lose your position in the bitstream, you will not be able to recover
- this also means that you cannot have a bunch of decoders looking at different parts of the encoded data and parse it in parallel; there are serial dependencies between each data item and the previous one, because you must know the length of the previous one before you know where the next one starts
if the header length is more than one byte, then it will waste space for strings of length smaller than 256

C-style strings have the disadvantages that:

there is no constant-time way to determine the string-length
- this makes some string operations that are O(n) less efficient, because sometimes there is a way to do the inner loop more efficiently but only if you know the length ahead of time, and for an O(n) operation it is often not worth it to walk thru the entire string before starting the real work
the string cannot store a NULL (without escaping); this means that the language must treat binary data separately from strings
they are prone to programming mistakes that can lead to security bugs, involving inserting NULLs into strings or buffer overflows
technically speaking, UTF-8 can contain NULL bytes and so cannot be stored unescaped as a C-style string. For this reason, many languages use Modified UTF-8.

Pascal-style strings can be extended by various encoding schemes that allow arbitrary lengths to be encoded into a variable-length header. For example:

the header itself can be null-terminated
the header could use https://en.wikipedia.org/wiki/Elias_gamma_coding , that is, the header can consist of two parts, a unary count of digits, and a binary representation of length whose number of digits is specified in the first part
the header could use some other prefix-free code such as the ones in https://en.wikipedia.org/wiki/Prefix-free_code#Techniques_2