Table of Contents for Programming Languages: a survey
Chapter: encoding
Instruction encoding
Fixed or variable length
The various syntactic constructs of a target language program can be of fixed or variable length.
An advantage of fixed-length encodings is that the instructions can be decoded in parallel.
Variable length encodings
For constructs with variable-length encodings, a way of finding the end of each construct must be provided. The possibilities and tradeoffs are similar to string length/termination schemes, and particularly to the tradeoffs between C-style strings and Pascal-style strings (see , below), except:
- the language can have a hierarchical/recursive syntactic structure, meaning that it may be possible for variable-length length syntactic constructs to be composed of smaller fixed-length parts, or vice versa.
- this leads to interesting implications on the other advantages and disadvantages; for example, if one had terminator sentinel values to mark the end of the higher-level, containing syntactic contstruct, but not for the smaller constructs contained within it, this would only inhibit parallel decoding within each higher-level construct, not between them; and the chain of serial dependencies would likely be short, allowing semi-parallel decoding even within it
- in many cases the length of variable length syntactic constructs can be inferred without spending space on a length header (e.g. in most variable-length assembly languages, the length of an instruction is inferred from its opcode)
- since we are talking about program and not data, the language designer can specify various sentinel values and make sure no other instructions are equal to or possible even that no other instructions contain those values. Compare to the case with strings, where C-terminated strings cannot store binary data without escaping, which is a real handicap; by contrast, if an assembly language designer wanted to reserve opcode 0 for some sort of sentinel or terminator, it would be easy for them to simply not use opcode 0 for any other instruction.
Data encoding
String length/termination schemes
The two most popular schemes are 'Pascal-style' strings (each string has a fixed-length header (two bytes is a popular choice) that specifies the string length) and 'C-style strings' (each string is termination by a zero byte, or 'NULL').
Pascal-style strings have the disadvantages that:
- they have a maximum length
- if you are just looking at the bitstream without knowing which bytes are the string header, you don't know where the string ends
- this means that if you have a one-off error and lose your position in the bitstream, you will not be able to recover
- this also means that you cannot have a bunch of decoders looking at different parts of the encoded data and parse it in parallel; there are serial dependencies between each data item and the previous one, because you must know the length of the previous one before you know where the next one starts
- if the header length is more than one byte, then it will waste space for strings of length smaller than 256
C-style strings have the disadvantages that:
- there is no constant-time way to determine the string-length
- this makes some string operations that are O(n) less efficient, because sometimes there is a way to do the inner loop more efficiently but only if you know the length ahead of time, and for an O(n) operation it is often not worth it to walk thru the entire string before starting the real work
- the string cannot store a NULL (without escaping); this means that the language must treat binary data separately from strings
- they are prone to programming mistakes that can lead to security bugs, involving inserting NULLs into strings or buffer overflows
- technically speaking, UTF-8 can contain NULL bytes and so cannot be stored unescaped as a C-style string. For this reason, many languages use Modified UTF-8.
Pascal-style strings can be extended by various encoding schemes that allow arbitrary lengths to be encoded into a variable-length header. For example: