proj-plbook-plChStrings

Table of Contents for Programming Languages: a survey

Chapter : Strings

immutable vs. mutable

keywords

internationalization

separating strings from source code files

unicode intro: encode, decode; cannot assume one byte per character

unicode in operations

unicode in source code files

UTF-8 and ASCII, esp. in source files

Many older programming languages treat strings the same as sequences of bytes.

Many programming languages used to treat strings as arrays of bytes, but moved to treating them as sequences of Unicode code points. For example, Python made this change between versions 2 and 3. In this approach, the programmer generally only has to think about encodings at I/O boundaries [1], although they must still be aware of Unicode paradoxes, such as that the length of a concatenated string is not always equal to the sum of the lengths of the substrings (for example, when length is counted in user-perceived characters or after normalization, because a combining mark can merge with the character before it). Compared to the older "strings are the same as sequences of bytes" approach, this causes more work when a "string's" encoding is in fact unknown, in which case the string must be treated as a sequence of bytes anyway; this is apparently the case with filenames and many other things in Unix [2].
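The two points above (decoding at the I/O boundary, and string lengths that do not add up) can be illustrated with a minimal Python 3 sketch. The sample strings are made up for illustration, and the "paradox" shown here assumes that length is measured after NFC normalization; other definitions of length (such as grapheme clusters) behave similarly but need third-party libraries.

```python
import unicodedata

# Bytes as they might arrive from a file or socket (UTF-8 is assumed here).
raw = "caf\u00e9".encode("utf-8")

# Decode once at the input boundary; from here on we work with code points.
text = raw.decode("utf-8")

byte_len = len(raw)         # counts bytes
code_point_len = len(text)  # counts Unicode code points

# A base letter followed by a combining mark: two code points,
# but a reader sees a single character.
base = "e"
accent = "\u0301"  # COMBINING ACUTE ACCENT
combined = base + accent

# NFC normalization composes the pair into one precomposed code point,
# so the normalized concatenation is shorter than the sum of its parts.
composed = unicodedata.normalize("NFC", combined)

print(byte_len, code_point_len)
print(len(base) + len(accent), len(combined), len(composed))
```

In code points alone, concatenation is additive in Python; the mismatch appears only once normalization or user-perceived characters enter the picture.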

Ruby 1.9 changed to treat strings as a tuple (encoding, sequence of bytes) [3]. This allows strings to contain even non-Unicode characters, at the expense of forcing the programmer to be aware of encodings when doing operations on strings.

Many programming languages and language implementations use UTF-8 as the internal encoding of strings. This can pose problems if the programmer needs to work with external string data in an encoding that cannot be roundtripped through Unicode (as of this writing, one example is CP932, also known as Windows-31J). Ruby "stores Strings as the original sequence of bytes, but allows a String to be tagged with its encoding" in order to allow programmers to work with these other encodings.

Windows and JavaScript use UTF-16 internally [4]. UTF-8 is popular on the WWW [5].

Modified UTF-8 is a variant of UTF-8 in which the null character, codepoint U+0000, is encoded as the two bytes 0xC0, 0x80, leaving the byte 0x00 available for null-termination and similar uses.
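A rough Python sketch of the null-character trick follows. The function names encode_nul_safe and decode_nul_safe are hypothetical, and the sketch covers only the U+0000 rule described above. Modified UTF-8 as used by Java also encodes supplementary characters differently from standard UTF-8, which is not handled here.

```python
def encode_nul_safe(text: str) -> bytes:
    """Encode as UTF-8, but write U+0000 as the two bytes 0xC0 0x80.

    In standard UTF-8 the byte 0x00 only ever represents U+0000,
    so a plain byte replacement is sufficient.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_safe(data: bytes) -> str:
    """Reverse encode_nul_safe.

    The byte 0xC0 never occurs in valid standard UTF-8, so the
    sequence 0xC0 0x80 can only have come from an encoded U+0000.
    """
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


sample = "a\x00b"
encoded = encode_nul_safe(sample)

# The encoded form contains no zero byte, so a zero byte can be
# appended as a terminator without ambiguity.
terminated = encoded + b"\x00"
print(encoded, decode_nul_safe(encoded) == sample)
```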

Many Windows programs "add the bytes 0xEF, 0xBB, 0xBF at the start of any document saved as UTF-8" [6].
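Those three bytes are the UTF-8 encoding of U+FEFF, the byte order mark. As a small illustration of how a reader might cope with them, Python's standard library provides the constant codecs.BOM_UTF8 and the "utf-8-sig" codec, which drops a leading mark if one is present. The sample data below is made up.

```python
import codecs

# A document as a Windows editor might save it: BOM followed by the text.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain UTF-8 codec keeps the mark as a leading U+FEFF code point.
with_mark = data.decode("utf-8")

# The "utf-8-sig" codec strips a leading BOM when decoding.
without_mark = data.decode("utf-8-sig")

# Input that has no BOM passes through "utf-8-sig" unchanged.
no_bom_input = b"hello".decode("utf-8-sig")

print(repr(with_mark), repr(without_mark), repr(no_bom_input))
```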

Links:

regular expressions

C vs. Pascal style vs. linked lists

haskell strings as linked lists; need for ByteString?; unicode

(todo: should this be moved to the 'implementation' part of the book?)

Links:

EOL and EOF

Unix vs. Windows EOL characters

newline at end of file (trivia: the C standard requires one)

formatting

sprintf-style

Python's str.format

Python's Literal String Interpolation
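For comparison, here is the same message built with each of the three formatting styles listed above, as they appear in Python 3.6 or later (Literal String Interpolation is commonly known as f-strings). The variable names and message are made up for illustration.

```python
name = "world"
count = 3

# sprintf-style: the format string uses % conversion specifiers.
sprintf_style = "hello %s, %d times" % (name, count)

# str.format: replacement fields are written as braces.
format_style = "hello {}, {} times".format(name, count)

# Literal String Interpolation: expressions are embedded in the literal itself.
fstring_style = f"hello {name}, {count} times"

print(sprintf_style)
print(format_style)
print(fstring_style)
```

All three produce the same string; they differ mainly in where the values are written relative to the template.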

todo

https://utcc.utoronto.ca/~cks/space/blog/programming/CNullStringsDefense?showcomments

discussion on low-level string representations: https://www.reddit.com/r/C_Programming/comments/nqkn93/comment/h0c6kt2/