Bayle Shanks's website: proj-oot-ootStringThoughts

Should strings be bytes or unicode?

There are three typical choices:

(1) strings are arrays of bytes, as older languages did
(2) strings are Unicode (arrays of grapheme clusters? arrays of codepoints? something else?), and encoding/decoding is done at I/O boundaries
(3) strings are pairs of (encoding tag, bytes). This allows non-Unicode characters to be represented (apparently, due to Han unification, some people's Asian personal names or place names come out wrong in Unicode). Ruby 1.9 does this.

Our choice:

(2) Unicode, with encoding and decoding at I/O boundaries. However, embedded profiles may not want to support that, so some profiles may only support ASCII; in these profiles the unicode encoding/decoding functions don't really do anything.

This makes it much harder for people who need to deal with non-Unicode characters, but slightly easier for everyone else. I expect that eventually Unicode will be fixed by adding the missing characters, which should make these people happy.

Error handling

Python's Unicode codecs accept an error-handling policy. Three of the available policies are:

strict: throw an error
replace: "give me a standard replacement character", eg a question mark with a diamond around it
xmlcharrefreplace: "produces an HTML/XML character entity reference", so that \u01B4 becomes "&#436;"

-- [1]
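As a quick illustration of the three policies listed above, here is a minimal Python sketch that encodes one string to ASCII under each of them. The sample string and the helper name `strict_fails` are made up for the example. Note that when encoding, Python's 'replace' emits a plain "?"; the diamond question mark (U+FFFD) is what you get when decoding.

```python
# Sample text ending in U+01B4, which ASCII cannot represent.
text = "caf\u01b4"

def strict_fails(s):
    """Return True if encoding s to ASCII under 'strict' raises an error."""
    try:
        s.encode("ascii", "strict")
    except UnicodeEncodeError:
        return True
    return False

# 'replace' substitutes a placeholder for each unencodable character.
replace_result = text.encode("ascii", "replace")

# 'xmlcharrefreplace' substitutes a decimal character reference.
xml_result = text.encode("ascii", "xmlcharrefreplace")

# Decoding with 'replace' yields U+FFFD, the diamond question mark.
decoded = b"caf\xff".decode("ascii", "replace")

print(strict_fails(text), replace_result, xml_result, repr(decoded))
```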

I've personally been burned in Python when trying to print out an error string: the error didn't even get printed, because it contained Unicode and was being printed to an ASCII stream. The Python answer to this is 'use repr, not print, for output that should always work (eg for debugging) even if it doesn't look nice'. But I don't think that's good. So, for Oot, I think the default should be either 'xmlcharrefreplace' or 'replace', with 'strict' as an option. Probably 'xmlcharrefreplace', because that's good for debugging, although then you can't grep through the output to see if there was a Unicode error. So how about we mix the two? First insert the standard replacement character (the question mark with a diamond around it), and then insert an xmlcharref.
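A rough sketch of what the mixed policy could look like, written as a custom Python codec error handler. The handler name "oot_mixed" and the function names are invented for this example. One assumption to flag: when the target encoding is ASCII, the diamond question mark (U+FFFD) cannot itself be encoded, so this sketch uses a plain "?" as the greppable marker.

```python
import codecs

def replace_then_xmlcharref(err):
    """Emit a marker plus a character reference for each unencodable char."""
    if not isinstance(err, UnicodeEncodeError):
        raise err  # this sketch only handles encoding errors
    bad = err.object[err.start:err.end]
    # "?" stands in for the replacement character (an assumption, see above).
    replacement = "".join("?&#%d;" % ord(ch) for ch in bad)
    # Resume encoding right after the unencodable run.
    return replacement, err.end

# "oot_mixed" is a hypothetical policy name, not a standard Python handler.
codecs.register_error("oot_mixed", replace_then_xmlcharref)

def to_ascii(s):
    return s.encode("ascii", "oot_mixed")

print(to_ascii("caf\u00e9 error"))
```

With this, the output always prints, the marker can be grepped for, and the character reference still says which code point was involved.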

If someone wants to fuzz their code and have it globally go into 'strict' mode, there is a semiglobal environment variable that does that.
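A minimal sketch of the strict-mode switch, using an ordinary process environment variable as a stand-in for Oot's semiglobal. The variable name `OOT_UNICODE_STRICT` and both function names are hypothetical, and the lenient default here is Python's built-in 'xmlcharrefreplace' rather than the mixed policy.

```python
import os

# Hypothetical variable name; Oot's actual mechanism is not specified here.
STRICT_ENV_VAR = "OOT_UNICODE_STRICT"

def current_error_policy(environ=os.environ):
    """Pick the codec error policy from the (semi)global environment."""
    if environ.get(STRICT_ENV_VAR) == "1":
        return "strict"
    return "xmlcharrefreplace"

def encode_for_output(s, encoding="ascii", environ=os.environ):
    """Encode s at an I/O boundary using the currently selected policy."""
    return s.encode(encoding, current_error_policy(environ))

# Lenient by default; a fuzzing run would set the variable to "1".
print(encode_for_output("caf\u00e9", environ={}))
```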

Internal string representation

Internally, we have 4 choices:

bytes with encoding tag
UTF-8
UTF-16
some weird mix, like Python

Choice: UTF-8

Compared to 'bytes with encoding tag', this is inefficient when dealing with large volumes of text in some other encoding, esp. if that text is only stored and then output in its native encoding and not processed, but we never said that Oot was efficient. If you really want to just store this stuff, just treat it as binary data rather than strings. UTF-8 throughout makes implementation simpler.
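To make the inefficiency concrete, here is a small Python sketch comparing the size of the same Japanese text in Shift_JIS and in UTF-8, and showing the 'just treat it as binary data' alternative. The sample text is chosen only for illustration.

```python
# Five hiragana characters ("konnichiwa"), written with escapes.
text = "\u3053\u3093\u306b\u3061\u306f"

native = text.encode("shift_jis")   # the text's original encoding
internal = text.encode("utf-8")     # what a UTF-8-internal string would hold

# Each of these characters takes 2 bytes in Shift_JIS but 3 in UTF-8.
print(len(native), len(internal))

# If the text is only stored and re-emitted, keeping the raw bytes avoids
# both the conversion cost and the size increase.
stored = bytes(native)
assert stored == native
```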

Names of functions

Not encode/decode; that's too hard to remember. Instead, bytes() and str() constructors. (Yes, the str() constructor takes an encoding argument, which is optional in some cases; when converting from type 'bytes' it is either (a) required, or (b) optional there too, and read from some semiglobal default unless explicitly specified.)
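Python's own constructors already work roughly this way, so they can serve as an illustration (this shows Python's behavior, not Oot's). One wrinkle worth noting: in Python, leaving out the encoding when converting from bytes does not fail but silently produces the repr, which is an argument for option (a) or a well-defined default in option (b).

```python
# Text to bytes: the encoding is required when the source is a str.
raw = bytes("na\u00efve", "utf-8")

# Bytes to text: pass the encoding explicitly.
text = str(raw, "utf-8")

# Omitting the encoding gives the repr of the bytes object, not decoded text.
surprising = str(raw)

print(raw, text, surprising)
```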

What is a 'character' in a string?

A unicode code point? Or grapheme cluster?

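Note that the length of a concatenated string in grapheme clusters is not always equal to the sum of the lengths of the substrings: a combining mark (modifier) standing alone apparently has non-zero length according to the standard, but when it is appended to a base character it merges into that character's cluster. For example, 'e' and U+0301 (combining acute accent) each have length 1, but their concatenation also has length 1.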
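The non-additivity can be demonstrated with a rough Python sketch. Python's standard library has no grapheme cluster segmentation, so `approx_cluster_count` below is a made-up approximation that only handles combining marks (via `unicodedata.combining`); the real rules in the Unicode standard cover many more cases.

```python
import unicodedata

def approx_cluster_count(s):
    """Approximate grapheme cluster count: a cluster starts at the first
    code point and at every code point that is not a combining mark."""
    count = 0
    for i, ch in enumerate(s):
        if i == 0 or unicodedata.combining(ch) == 0:
            count += 1
    return count

base = "e"
accent = "\u0301"  # combining acute accent

# Each piece counts as one cluster alone, but so does the concatenation.
print(approx_cluster_count(base),
      approx_cluster_count(accent),
      approx_cluster_count(base + accent))
```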

Indices into strings are opaque identifiers, not integers, although they act like integers and can be cast to/from them. Internally they can be byte indices into the UTF-8 internal representation, with an assertion that they never point into the middle of a single grapheme cluster. Casting to/from integers forces a count of grapheme clusters (should there be a way to maintain this count rather than doing it all at cast time?).
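Here is one way the opaque index could be sketched in Python, under some loud assumptions: the class name `StrIndex` and its methods are invented, and cluster boundaries are approximated by 'not a combining mark' rather than the full Unicode segmentation rules. The index stores a byte offset into the UTF-8 data, asserts that it does not split a cluster, and only counts clusters when cast to or from an integer.

```python
import unicodedata
from dataclasses import dataclass

def _starts_cluster(text, i):
    # Approximation: only combining marks continue a cluster.
    return i == 0 or unicodedata.combining(text[i]) == 0

def _is_cluster_boundary(data, offset):
    if offset < 0 or offset > len(data):
        return False
    if offset in (0, len(data)):
        return True
    if data[offset] & 0xC0 == 0x80:
        return False  # UTF-8 continuation byte: middle of a code point
    first = data[offset:].decode("utf-8")[0]
    return unicodedata.combining(first) == 0

@dataclass(frozen=True)
class StrIndex:
    data: bytes   # UTF-8 internal representation of the string
    offset: int   # byte offset into data

    def __post_init__(self):
        assert _is_cluster_boundary(self.data, self.offset), "splits a cluster"

    def __int__(self):
        # Casting to int forces a count of the clusters before the offset.
        prefix = self.data[:self.offset].decode("utf-8")
        return sum(1 for i in range(len(prefix)) if _starts_cluster(prefix, i))

    @classmethod
    def from_int(cls, data, n):
        # Casting from int walks the string to find the n-th cluster start.
        text, seen, offset = data.decode("utf-8"), 0, 0
        for i, ch in enumerate(text):
            if _starts_cluster(text, i):
                if seen == n:
                    return cls(data, offset)
                seen += 1
            offset += len(ch.encode("utf-8"))
        if seen == n:
            return cls(data, len(data))
        raise IndexError(n)

data = "e\u0301x".encode("utf-8")  # 4 bytes, 2 clusters
idx = StrIndex.from_int(data, 1)
print(idx.offset, int(idx))
```

Both casts are linear in the length of the string here, which is exactly the cost the question about maintaining the count is getting at.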

Unicode in source code

Nope. Only ASCII.

The disadvantage is that there are many kids whose names and native languages can't be rendered in ASCII, and who therefore won't be able to write their name in the source code when learning to program, or write 'Hello World' in their native language. My guess is that although this is a real hindrance, many will be able to chalk it up to 'weird technical programming language stuff' that comes with writing 'code'.

The advantage is that the implementation is simpler.

(Desktop profiles of) Oot will have a built-in facility to replace strings in the source code with strings from an accompanying internationalization file, so it won't be that hard for kids to make the program print their name, only somewhat cumbersome.
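A very rough sketch of how such a replacement facility might behave, in Python. Everything here is an assumption made for illustration: the catalog format (JSON mapping ASCII source strings to replacements), and the function names `load_catalog` and `localize`. The catalog text below uses \u escapes only so that the example itself stays ASCII.

```python
import json

# Hypothetical internationalization file contents: ASCII source string
# mapped to the string that should actually be printed.
CATALOG_JSON = '{"Hello, World": "\\u4f60\\u597d\\uff0c\\u4e16\\u754c"}'

def load_catalog(text):
    """Parse the catalog; a real facility would read an accompanying file."""
    return json.loads(text)

def localize(source_string, catalog):
    """Return the replacement if one exists, else the source string."""
    return catalog.get(source_string, source_string)

catalog = load_catalog(CATALOG_JSON)
greeting = localize("Hello, World", catalog)
print(greeting)
```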

Regular expressions on Unicode

There is a Unicode standard for Unicode regular expressions (UTS #18). It has 3 levels of functionality: level 1 is 'minimal' (code points as characters, and various property lookup tables hardcoded as regex character classes), level 2 (grapheme clusters as characters), and level 3 ('tailored', eg locale-specific stuff).

I suppose we might consider implementing level 1 if we can, and even in theory level 2, but even levels 1 and 2 look like a ton of extra work unless the platform contains a Unicode library that someone else wrote. Since Oot is supposed to be very multi-platform, we can't really assume this is available. Perhaps we'll just implement code-points-as-characters regexes (ie not even the level-1-required character classes). Or perhaps even this will be platform-specific. And as I've noted previously, the embedded profile won't even offer Unicode.
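To show what code-points-as-characters matching means in practice, here is a small sketch using Python's `re` module, which matches on code points. The example strings are made up; the point is that '.' sees a base letter and its combining mark as two separate characters, so a pattern can fail on text that looks identical to a matching string unless the text is normalized first.

```python
import re
import unicodedata

decomposed = "cafe\u0301"                           # 'e' + combining acute
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

# '.' matches one code point, so the decomposed accent needs two dots.
dots_in_e_acute = len(re.findall(r".", "e\u0301"))

match_decomposed = re.fullmatch(r"caf.", decomposed)
match_composed = re.fullmatch(r"caf.", composed)

print(dots_in_e_acute, match_decomposed is None, match_composed is not None)
```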

---