proj-oot-ootStringThoughts

Should strings be bytes or unicode?

There are three typical choices:

Our choice:

This makes it much harder for people who need to deal with non-Unicode characters, but slightly easier for everyone else. I expect that eventually Unicode will be fixed to add additional characters to make these people happy.

Error handling

Python has 3 available unicode encoding codec policies:

-- [1]

I've personally been burned in Python when trying to print an error string: the error never got printed, because it contained Unicode and was being printed to an ASCII stream. The Python answer to this is 'use repr, not print, when doing stuff that should always work, eg for debugging, but may not look nice'. But I don't think that's good. So, for Oot, I think the default should be either 'xmlcharrefreplace' or 'replace', with 'strict' as an option. 'xmlcharrefreplace' is probably better because it's good for debugging, but then you can't grep through the output to see whether there was a Unicode error. So how about we mix the two: first insert the standard replacement character (U+FFFD, the question mark with a diamond around it), and then insert an xmlcharref.
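As a rough illustration of the mixed policy, here is a minimal Python sketch using `codecs.register_error`. The handler name `oot_replace_charref` is made up for this example, and because U+FFFD cannot itself be written to an ASCII stream, the sketch assumes a plain `?` stands in for the replacement character (which is what Python's own 'replace' policy emits when encoding).

```python
import codecs

def replace_then_charref(err):
    """Encoding error handler: emit a greppable '?' marker, then an XML character reference."""
    if not isinstance(err, UnicodeEncodeError):
        raise err  # this sketch only covers encoding, not decoding
    bad = err.object[err.start:err.end]
    # '?' stands in for U+FFFD, which is not representable in ASCII (assumption of this sketch)
    replacement = "".join("?&#{};".format(ord(ch)) for ch in bad)
    return replacement, err.end  # resume encoding right after the bad span

# 'oot_replace_charref' is a hypothetical policy name, not an existing Python or Oot policy
codecs.register_error("oot_replace_charref", replace_then_charref)

encoded = "na\u00efve caf\u00e9".encode("ascii", errors="oot_replace_charref")
print(encoded)  # b'na?&#239;ve caf?&#233;'
```

With this, the output always gets printed, `grep '?&#'` finds every place an error occurred, and the numeric reference still says which character it was.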

If someone wants to fuzz their code and have it globally go into 'strict' mode, there is a semiglobal environment variable that does that.
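A sketch of how such a switch might look, in Python. The variable name `OOT_STRICT_UNICODE` and the function `encode_for_output` are hypothetical, and Python's built-in 'replace' policy is used here only as a stand-in for whatever the non-strict default ends up being.

```python
import os

def encode_for_output(text, encoding="ascii", environ=os.environ):
    """Encode text for an output stream, honoring a global 'strict' switch."""
    # OOT_STRICT_UNICODE is a made-up variable name for illustration
    strict = environ.get("OOT_STRICT_UNICODE") == "1"
    policy = "strict" if strict else "replace"  # 'replace' stands in for the real default
    return text.encode(encoding, errors=policy)

# Normal runs never fail on output:
print(encode_for_output("caf\u00e9", environ={}))  # b'caf?'

# Fuzzing runs turn every lossy conversion into an exception:
try:
    encode_for_output("caf\u00e9", environ={"OOT_STRICT_UNICODE": "1"})
except UnicodeEncodeError as exc:
    print("strict mode caught:", exc.reason)
```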

Internal string representation

Internally, we have 5 choices:

Choice: UTF-8

Compared to 'bytes with encoding tag', this is inefficient when dealing with large volumes of text in some other encoding, esp. if that text is only stored and then output in its native encoding and not processed, but we never said that Oot was efficient. If you really want to just store this stuff, just treat it as binary data rather than strings. UTF-8 throughout makes implementation simpler.

Names of functions

Not encode/decode, that's too hard to remember. Use bytes() and str() constructors instead. (Yes, the str() constructor takes an encoding argument, which is optional in some cases; when converting from type 'bytes' it is either (a) required, or (b) optional there too, and read from some semiglobal default unless explicitly specified.)
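For comparison, Python already accepts this constructor style alongside encode/decode, so a sketch of the intended feel can be written with its built-in `bytes()` and `str()`. This shows Python's behavior only; how Oot would spell the encoding argument or its default is not decided here.

```python
text = "caf\u00e9"

# str -> bytes: the encoding is required when constructing bytes from a str
raw = bytes(text, "utf-8")

# bytes -> str: passing an encoding makes str() decode rather than call repr()
back = str(raw, "utf-8")

print(raw)           # b'caf\xc3\xa9'
print(back == text)  # True
```

Note that in Python, `str(raw)` without an encoding returns the repr-like text `"b'caf\\xc3\\xa9'"` rather than decoding, which is an argument for option (a) or for a well-defined semiglobal default in option (b).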

What is a 'character' in a string?

A unicode code point? Or grapheme cluster?

Note that the length of a concatenated string in grapheme clusters is not always equal to the sum of the lengths of the substrings, because modifiers in Unicode (eg combining marks) apparently have non-zero length on their own according to the standard, yet merge into the preceding cluster once concatenated.
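A small Python sketch of the effect. Python's standard library has no grapheme cluster segmentation, so `approx_cluster_count` below is a deliberately simplified stand-in (a base character plus any following combining marks), not the real Unicode segmentation algorithm; it is only meant to show why lengths don't add up.

```python
import unicodedata

def approx_cluster_count(s):
    """Count 'clusters' as: any character that starts the string or is not a combining mark."""
    count = 0
    for i, ch in enumerate(s):
        # A combining mark at the very start has nothing to attach to, so it counts by itself
        if i == 0 or unicodedata.combining(ch) == 0:
            count += 1
    return count

base = "e"
accent = "\u0301"  # COMBINING ACUTE ACCENT

print(approx_cluster_count(base))           # 1
print(approx_cluster_count(accent))         # 1
print(approx_cluster_count(base + accent))  # 1, not 2
```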

Indices into strings are opaque identifiers, not integers, although they act like integers and can be cast to/from them. Internally they can be byte indices into the UTF-8 internal representation, with an assertion that they never point into the middle of a single grapheme cluster. Casting to/from integers forces a count of grapheme clusters (should there be a way to maintain this count rather than doing it all at cast time?).

See also:

Unicode in source code

Nope. Only ASCII.

The disadvantage is that there are many kids whose names and native languages can't be rendered in ASCII, and so who won't be able to write their name in the source code when learning to program, and who can't write 'Hello World' in their native language. My guess is that although this is a real hindrance, many will be able to chalk this up to 'weird technical programming language stuff' and writing 'code'.

The advantage is that the implementation is simpler.

(Desktop profiles of) Oot will have a built-in facility to replace strings in the source code with strings from an accompanying internationalization file, so it won't be that hard for kids to make the program print their name, only somewhat cumbersome.

Regular expressions on Unicode

There is a Unicode standard for Unicode regular expressions (UTS #18). It has 3 levels of functionality: level 1 is 'minimal' (code points as characters, and various property lookup tables hardcoded as regex character classes), level 2 (grapheme clusters as characters), and level 3 ('tailored', eg locale-specific stuff).

I suppose we might consider implementing level 1 if we can, and even in theory level 2. But even levels 1 and 2 look like a ton of extra work unless the platform contains a Unicode library that someone else wrote, and since Oot is supposed to be very multi-platform we can't really assume one is available. Perhaps we'll just implement the code-points-as-characters regexes (ie not even the level-1-required character classes). Or perhaps even this will be platform-specific. And as I've noted previously, the embedded profile won't even offer Unicode.
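To make the code-points-as-characters option concrete, here is what it looks like in Python, whose built-in `re` module matches `.` against one code point at a time. This only illustrates the behavior; it makes no claim about which level of the Unicode regex standard Python's `re` conforms to.

```python
import re

# 'e' followed by COMBINING ACUTE ACCENT: one grapheme cluster, but two code points
decomposed = "e\u0301"

one_char = re.fullmatch(r".", decomposed)    # fails: '.' consumes only the 'e'
two_chars = re.fullmatch(r"..", decomposed)  # succeeds: one '.' per code point

print(one_char is None)       # True
print(two_chars is not None)  # True
```

A regex engine with grapheme clusters as characters (level 2) would instead let a single 'character' match the whole accented letter.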

---