proj-oot-old-151220-ootStringNotes1

---

andrewstuart 5 hours ago

IMO one of the reasons for all the ((with Python unicode)) angst is that .encode() and .decode() are so ambiguous and unintuitive which makes them incredibly confusing to use. Which direction are you converting? From what to what? The whole unicode thing is hard enough to understand without Python's encoding and decoding functions adding to the mystery. I still have to refer to the documentation to make sure I'm encoding or decoding as expected.

I think there would have been much less of a problem if encode and decode were far more obvious, unambiguous and intuitive to use. Probably without there being two functions.

Still a problem of course today.
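A minimal Python 3 illustration of which direction each method goes (nothing project-specific here, just the standard str/bytes API):

    text = "naïve"                  # str: a sequence of code points
    data = text.encode("utf-8")     # str -> bytes: b'na\xc3\xafve'
    back = data.decode("utf-8")     # bytes -> str: 'naïve'
    assert back == text
    # In Python 3, bytes has no .encode() and str has no .decode(), so the
    # direction is at least enforced; in Python 2 both types carried both
    # methods, which is much of the ambiguity described above.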


 twoodfin 3 hours ago

Indeed, the Python 3 Unicode string object is fascinatingly clever. Code worth reading:

https://github.com/python/cpython/blob/master/Objects/unicod...


nostrademons 3 hours ago

Also, it's incompatible with UTF-8 strings stored in C, which means that when you cross the Python/C API boundary, you have to re-encode all strings. This is a large performance penalty right at the time when you can least afford performance penalties.

IMNSHO, most modern languages should be storing strings as UTF-8 and give up on random access by characters. You almost never need it; in the most frequent case where you do (using indexOf or equivalent to search for a substring, and then breaking on it), you can solve the problem by returning a type-safe iterator or index object that contains a byte offset under the hood, and then slicing on that. Go, Rust, and Swift have all gone this route.
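A small sketch of this pattern in Python, using byte strings as a stand-in for a UTF-8-native string type: the search returns a byte offset, and slicing on that offset is all you need, with no per-character random access involved.

    buf = "name=Grüße".encode("utf-8")   # UTF-8 bytes; 'ü' and 'ß' are 2 bytes each
    i = buf.find(b"=")                   # byte offset of the separator
    value = buf[i + 1:].decode("utf-8")  # slice on the returned offset
    print(value)                         # Grüße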


shoyer 2 hours ago

This design doc from DyND (a possible NumPy alternative) has some useful references on this point: https://github.com/libdynd/libdynd/blob/master/docs/string-design.md#code-unit-api-not-code-point


my NOTE: the linked document suggests having indices in units of bytes rather than code points. See the discussion supporting this sort of thinking below, which argues that there is no consensus on what a 'character' is, that the 'length' of a string isn't additive over concatenation when counting user-perceived characters, and that therefore you can't really have character indices at all, so the next best option is to have 'opaque indices' of some sort, which may then as well be byte indices under the hood. (further note: https://news.ycombinator.com/item?id=10754587 mentions that the Unicode standard does define something called a 'grapheme cluster')

hahainternet 4 hours ago

    In [1]: len("नि")
    Out[1]: 2

So close, but it's not really a string or bytes, but a codepoint array. You can even iterate it:

    In [2]: for c in "नि":
       ...:     print("Character {}".format(c))
       ...:     
    Character न
    Character ि


dietrichepp 4 hours ago

I'm getting tired of this argument; we had it in the Rust community as well, and I was tired of it there too. I've seen the same argument over and over again, and every version of it has the same hole that I see in your post: what, exactly, is a "character"? Please answer me that, and then you can write a proposal for how the API should work, exactly, and why this works with Indic and German and Korean and everything else.

Strings have to be arrays. It is inevitable. We are on Von Neumann machines with L1 cache lines that are something like 64 bytes wide, so making strings into something other than arrays is a complete non-starter. Because strings are arrays, we use array indexes to slice strings. The indexes are integers, because that just makes sense on a Von Neumann machine.

So your complaint is that you can slice a string in "bad" ways. What makes it bad? You're trying to prevent people from slicing up grapheme clusters, which is a noble goal. But in practice, we get the indexes from methods like str.find() or re.match(), and we treat them as more or less opaque, so we don't end up slicing grapheme clusters very often in practice. Grapheme cluster slicing requires a big Unicode table in memory anyway, so in the rare case that you need it, you can pay the high performance cost for using it. In the meantime, formats like JSON and XML are defined in terms of code points, so using code point arrays eliminates a class of bugs where you could accidentally make malformed XML or JSON, which would then get completely REJECTED by the receiving side, causing your Jabber client to quit or your web browser to show a bunch of mojibake.
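A sketch in plain Python of the failure mode being alluded to: slicing raw UTF-8 bytes at an arbitrary offset can cut a code point in half, and a strict receiver (a JSON or XML parser, say) will reject the result outright, whereas slicing a code point array can never produce an invalid sequence.

    payload = "café".encode("utf-8")   # b'caf\xc3\xa9' -- the 'é' is two bytes
    chunk = payload[:4]                # slices the 'é' in the middle
    try:
        chunk.decode("utf-8")          # what a strict decoder on the other side does
    except UnicodeDecodeError as err:
        print("rejected:", err)
    print("café"[:4])                  # code point slicing stays well-formed: café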

And let me ask you this: what do I get when I write:

    x = "a"
    y = "\u0301"

Is the resulting len(x + y) == 1? But len(x) == 1 and len(y) == 1? Can you tell me what the correct behavior is? Might you end up with bugs in programs because len(x + y) != len(x) + len(y)? Or do you introduce extra code points into the string when you concatenate them?
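For reference, this is what Python 3 actually does with that example: len() counts code points, so it is additive; what is not preserved is the count of user-perceived characters, which normalization makes visible.

    import unicodedata

    x = "a"
    y = "\u0301"                       # combining acute accent
    print(len(x), len(y), len(x + y))  # 1 1 2 -- additive at the code point level
    print(x + y)                       # renders as the single glyph 'á'
    print(len(unicodedata.normalize("NFC", x + y)))  # 1 -- composed to U+00E1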

Please, tell me what you actually think is correct behavior. It is far, far more useful than pointing out that something is "wrong" on purely semantic arguments.


anon1385 4 hours ago

The correct answer is that strings shouldn't have length methods. There is no such thing as the 'length' of a string because as you pointed out there is no clear definition of what a 'character' is.

https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-s...
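To make that concrete, the same Python string has several defensible 'lengths' depending on what you count: code points, code units, bytes, or user-perceived characters.

    s = "नि"                                # renders as one user-perceived character
    print(len(s))                           # 2 code points
    print(len(s.encode("utf-8")))           # 6 UTF-8 bytes
    print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units
    # grapheme clusters: 1 -- but counting those needs UAX #29 segmentation,
    # which the Python standard library does not provide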


the_mitsuhiko 3 hours ago

To add to that: O(1) indexing into strings outside of the ASCII range is completely pointless and makes absolutely no sense. A language should disallow that and not encourage it.


Veedrac 2 hours ago

I disagree with it being pointless. Being able to get indices into a string for fast slicing is useful. Consider regex matching, where references get stored to slices of the string.

What you can drop is the idea of the index being an integer with any particular semantic meaning. Rust uses byte indexes with an assertion that the index lies on a codepoint boundary, for instance.
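A rough Python analogue of that boundary check (Rust's real behaviour is to panic when a slice boundary falls inside a code point; here the test just looks for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx):

    def slice_utf8(buf: bytes, start: int, end: int) -> str:
        # refuse byte indices that fall inside a multi-byte sequence
        for i in (start, end):
            if 0 < i < len(buf) and (buf[i] & 0xC0) == 0x80:
                raise ValueError("index %d is not on a codepoint boundary" % i)
        return buf[start:end].decode("utf-8")

    buf = "héllo".encode("utf-8")
    print(slice_utf8(buf, 0, 3))   # 'hé'
    # slice_utf8(buf, 0, 2)        # ValueError: index 2 is not on a codepoint boundary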


lucozade 3 hours ago

> Because strings are arrays, we use array indexes to slice strings

But that's a choice and not the only one. Sure you're likely to implement your Unicode string as an array of some type. But that doesn't mean that the only sensible approach is to expose that array directly to the user. Or more specifically as the primary interface.

For my money, Swift has probably the most comprehensive Unicode string API I've seen [0]. What they do is, essentially, have an opaque String type but support various "views" as properties. The main view is called .characters and represents a collection of grapheme clusters. They also have properties that present Unicode scalars, utf8 encoding etc.

Their API is complex, no question. But then proper handling of Unicode is complex. But it does show that there are other options than simply exposing the Unicode scalars.

BTW, to your point on len(x) + len(y): the answer is 2 if you define len() on Unicode scalars but 1 if you define it on characters. Why? Because len("\u0301") should be 0. It's not a character, it's a base modifier. It is, of course, true that getting this right is likely to be substantially more expensive than getting it wrong, but that doesn't mean it can't be done.

[0] https://developer.apple.com/library/ios/documentation/Swift/...


klodolph 2 hours ago

len("\u0301") == 0 is extremely surprising.


lucozade 1 hour ago

Of course. What you were expecting was 6 right?

You learnt ASCII. Then you learnt about escaping to get around limitations in ASCII. Then you learnt that Unicode got around non-Latin issues with ASCII.

Each step has added cognitive load. It seems surprising to me that the next step wouldn't.


nostrademons 2 hours ago

Swift has a very sane way of doing this: Strings are sequences of Characters. Characters represent extended grapheme clusters. Iteration consumes as many bytes from the string as are necessary to read the next grapheme. In most cases, this is straight-up UTF-8 decoding, which can be implemented very efficiently; you only need to worry about extended grapheme clusters when you get to the non-ASCII subset of Unicode, so the code paths that require unicode table lookups are infrequently exercised.

String searching & matching return an opaque String.Index type. I assume that under the hood they do Boyer-Moore on bytes and String.Index is a byte offset, but the important thing is that String.Index values are not convertible to integers, and so you never run into the case where a user passes in a byte offset that would slice a grapheme in half. Instead, String.Index has properties for the next and previous index and a method to advance by an arbitrary amount, so you'd access the 7th character after a dash in a string as myString[myString.rangeOfString("-").advanceBy(7)]

Swift gives up on the idea that len(x+y) = len(x) + len(y). This is just something you have to remember; in UI programming, however, it makes a lot of sense because adding an accent to an existing string isn't going to make it take up more width in the text box. (x + y).utf8.count == x.utf8.count + y.utf8.count, however.

https://developer.apple.com/library/ios/documentation/Swift/...

If I were to adapt this to a domain like Python's, where string processing is pretty important, I'd allow indexing & slicing by integers (including negative integers), but I'd define them in terms of String.Index types, which iterate over extended graphemes under the hood:

  str[2] == str[str.start.advanceBy(2)]
  str[-3] == str[str.end.advanceBy(-3)]
  str[str.find('foo') + 3] == str[str.find('foo').advanceBy(3)]
  str[0:-3] == str.substring(str.start, str.end.advanceBy(-3))
  del str[0:3] == str.removeRange(str.start, str.start.advanceBy(3))
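A minimal Python sketch of the same idea (hypothetical names, and advancement is per code point rather than per grapheme to keep it short): the index wraps a byte offset into a UTF-8 buffer and exposes only navigation and slicing, never the raw integer.

    class OpaqueIndex:
        """A byte offset into a UTF-8 buffer, handed out only as an opaque handle."""

        def __init__(self, buf: bytes, offset: int = 0):
            self._buf = buf
            self._offset = offset

        def advance_by(self, n: int) -> "OpaqueIndex":
            off, step = self._offset, (1 if n >= 0 else -1)
            for _ in range(abs(n)):
                off += step
                # skip UTF-8 continuation bytes (0b10xxxxxx) to stay on boundaries
                while 0 <= off < len(self._buf) and (self._buf[off] & 0xC0) == 0x80:
                    off += step
            return OpaqueIndex(self._buf, off)

        def slice_to(self, other: "OpaqueIndex") -> str:
            return self._buf[self._offset:other._offset].decode("utf-8")

    buf = "héllo".encode("utf-8")
    start = OpaqueIndex(buf)
    print(start.slice_to(start.advance_by(2)))   # 'hé' -- never splits the 2-byte é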


Veedrac 2 hours ago

> Swift gives up on the idea that len(x+y) = len(x) + len(y). This is just something you have to remember; in UI programming, however, it makes a lot of sense because adding an accent to an existing string isn't going to make it take up more width in the text box. (x + y).utf8.count == x.utf8.count + y.utf8.count, however.

So do fullwidth characters have a length of 2?

If not, how is this useful for finding the size of text in a text box anyway?


nostrademons 2 hours ago

I'm speaking more to the idea that when you concatenate strings, various important properties might not change. To get the actual size in pixels, you'd do str.sizeWithAttributes([NSFontAttributeName: myFont]).width, which is a whole other can of worms.


lmm 2 hours ago

Python is an abstraction a long way above the Von Neumann machine. It should not expose implementation details like the number of bytes a string happens to be stored as. The whole point of the Python 2 -> 3 transition was to disallow just treating a string as a random pile of bytes and hoping it all works out.


---

summary of/notes on http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/ :

" Let's say we want to implement a simple cat In other terms, these are the applications we want to write in Python 2 terms:

    import sys
    import shutil

    for filename in sys.argv[1:]:
        f = sys.stdin
        if filename != '-':
            try:
                f = open(filename, 'rb')
            except IOError as err:
                print >> sys.stderr, 'cat.py: %s: %s' % (filename, err)
                continue
        with f:
            shutil.copyfileobj(f, sys.stdout)

...

Unicode in Unix

In Python 2 the above code is dead simple because you implicitly work with bytes everywhere. The command line arguments are bytes, the filenames are bytes (ignore Windows users for a moment) and the file contents are bytes too. Purists will point out that this is incorrect and really that's where the problem is coming from, but if you start thinking about it more, you will realize that this is an unfixable problem.

UNIX is bytes, has been defined that way and will always be that way. To understand why you need to see the different contexts in which data is being passed through:

    the terminal
    command line arguments
    the operating system io layer
    the filesystem driver

That btw, is not the only thing this data might be going through but let's go with this for the moment. In how many of the situations do we know an encoding? The answer is: in none of them.

...

 The C locale is the only locale that POSIX actually specifies and it says: encoding is ASCII and all responses from command line tools in regards to languages are like they are defined in the POSIX spec.

In the above case of our cat tool there is no other way than to treat this data as bytes. The reason for this is that there is no indication on the shell what the data is. For instance if you invoke cat hello.txt the terminal will pass hello.txt encoded in the encoding of the terminal to your application.

But now imagine the other case: echo *. The shell will now pass all the filenames of the current directory to your application. Which encoding are they in? In whatever encoding the filenames are in. There is no filename encoding!

...

Unicode Madness?

Now a Windows person will probably look at this and say: what the hell are the UNIX people doing. But it's not that dire or not dire at all. The reason this all works is because some clever people designed the system to be backwards compatible. Unlike Windows where all APIs are defined twice, on POSIX the best way to deal with all of this is to assume it's a byte mess that for display purposes is decoded with an encoding hint.

For instance let's take the case of the cat command above. As you might have noticed there is an error message for files it cannot open because they either don't exist or because they are protected or whatever else. In the simple case above let's assume the filename is latin1 garbage because it came from some external drive from 1995. The terminal will get our standard output and will try to decode it as utf-8 because that's what it thinks it's working with. Because that string is latin1 and not the right encoding it will now not decode properly. But fear not, nothing is crashing, because your terminal will just ignore the things it cannot deal with. It's clever like this.

What does this look like for GUIs? They keep two versions of each filename. When a GUI like Nautilus lists all files it makes a symbol for each file. It associates the internal bytes of that filename with the icon for double clicking, and secondly it attempts to make a filename it can show for display purposes, which might be decoded from something. For instance it will attempt decoding from utf-8, replacing decoding errors with question marks. Your filename might not be entirely readable, but you can still open the file. Success!

Unicode on UNIX is only madness if you force it on everything. But that's not how Unicode on UNIX works. UNIX does not have a distinction between unicode and byte APIs. They are one and the same, which makes them easy to deal with.

The C Locale

Nowhere does this show up as much as with the C locale. The C locale is the escape hatch of the POSIX specification for forcing everybody to behave the same. A POSIX compliant operating system needs to support setting LC_CTYPE to C and to force everything to be ASCII.

This locale is traditionally picked in a bunch of different situations. Primarily you will find this locale for any program launched from cron, your init system, subprocesses with an empty environment etc. The C locale restores a sane ASCII land in environments where you otherwise could not trust anything.

...

Python 3 Dies in Flames

Python 3 takes a very different stance on Unicode than UNIX does. Python 3 says: everything is Unicode (by default, except in certain situations, and except if we send you crazy reencoded data, and even then it's sometimes still unicode, albeit wrong unicode). Filenames are Unicode, Terminals are Unicode, stdin and out are Unicode, there is so much Unicode! And because UNIX is not Unicode, Python 3 now has the stance that it's right and UNIX is wrong, and people should really change the POSIX specification to add a C.UTF-8 encoding which is Unicode. And then filenames are Unicode, and terminals are Unicode and never ever will you see bytes again although obviously everything still is bytes and will fail.

And it's not just me saying this. These are bugs in Python related to this braindead idea of doing Unicode:

    ASCII is a bad filesystem default encoding
    Use surrogateescape as default error handler
    Python 3 raises Unicode errors in the C locale
    LC_CTYPE=C: pydoc leaves terminal in an unusable state (this is relevant to Click because the pager support is provided by the stdlib pydoc module)

But then if you Google around you will find so much more. Just check how many people failed to install their pip packages because the changelog had umlauts in it. Or because their home folder has an accent in it. Or because their SSH session negotiates ASCII, or because they are connecting from Putty. The list goes on and on.

...

Python 3 Cat

Now let's start fixing cat for Python 3. How do we do this? Well first of all we now established that we need to deal with bytes because someone might echo something which is not in the encoding the shell says. So at the very least the file contents need to be bytes. But then we also need to open the standard output to support bytes which it does not do by default. We also need to deal with the case separately where the Unicode APIs crap out on us because the encoding is C. So here it is, feature compatible cat for Python 3:

    import sys
    import shutil

    def _is_binary_reader(stream, default=False):
        try:
            return isinstance(stream.read(0), bytes)
        except Exception:
            return default

    def _is_binary_writer(stream, default=False):
        try:
            stream.write(b'')
        except Exception:
            try:
                stream.write('')
                return False
            except Exception:
                pass
            return default
        return True

    def get_binary_stdin():
        # sys.stdin might or might not be binary in some extra cases.  By
        # default it's obviously non binary which is the core of the
        # problem but the docs recommend changing it to binary for such
        # cases so we need to deal with it.  Also someone might put
        # StringIO there for testing.
        is_binary = _is_binary_reader(sys.stdin, False)
        if is_binary:
            return sys.stdin
        buf = getattr(sys.stdin, 'buffer', None)
        if buf is not None and _is_binary_reader(buf, True):
            return buf
        raise RuntimeError('Did not manage to get binary stdin')

    def get_binary_stdout():
        if _is_binary_writer(sys.stdout, False):
            return sys.stdout
        buf = getattr(sys.stdout, 'buffer', None)
        if buf is not None and _is_binary_writer(buf, True):
            return buf
        raise RuntimeError('Did not manage to get binary stdout')

    def filename_to_ui(value):
        # The bytes branch is unnecessary for *this* script but otherwise
        # necessary as python 3 still supports addressing files by bytes
        # through separate APIs.
        if isinstance(value, bytes):
            value = value.decode(sys.getfilesystemencoding(), 'replace')
        else:
            value = value.encode('utf-8', 'surrogateescape') \
                .decode('utf-8', 'replace')
        return value

    binary_stdout = get_binary_stdout()
    for filename in sys.argv[1:]:
        if filename != '-':
            try:
                f = open(filename, 'rb')
            except IOError as err:
                print('cat.py: %s: %s' % (
                    filename_to_ui(filename),
                    err
                ), file=sys.stderr)
                continue
        else:
            f = get_binary_stdin()

        with f:
            shutil.copyfileobj(f, binary_stdout)

And this is not the worst version. Not because I want to make things extra complicated but because it is complicated now. For instance what's not done in this example is to forcefully flush the text stdout before fetching the binary one. In this example it's not necessary because the print calls here go to stderr instead of stdout, but if you wanted to print to stdout instead, you would have to flush. Why? Because stdout is a buffer on top of another buffer, and if you don't flush it forcefully you might get output in the wrong order.

....

Dancing The Encoding Dance

To understand the life of a filename parameter coming from the shell, this is, btw, what happens in the Python 3 worst case (a small demonstration follows the list):

    the shell passes the filename as bytes to the script
    the bytes are being decoded from the expected encoding by Python before they ever hit your code. Because this is a lossy process, Python 3 applies a special error handler that encodes the decoding errors as surrogates into the string.
    the Python code then encounters a file-not-found error and needs to format an error message. Because we write to a text stream we cannot write surrogates out, as they are not valid unicode. Instead we now
    encode the unicode string with the surrogates to utf-8 and tell it to handle the surrogate escapes as is.
    then we decode from utf-8 and tell it to ignore errors.
    the resulting string now goes back out to our text only stream (stderr)
    after which the terminal will decode our string for displaying purposes.
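A small demonstration of that dance in Python 3; the filename here is assumed to be latin1 bytes that are not valid UTF-8:

    raw = b"caf\xe9.txt"                             # what the shell hands over
    name = raw.decode("utf-8", "surrogateescape")    # lossy decode, bad bytes become surrogates
    print(repr(name))                                # 'caf\udce9.txt'; printing name itself would raise
    shown = name.encode("utf-8", "surrogateescape").decode("utf-8", "replace")
    print(shown)                                     # 'caf\ufffd.txt' -- the é ends up as U+FFFD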

Here is what happens on Python 2:

    the shell passes the filename as bytes to the script.
    the shell decodes our string for displaying purposes.

And because no string handling happens anywhere there, the Python 2 version is just as correct, if not more correct, because the shell can then do a better job of showing the filename (for instance it could highlight the encoding errors if it wanted to; in the case of Python 3 we handle the encoding internally, so that is no longer possible for the shell to detect).

Note that this is not making the script less correct. In case you would need to do actual string handling on the input data you would switch to Unicode handling in 2.x or 3.x. But in that case you also want to support a --charset parameter on your script explicitly so the work is pretty much the same on 2.x and 3.x anyways. Just that it's worse because for that to work on 3.x you need to construct the binary stdout first which is unnecessary on 2.x.

...

Python 3 might be large enough that it will start to force UNIX to go the Windows route and enforce Unicode in many places, but really, I doubt it.

The much more likely thing to happen is that people stick to Python 2 or build broken stuff on Python 3. Or they go with Go. Which uses an even simpler model than Python 2: everything is a byte string. The assumed encoding is UTF-8. End of the story. "


note: unicode is not capable of storing (via escaping) arbitrary binary data (at least not without rolling your own escaping protocol, unless there's one in the standard that i'm unaware of; the point is that roundtripping plain bytes using the standard encoding and decoding functions is not the identity function; imo this is too bad, it wouldn't have been much effort for unicode to have supported this):

http://haacked.com/archive/2012/01/30/hazards-of-converting-binary-data-to-a-string.aspx/
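A short Python demonstration of this point (the 'surrogateescape' error handler does give you a round trip, but that is a Python convention layered on top, not something the Unicode standard provides):

    raw = b"\xff\xfe\x00\x01"
    try:
        raw.decode("utf-8")                          # strict decoding simply refuses
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err)
    lossy = raw.decode("utf-8", "replace").encode("utf-8")
    print(lossy == raw)                              # False: information was destroyed
    rt = raw.decode("utf-8", "surrogateescape").encode("utf-8", "surrogateescape")
    print(rt == raw)                                 # True, but only via Python's escape hatch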

---

" Python 2 tries to be helpful when working with unicode and byte strings. If you try to perform a string operation that combines a unicode string with a byte string, Python 2 will automatically decode the byte string to produce a second unicode string, then will complete the operation with the two unicode strings.

For example, we try to concatenate a unicode "Hello " with a byte string "world". The result is a unicode "Hello world". On our behalf, Python 2 is decoding the byte string "world" using the ASCII codec. The encoding used for these implicit decodings is the value of sys.getdefaultencoding().

The implicit encoding is ASCII because it's the only safe guess: ASCII is so widely accepted, and is a subset of so many encodings, that it's unlikely to produce false positives.

Implicit decoding errors

Of course, these implicit decodings are not immune to decoding errors. If you try to combine a byte string with a unicode string and the byte string can't be decoded as ASCII, then the operation will raise a UnicodeDecodeError.

This is the source of those painful UnicodeErrors?. Your code inadvertently mixes unicode strings and byte strings, and as long as the data is all ASCII, the implicit conversions silently succeed. Once a non-ASCII character finds its way into your program, an implicit decode will fail, causing a UnicodeDecodeError?. "