upvote
rspeer 1 day ago
link |
The thing I'm looking forward to in Python 3.4 is that you should be able to follow the wise advice about how to handle text in the modern era:
"Text is always Unicode. Read it in as UTF-8. Write it out as UTF-8. Everything in between just works."
This was not true up through 3.2, because Unicode in Python <= 3.2 was an abstraction that leaked some very unfortunate implementation details. There was the chance that you were on a "narrow build" of Python, where Unicode characters in memory were fixed to be two bytes long, so you couldn't perform most operations on characters outside the Basic Multilingual Plane. You could kind of fake it sometimes, but it meant you had to be thinking about "okay, how is this text really represented in memory" all the time, and explicitly coding around the fact that two different installations of Python with the same version number have different behavior.
Python 3.3 switched to a flexible string representation that eliminated the need for narrow and wide builds. However, operations in this representation weren't tested well enough for non-BMP characters, so running something like text.lower() on arbitrary text could now give you a SystemError?