ideas-computer-jasper-jasperStringNotes1

rspeer 1 day ago

link

The thing I'm looking forward to in Python 3.4 is that you should be able to follow the wise advice about how to handle text in the modern era:

"Text is always Unicode. Read it in as UTF-8. Write it out as UTF-8. Everything in between just works."

This was not true up through 3.2, because Unicode in Python <= 3.2 was an abstraction that leaked some very unfortunate implementation details. There was the chance that you were on a "narrow build" of Python, where Unicode characters in memory were fixed to be two bytes long, so you couldn't perform most operations on characters outside the Basic Multilingual Plane. You could kind of fake it sometimes, but it meant you had to be thinking about "okay, how is this text really represented in memory" all the time, and explicitly coding around the fact that two different installations of Python with the same version number have different behavior.
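A quick way to see the leak the narrow build caused (a sketch; the exact results depend on how your interpreter was compiled):

    import sys

    s = '\U0001F600'   # a code point outside the Basic Multilingual Plane

    # Narrow build of Python <= 3.2: the code point is stored as a UTF-16
    # surrogate pair, so len(s) == 2 and s[0] is only half a character.
    # Wide build, or any Python >= 3.3: len(s) == 1.
    print(len(s))
    print(hex(sys.maxunicode))   # 0xffff on a narrow build, 0x10ffff on a wide one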

Python 3.3 switched to a flexible string representation that eliminated the need for narrow and wide builds. However, operations in this representation weren't tested well enough for non-BMP characters, so running something like text.lower() on arbitrary text could now give you a SystemError (http://bugs.python.org/issue18183).

With that bug fixed in Python 3.4, the last obstacle I know of to Unicode just working is gone.

reply

--

http://mortoray.com/2013/11/27/the-string-type-is-broken/

https://news.ycombinator.com/item?id=6807524

---

" agentultra 2 days ago

link

...

> Default unicode strings are obscenely annoying to me. Almost all of my code deals with binary data, parsing complex data structures, etc. The only "human readable" strings in my code are logs. Why the hell should I worry about text encoding before sending a string into a TCP socket...

> The fact that the combination of words "encode" and "UTF8" appear in my code, and str.encode('hex') is no longer available, is a very good representation of why I hate Python 3.

I'm afraid I don't understand your complaint. If you're parsing binary data then Python 3 is clearly superior to Python 2:

    >>> "Hello, Gådel".encode("utf-8")
    b'Hello, G\xc3\xa5del'

Seems much more reasonable than:

    >>> "Hello, Gådel".encode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)

Because they're not the same thing. Python 2 would implicitly "promote" a bytestring (the default literal) to a unicode object so long as it contained only ASCII bytes. Of course this gets really tiresome and leads to Python 2's "unicode dance." Armin seems to prefer that to the extra leg-work for correct unicode handling in Python 3 [0]; however, I think the trade-off is worth it and that the pain will fade when the wider world catches up.

http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/
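A concrete sketch of that promotion, as it behaves on Python 2.7:

    >>> 'abc' + u'def'          # ASCII bytes are silently decoded and promoted
    u'abcdef'
    >>> '\xc3\xa5' + u'def'     # non-ASCII bytes: the implicit decode blows up
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)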

...

If your Python 3 code is dealing with binary data, you would use byte strings and you would never have to call encode or touch UTF-8 before passing the byte string to a TCP socket.
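A minimal sketch of that (the host, port, and payload here are made up):

    import socket

    payload = b'\x00\x01\xff\xfe'           # binary protocol data: already bytes
    greeting = 'hello'                      # human-readable text: a str

    sock = socket.create_connection(('example.com', 9000))
    sock.sendall(payload)                   # goes straight out, no encoding involved
    sock.sendall(greeting.encode('utf-8'))  # only text needs encoding at the boundary
    sock.close()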

What you're saying about Unicode scares me. If you haven't already, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) before writing any libraries that I might depend on.

reply

keyme 2 days ago

link

> you would use byte strings and you would never have to call encode or touch UTF-8 before passing the byte string to a TCP socket.

I'll start by adding that it's also incredibly annoying to declare each string to be a byte string, if this wasn't clear from my original rant.

Additionally, your advice is broken. Take a look at this example, with the shelve module (from standard library).

    s = shelve.open('/tmp/a')
    s[b'key'] = 1

Results in:

    AttributeError: 'bytes' object has no attribute 'encode'

So in this case, my byte string can't be used as a key here, apparently. Of course, a string is expected, and this isn't really a string. My use case was trying to use a binary representation of a hash as the key. What's more natural than that? I could easily do that in Python 2. Not so easy now.

I can find endless examples for this, so your advice about "just using byte strings" is invalid. Conversions are inevitable. And this annoys me.

> What you're saying about Unicode scares me.

Yeah, I know full well what you're scared of. If I'm designing everything from scratch, using Unicode properly is easy. This, however, is not the case when implementing existing protocols, or reading file formats that don't use Unicode. That's where things begin being annoying when your strings are no longer strings.

deathanatos 2 days ago

link

> Additionally, your advice is broken. Take a look at this example, with the shelve module (from standard library).

His advice was sound, and referred to your example of TCP stream data (which is binary). Your example regards the shelve library.

> So in this case, my byte string can't be used as a key here, apparently.

shelve requires strings as keys. This is documented, though not particularly clearly.
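One workaround for the hash-as-key use case is to derive a text key from the digest, e.g. by hex-encoding it (a sketch; binascii.hexlify is one option, hashlib's hexdigest() another):

    import binascii, hashlib, shelve

    digest = hashlib.sha1(b'some payload').digest()    # raw bytes: not usable as a shelve key

    s = shelve.open('/tmp/a')
    s[binascii.hexlify(digest).decode('ascii')] = 1    # hex text key derived from the bytes
    s.close()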

> so your advice about "just using byte strings" is invalid.

Let me attempt to rephrase his advice. Use bytestrings where your data is a string of bytes. Use strings where your data is human-readable text. Convert to and from bytestrings when serializing to something like network or storage.

> Conversions are inevitable.

Absolutely, because bytestrings and (text)strings are two different types.

> And this annoys me.

There is no real alternative though, because there is no way to automatically convert between the two. Python 2 made many assumptions, and these were often invalid and led to bugs. Python 3 does not; in places where it does not have the required information, you must provide it.

> when implementing existing protocols, or reading file formats that don't use Unicode.

I'm afraid it's still the same. A protocol that uses unicode requires you to code something like "decode('utf-8')" (if UTF-8 is what it uses), one that does not requires "decode('whatever-it-uses-instead')". If it isn't clear what encoding the file format or protocol stores textual data in, then that's a bug with the file format or protocol, not Python. Regardless though, Python doesn't know (and can't know) what encoding the file or protocol uses.
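For instance, a format that stores its text fields as Latin-1 might be parsed like this (a sketch with a made-up record layout: 2-byte length, Latin-1 name, 4-byte integer):

    record = b'\x00\x05Mu\xf1oz\x00\x00\x00\x2a'

    name_len = int.from_bytes(record[0:2], 'big')
    name = record[2:2 + name_len].decode('latin-1')            # the format says Latin-1, so say so
    value = int.from_bytes(record[2 + name_len:][:4], 'big')   # 42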

"

tungwaiyip 1 day ago

link

Keyme, you and I are among the few who have serious concerns with the design of Python 3. I started to embrace it in a big way 6 months ago, when most of the libraries I use became available for Python 3. I wish I could say the wait is over, that we should all move to Python 3, and that everything will be great. Instead I find no compelling advantage. Maybe there will be one when I start to use unicode strings more. Instead I'm really annoyed by the default iterators and the binary string handling. I am afraid it is not a change for the good.

I come from the Java world, where people take a lot of care to implement things as streams. It was initially shocking to see Python read an entire file into memory and turn it into a list or other data structure with no regard to memory usage. Then I learned this works perfectly well when you have a small input; a few MB or so is a piece of cake for a modern computer. It takes all the hassle out of setting up streams in Java. You optimize when you need to. But for 90% of stuff, a materialized list works perfectly well.

Now Python has become more like Java in this respect. I can't do exploratory programming easily without adding list(). Many times I run into problems when I am building a complex data structure like a list of lists and end up getting a list of iterators. It takes the conciseness out of Python when I am forced to deal with iterators and to materialize the data.
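The gotcha looks roughly like this (a sketch): in Python 3 the map object is consumed once and is then silently empty:

    rows = map(str.split, ['a b', 'c d'])    # Python 3: an iterator, not a list

    print(list(rows))    # [['a', 'b'], ['c', 'd']]
    print(list(rows))    # [] -- already exhausted, easy to miss when exploring

    rows = list(map(str.split, ['a b', 'c d']))    # materialize once if you need reuse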

The other big problem is the binary string. Binary string handling is one of the great features of Python. It is so much friendlier to manipulate binary data in Python compared to C or Java. In Python 3, it is pretty much broken. It would be an easy transition if I only needed to add a 'b' prefix to specify a binary string literal. But in fact, the operations on binary strings are so different from regular strings that it is just broken.

  In [38]: list('abc')
  Out[38]: ['a', 'b', 'c']
  
  In [37]: list(b'abc')       # string become numbers??
  Out[37]: [97, 98, 99]
  
  In [43]: ''.join('abc')
  Out[43]: 'abc'
  
  In [44]: ''.join(b'abc')    # broken, no easy way to join them back into string
  ---------------------------------------------------------------------------
  TypeError                                 Traceback (most recent call last)
  <ipython-input-44-fcdbf85649d1> in <module>()
  ----> 1 ''.join(b'abc')
  
  TypeError: sequence item 0: expected str instance, int found

reply

keyme 1 day ago

link

Yes! Thank you.

All the other commenters here that are explaining things like using a list() in order to print out an iterator are missing the point entirely.

The issue is "discomfort". Of course you can write code that makes everything work again. This isn't the issue. It's just not "comfortable". This is a major step backwards in a language that is used 50% of the time in an interactive shell (well, at least for some of us).

reply

agentultra 1 day ago

link

The converse problem is having to write iterator versions of map, filter, and other eagerly-evaluated builtins. You can't just write:

    >>> t = iter(map(lambda x: x * x, xs))

Because the map() call is eagerly evaluated. It's much easier to exhaust an iterator in the list constructor, and it leads to a consistent iteration API throughout the language.

If that makes your life hard then I feel sorry for you, son. I've got 99 problems but a list constructor ain't one.

The Python 3 bytes object is not intended to be the same as the Python 2 str object. They're completely separate concepts. Any comparison is moot.

Think of the bytes object as a dynamic char[] and you'll be less inclined to confusion and anger:

    >>> list(b'abc')
    [97, 98, 99]

That's not a list of numbers... that's a list of bytes!

    >>> "".join(map(lambda byte: chr(byte), b'abc'))
    'abc'

And you get a string!

reply

tungwaiyip 13 hours ago

link

What you are looking for is imap(). In Python 2 there is an entire collection of iterator variants. You can choose to use either the list or the iterator variants.

The problem with Python 3 is that the list versions are removed. You are forced to use iterators all the time. Things become inconvenient and ugly as a result. Bugs are regularly introduced because I forget to apply list().

  >>> "".join(map(lambda byte: chr(byte), b'abc'))

Compared to ''.join('abc'), this is what I call fuck'd. Luckily maxerickson suggested a better method.
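For reference, the split in Python 2 looks like this (a sketch; Python 3's builtin map behaves like the old imap, and the eager version is gone):

    # Python 2
    from itertools import imap

    squares_list = map(lambda x: x * x, range(3))     # eager: the list [0, 1, 4]
    squares_iter = imap(lambda x: x * x, range(3))    # lazy: an iterator

    # In Python 3 you write list(map(...)) when you actually want the list.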

reply

dded 15 hours ago

link
> >>> list(b'abc')
> [97, 98, 99]

> That's not a list of numbers... that's a list of bytes!

No, it's a list of numbers:

  >>> type(list(b'abc')[0])
  <class 'int'>

I think the GP mis-typed his last example. First, he showed that .join('abc') takes a string, busts it up, then concatenates it back to a string. Then, with .join(b'abc'), he appears to want to bust up a byte string and concatenate it back to a text string. But I suspect he meant to type this:

  >>> b''.join(b'abc')

That is, bust up a byte string and concatenate it back to what you started with: a byte string. But that doesn't work; when you bust up a byte string you get a list of ints, and you cannot concatenate them back into a byte string (at least not elegantly).

reply

itsadok 1 day ago

link

> The converse problem is having to write iterator versions of map, filter, and other eagerly-evaluated builtins

Well, in Python 2 you just use imap instead of map. That way you have both options, and you can be explicit rather than implicit.

> That's not a list of numbers... that's a list of bytes!

The point being made here is not that some things are not possible in Python 3, but rather that things that are natural in Python 2 are ugly in 3. I believe you're proving the point here. The idea that b'a'[0] == 97 in such a fundamental way that I might get one when I expected the other may be fine in C, but I hold Python to a higher standard.

reply

maxerickson 1 day ago

link
  >>> bytes(list(b'abc'))
  b'abc'
  >>>

That is, the way to turn a list of ints into a byte string is to pass it to the bytes constructor.

(This narrowly addresses that concern, I'd readily concede that the new API is going to have situations where it is worse)

reply

tungwaiyip 14 hours ago

link

Thank you. This is good to know. I was rather frustrated to find binary data handling being changed with no easy translation in Python 3.

Here is another annoyance:

    In [207]: 'abc'[0] + 'def'
    Out[207]: 'adef'
    In [208]: b'abc'[0] + b'def'
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-208-458c625ec231> in <module>()
    ----> 1 b'abc'[0] + b'def'
    TypeError: unsupported operand type(s) for +: 'int' and 'bytes'

reply

maxerickson 7 hours ago

link

I don't have enough experience with either version to debate the merits of the choice, but the way forward with python 3 is to think of bytes objects as more like special lists of ints, where if you want a slice (instead of a single element) you have to ask for it:

    >>> [1,2,3][0]
    1
    >>> [1,2,3][0:1]
    [1]
    >>> b'abc'[0]
    97
    >>> b'abc'[0:1]
    b'a'
    >>> 

So the construction you want is just:

    >>> b'abc'[0:1]+b'def'
    b'adef'

Which is obviously worse if you are doing it a bunch of times, but it is at least coherent with the view that bytes are just collections of ints (and there are situations where an indexing operation returning an int is going to be more useful).

reply

tungwaiyip 4 hours ago

link

In Java, String and char are two separate types. In Python, there is no separate char type; a character is simply a string of length 1. I do not have a great theory to show which design is better. I can only say the Python design worked great for me in the past (for both text and binary strings), and I suspect it is the more user-friendly design of the two.

So in Python 3 the design of the binary string has changed. Unlike the old string, a byte and a binary string of length 1 are no longer the same. Working code is broken, practices have to change, and often the result is more complicated code (like [0] becoming [0:1]). All this happens with no apparent benefit other than being more "coherent" in the eyes of some people. This is the frustration I see after using Python 3 for some time.

reply

keyme 2 days ago

link

If I recall, writing no boilerplate code was a big deal in python once...

And while 2 lines are not worth my rant, writing those 2 lines again and again all the time, is.

reply

baq 1 day ago

link

i'd argue this is not boilerplate, more like a shortcut for your particular use case:

    import codecs
    enc = lambda x: codecs.encode(x, 'hex')
    

i have a program in python 2 that uses this approach, because i have a lot of decoding from utf and encoding to a different charset to do. python 3 is absolutely the same for me.
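For the original str.encode('hex') complaint, the Python 3 spellings are short too (a sketch; binascii works on both 2.x and 3.x, and codecs.encode with the bytes-to-bytes codec works on 3.x):

    import binascii, codecs

    data = b'\xde\xad\xbe\xef'

    binascii.hexlify(data)              # b'deadbeef'
    codecs.encode(data, 'hex_codec')    # b'deadbeef'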

reply

gkya 1 day ago

link

> Why the hell should I worry about text encoding before sending a string into a TCP socket...

A string represents a snippet of human-readable text and is not merely an array of bytes in a sane world. Thus it is fine & sane to have to encode a string before sticking it into a socket, as sockets are used to transfer bytes from point a to b, not text.

reply

dded 1 day ago

link

Not arguing that you're wrong, but Unix/Linux is not a sane world by your definition. Whether we like it or not (I do like it), this is the world many of us live in. Python3 adds a burden in this world where none existed in Python2. In exchange, there is good Unicode support, but not everyone uses that. I can't help but wonder if good Unicode support could have been added in a way that preserved Python2 convenience with Unix strings.

(Please note that I'm not making any statement as to what's appropriate to send down a TCP socket.)

reply

agentultra 1 day ago

link

ASCII by default is only an accident of history. It's going to be a slow, painful process but all human-readable text is going to be Unicode at some point. For historical reasons you'll still have to encode a vector of bytes full of character information to send it down the pipe but there's no reason why we shouldn't be explicit about it.

The pain [in Python 3] is felt primarily by library authors and only at the extremities. If you author your libraries properly, your users won't even notice the difference. And in the end, as more protocols and operating systems adopt better encodings for Unicode support, that pain will fade (I'm looking at you, surrogateescape).

It's better to be ahead of the curve on this transition so that users of the language and our libraries won't get stuck with it. Python 2 made users have to think (or forget) about Unicode (and get it wrong every time... the sheer amount of work I've put into fixing codebases that mixed bytes and unicode objects without thinking about it made me a lot of money but cost me a few years of my life, I'm sure).

reply

dded 1 day ago

link

I was careful to say "Unix strings", not "ASCII". A Unix string contains no nul byte, but that's about the only rule. It's certainly not necessarily human-readable.

I don't think a programming language can take the position that an OS needs to "adopt better encodings". Python must live in the environment that the OS actually provides. It's probably a vain hope that Unix strings will vanish in anything less than decades (if ever), given the ubiquity of Unix-like systems and their 40 years of history.

I understand that Python2 does not handle Unicode well. I point out that Python3 does not handle Unix strings well. It would be good to have both.

reply

gkya 1 day ago

link

> I was careful to say "Unix strings"

This is the first time I have encountered the idiom "Unix strings". I'll map it to "array of bytes" in my table of idioms.

> I don't think a programming language can take the position that an OS needs to "adopt better encodings".

I do think that programming languages should take a position on things, including but not limited to how data is represented and interpreted. A language is expected to provide some abstractions, and whether a string is an array of bytes or an array of characters is a consideration for the language designer, who will end up designing a language that takes one side or the other.

Python has taken the side of the language user: it enabled Unicode names, defaulted to Unicode strings, defaulted to classes being subclasses of the 'object' class... Unix has taken the side of the machine (which was the side to take at the time of Unix's inception).

> [...] probably a vain hope that Unix strings will vanish [...]

If only we wait for them to vanish, doing nothing to improve.

> Python must live in the environment that the OS actually provides.

Yes, Python must indeed live in the OS' environment. Regardless, one need not be a farmer just because they live among farmers, need they?

reply

dded 1 day ago

link

> This is the first time I encounter the idiom Unix strings

The usual idiom is C-strings, but I wanted to emphasize the OS, not the language C.

>> [...] probably a vain hope that Unix strings will vanish [...]

> If only we wait for them to vanish, doing nothing to improve.

The article is about the lack of Python3 adoption. In my case, Python3's poor handling of Unix/C strings is friction. It sounds like you believe that Unix/C strings can be made to go away in the near future. I do not believe this. (I'm not even certain that it's a good idea.)

reply

gkya 18 hours ago

link

I do not insist that C strings must die; I insist that C strings are indeed arrays of bytes, and that we cannot use them to represent text correctly at present. I fully support strings being Unicode-by-default in Python, as most people will put text between double quotes, not a bunch of bytes represented by textual characters.

I do not expect the C or Unix interpretations of strings to change, but I believe they must be considered low-level, and that the higher-level language user should have to explicitly request that the compiler interpret a piece of data in that fashion.

My first name is "Göktuğ". Honestly, which one of the following do you think is more desirable for me?

  Python 2.7.4 (default, Sep 26 2013, 03:20:26) 
  >>> "Göktuğ"
  'G\xc3\xb6ktu\xc4\x9f'

or

  Python 3.3.1 (default, Sep 25 2013, 19:29:01) 
  >>> "Göktuğ"
  'Göktuğ'

reply

dded 16 hours ago

link

I'm not arguing against you. I just don't write any code that has to deal with people's names, so that's just not a problem that I face. I fully acknowledge that lack of Unicode is a big problem of Python2, but it's not my problem.

A Unix filename, on the other hand, might be any sort of C string. This sort of thing is all over Unix, not just filenames. (When I first ever installed Python3 at work back when 3.0 (3.1?) came out, one of the self tests failed when it tried to read an unkosher string in our /etc/passwd file.) When I code with Python2, or Perl, or C, or Emacs Lisp, I don't need to worry about these C strings. They just work.

My inquiry, somewhere up this thread, is whether or not it would be possible to solve both problems. (Perhaps by defaulting to utf-8 instead of ASCII. I don't know, I'm not a language designer.)
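(For what it's worth, the surrogateescape error handler mentioned upthread is Python 3's partial answer here: arbitrary non-UTF-8 bytes can be round-tripped through a str without an exception. A sketch, assuming Python 3.1+:)

    raw = b'legacy\xff\xfename'                                     # not valid UTF-8

    name = raw.decode('utf-8', errors='surrogateescape')            # undecodable bytes become lone surrogates
    assert name.encode('utf-8', errors='surrogateescape') == raw    # lossless round trip

    # The os module uses this handler for filenames on most Unix systems, which is
    # how os.listdir() can return names whose underlying bytes aren't valid UTF-8.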

reply

dded 15 hours ago

link

> I insist that C strings are indeed arrays of bytes, and we cannot use them to represent text correctly at present

OK, maybe I do see one small point to argue. A C string, such as one that might be used in Unix, is not necessarily text. But text, represented as utf-8, is a C string.

It seems like there's something to leverage here, at least for those points at which Python3 interacts with the OS.

reply

--

nabla9 7 hours ago

link

UTF-8 is usually good enough on disk.

I would like to have at least two options in memory: UTF-8 and a vector of displayed characters (there are many character combinations in use in existing modern languages that have no single-character representation in UTF-<anything>).

reply

Guvante 7 hours ago

link

Do you need a vector of displayed characters?

Usually all you care about is the rendered size, which your rendering engine should be able to tell you. No need to be able to pick out those characters in most situations.

reply

nabla9 6 hours ago

link

Yes. If I want to work with language and do some stringology, that's what I want. I might want to swap some characters, find the length of words, etc. Having a vector of characters (in the sense of what humans consider characters) is valuable.
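There is no stdlib grapheme iterator, but a rough approximation of "what humans consider characters" can be sketched with unicodedata (real segmentation is defined by Unicode's UAX #29 and is considerably more involved):

    import unicodedata

    def naive_graphemes(text):
        # Attach combining marks to the preceding base character.
        clusters = []
        for ch in text:
            if clusters and unicodedata.combining(ch):
                clusters[-1] += ch
            else:
                clusters.append(ch)
        return clusters

    print(naive_graphemes('G\u00f6ktu\u011f'))      # precomposed: 6 clusters
    print(naive_graphemes('Go\u0308ktug\u0306'))    # decomposed: still 6 clusters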

reply

pornel 3 hours ago

link

> To have vector of characters (as what humans consider characters) is valuable.

That might be an awful can of worms. Are Arabic vowels characters? "ij" letter in Dutch? Would you separate Korean text into letters or treat each block of letters as a character?

reply

---

optimiz3 7 hours ago

link

Most of the post talks about how Windows made a poor design decision in choosing 16-bit characters.

No debate there.

However, advocating "just make windows use UTF8" ignores the monumental engineering challenge and legacy back-compat issues.

In Windows most APIs have FunctionA and FunctionW versions, with FunctionA meaning legacy ASCII/ANSI and FunctionW meaning Unicode. You couldn't really fix this without adding a third version that was truly UTF-8; anything else would break lots of apps in subtle ways.

Likely it would also only be available to Windows 9 compatible apps if such a feature shipped.

No dev wanting to make money is going to ship software that only targets Windows 9, so the entire ask is tough to sell.

Still no debate on the theoretical merits of UTF-8 though.

reply

gamacodre 3 hours ago

link

We "solved" (worked around? hacked?) this by creating a set of FunctionU macros and in some cases stubs that wrap all of the Windows entry points we use with incoming and outgoing converters. It's ugly under the hood and a bit slower than it needs to be, but the payoff of the app being able to consistently "think" in UTF-8 has been worth it.

Of course, we had to ditch resource-based string storage anyway for other cross-platform reasons, and were never particularly invested in the "Windows way" of doing things, so it wasn't a big shock to our developers when we made this change.

reply

angersock 7 hours ago

link

Nothing worth doing is easy.

Anyways, the FunctionA/FunctionW