proj-oot-old-150618-ootStringNotes1

rspeer 1 day ago

The thing I'm looking forward to in Python 3.4 is that you should be able to follow the wise advice about how to handle text in the modern era:

"Text is always Unicode. Read it in as UTF-8. Write it out as UTF-8. Everything in between just works."

This was not true up through 3.2, because Unicode in Python <= 3.2 was an abstraction that leaked some very unfortunate implementation details. There was the chance that you were on a "narrow build" of Python, where Unicode characters in memory were fixed to be two bytes long, so you couldn't perform most operations on characters outside the Basic Multilingual Plane. You could kind of fake it sometimes, but it meant you had to be thinking about "okay, how is this text really represented in memory" all the time, and explicitly coding around the fact that two different installations of Python with the same version number have different behavior.

Python 3.3 switched to a flexible string representation that eliminated the need for narrow and wide builds. However, operations in this representation weren't tested well enough for non-BMP characters, so running something like text.lower() on arbitrary text could now give you a SystemError (http://bugs.python.org/issue18183).

With that bug fixed in Python 3.4, the last thing I know of standing in the way of Unicode just working is gone.
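A quick way to see the difference being described (editor's sketch; the exact values assume CPython):

    # On a narrow build (possible through Python 3.2) sys.maxunicode is 0xffff and a
    # non-BMP character is stored as a surrogate pair; from 3.3 on it is always 0x10ffff.
    import sys

    print(hex(sys.maxunicode))   # 0xffff on a narrow build, 0x10ffff on a wide build / 3.3+

    s = "\U0001F423"             # HATCHING CHICK, outside the Basic Multilingual Plane
    print(len(s))                # 2 on a narrow build (surrogate pair), 1 on 3.3+
    print(s.lower())             # the kind of call that hit http://bugs.python.org/issue18183 on early 3.3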


--

http://mortoray.com/2013/11/27/the-string-type-is-broken/

https://news.ycombinator.com/item?id=6807524

---

" agentultra 2 days ago

link

...

> Default unicode strings are obscenely annoying to me. Almost all of my code deals with binary data, parsing complex data structures, etc. The only "human readable" strings in my code are logs. Why the hell should I worry about text encoding before sending a string into a TCP socket...

> The fact that the combination of words "encode" and "UTF8" appear in my code, and str.encode('hex') is no longer available, is a very good representation of why I hate Python 3.

I'm afraid I don't understand your complaint. If you're parsing binary data then Python 3 is clearly superior to Python 2:

    >>> "Hello, Gådel".encode("utf-8")
    b'Hello, G\xc3\xa5del'

Seems much more reasonable than:

    >>> "Hello, Gådel".encode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)

Because they're not the same thing. Python 2 would implicitly "promote" a bytestring (the default literal) to a unicode object so long as it contained ASCII bytes. Of course this gets really tiresome and leads to Python 2's "unicode dance." Armin seems to prefer it to the extra leg-work for correct unicode handling in Python 3 [0]; however, I think the trade-off is worth it and that pain will fade when the wider world catches up.

http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/

...

If your Python 3 code is dealing with binary data, you would use byte strings and you would never have to call encode or touch UTF-8 before passing the byte string to a TCP socket.
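A minimal sketch of that claim (the host and port here are made up): bytes go to the socket untouched; only text needs an explicit encode.

    import socket

    payload = b"\x00\x01\xff\xfe"                 # binary data: no encoding step at all
    greeting = "Hello, Gådel".encode("utf-8")     # text crosses the boundary explicitly

    with socket.create_connection(("example.org", 7)) as sock:
        sock.sendall(payload)
        sock.sendall(greeting)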

What you're saying about Unicode scares me. If you haven't already, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) before writing any libraries that I might depend on.

keyme 2 days ago

> you would use byte strings and you would never have to call encode or touch UTF-8 before passing the byte string to a TCP socket.

I'll start by adding that it's also incredibly annoying to declare each string to be a byte string, if this wasn't clear from my original rant.

Additionally, your advice is broken. Take a look at this example, with the shelve module (from standard library).

    import shelve

    s = shelve.open('/tmp/a')
    s[b'key'] = 1

Results in:

    AttributeError: 'bytes' object has no attribute 'encode'

So in this case, my byte string can't be used as a key here, apparently. Of course, a string is expected, and this isn't really a string. My use case was trying to use a binary representation of a hash as the key here. What's more natural than that. Could easily do that in Python 2. Not so easy now.

I can find endless examples for this, so your advice about "just using byte strings" is invalid. Conversions are inevitable. And this annoys me.

> What you're saying about Unicode scares me.

Yeah, I know full well what you're scared of. If I'm designing everything from scratch, using Unicode properly is easy. This, however, is not the case when implementing existing protocols, or reading file formats that don't use Unicode. That's where things begin being annoying when your strings are no longer strings.

deathanatos 2 days ago

> Additionally, your advice is broken. Take a look at this example, with the shelve module (from standard library).

His advice was sound, and referred to your example of TCP stream data (which is binary). Your example regards the shelve library.

> So in this case, my byte string can't be used as a key here, apparently.

shelve requires strings as keys. This is documented, though not particularly clearly.

> so your advice about "just using byte strings" is invalid.

Let me attempt to rephrase his advice. Use bytestrings where your data is a string of bytes. Use strings where your data is human-readable text. Convert to and from bytestrings when serializing to something like network or storage.
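A one-screen restatement of that rule (editor's sketch): str inside the program, bytes at the serialization boundary, and the two round-trip exactly.

    name = "Göktuğ"                        # human-readable text: str
    wire = name.encode("utf-8")            # what actually goes to network/storage: bytes
    assert wire == b"G\xc3\xb6ktu\xc4\x9f"
    assert wire.decode("utf-8") == name    # and back again at the other end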

> Conversions are inevitable.

Absolutely, because bytestrings and (text)strings are two different types.

> And this annoys me.

There is no real alternative though, because there is no way to automatically convert between the two. Python 2 made many assumptions, and these were often invalid and led to bugs. Python 3 does not; in places where it does not have the required information, you must provide it.

> when implementing existing protocols, or reading file formats that don't use Unicode.

I'm afraid it's still the same. A protocol that uses unicode requires you to code something like "decode('utf-8')" (if UTF-8 is what it uses), one that does not requires "decode('whatever-it-uses-instead')". If it isn't clear what encoding the file format or protocol stores textual data in, then that's a bug with the file format or protocol, not Python. Regardless though, Python doesn't know (and can't know) what encoding the file or protocol uses.

"

tungwaiyip 1 day ago

Keyme, you and I are among the few who have serious concerns with the design of Python 3. I started to embrace it in a big way 6 months ago, when most libraries I use became available for Python 3. I wish I could say the wait is over, that we should all move to Python 3, and that everything will be great. Instead I find no compelling advantage. Maybe there will be one when I start to use unicode strings more. For now I'm really annoyed by the default iterators and the binary string handling. I am afraid it is not a change for the good.

I come from the Java world, where people take a lot of care to implement things as streams. It was initially shocking to see Python read an entire file into memory and turn it into a list or other data structure with no regard to memory usage. Then I learned this works perfectly well when you have a small input; a few MB or so is a piece of cake for a modern computer. It takes all the hassle out of setting up streams in Java. You optimize when you need to. But for 90% of stuff, a materialized list works perfectly well.

Now Python has become more like Java in this respect. I can't do exploratory programming easily without adding list(). Many times I run into problems when I am building a complex data structure like a list of lists, and end up getting a list of iterators. It takes the conciseness out of Python when I am forced to deal with iterators and to materialize the data.

The other big problem is the binary string. Binary string handling is one of the great features of Python. It is so much more friendly to manipulate binary data in Python compared to C or Java. In Python 3, it is pretty much broken. It would be an easy transition if I only needed to add a 'b' prefix to specify a binary string literal. But in fact, the operations on binary strings are so different from regular strings that it is just broken.

  In [38]: list('abc')
  Out[38]: ['a', 'b', 'c']
  
  In [37]: list(b'abc')       # string become numbers??
  Out[37]: [97, 98, 99]
  
  In [43]: ''.join('abc')
  Out[43]: 'abc'
  
  In [44]: ''.join(b'abc')    # broken, no easy way to join them back into string
  ---------------------------------------------------------------------------
  TypeError                                 Traceback (most recent call last)
  <ipython-input-44-fcdbf85649d1> in <module>()
  ----> 1 ''.join(b'abc')
  
  TypeError: sequence item 0: expected str instance, int found

keyme 1 day ago

Yes! Thank you.

All the other commenters here that are explaining things like using a list() in order to print out an iterator are missing the point entirely.

The issue is "discomfort". Of course you can write code that makes everything work again. This isn't the issue. It's just not "comfortable". This is a major step backwards in a language that is used 50% of the time in an interactive shell (well, at least for some of us).

agentultra 1 day ago

The converse problem is having to write iterator versions of map, filter, and other eagerly-evaluated builtins. You can't just write:

    >>> t = iter(map(lambda x: x * x, xs))

Because the map() call is eagerly evaluated. It's much easier to exhaust an iterator with the list constructor, and it leads to a consistent iteration API throughout the language.
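For readers following along, this is the Python 3 behaviour being referred to (a quick sketch):

    # map() and filter() return lazy iterators; materializing a list is an explicit,
    # separate step, and an iterator can only be consumed once.
    squares = map(lambda x: x * x, range(5))
    print(squares)         # <map object at 0x...>: nothing has been computed yet
    print(list(squares))   # [0, 1, 4, 9, 16]: exhausted by the list constructor
    print(list(squares))   # []: already consumed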

If that makes your life hard then I feel sorry for you, son. I've got 99 problems but a list constructor ain't one.

The Python 3 bytes object is not intended to be the same as the Python 2 str object. They're completely separate concepts. Any comparison is moot.

Think of the bytes object as a dynamic char[] and you'll be less inclined to confusion and anger:

    >>> list(b'abc')
    [97, 98, 99]

That's not a list of numbers... that's a list of bytes!

    >>> "".join(map(lambda byte: chr(byte), b'abc'))
    'abc'

And you get a string!
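(Editor's note: assuming the bytes actually hold text in a known encoding, bytes.decode is the more usual spelling than mapping chr() over the ints, which amounts to Latin-1 decoding.)

    print(b'abc'.decode('ascii'))                # 'abc'
    print(bytes([97, 98, 99]).decode('ascii'))   # 'abc', starting from a list of ints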

tungwaiyip 13 hours ago

What you are looking for is imap(). In Python 2 there is an entire collection of iterator variants. You can choose to use either the list or the iterator variants.

The problem with Python 3 is that the list versions are removed. You are forced to use iterators all the time. Things become inconvenient and ugly as a result. Bugs are regularly introduced because I forget to apply list().

  >>> "".join(map(lambda byte: chr(byte), b'abc'))

Compared to ''.join('abc'), this is what I call fuck'd. Luckily maxerickson suggested a better method.

dded 15 hours ago
>     >>> list(b'abc')
>     [97, 98, 99]
> That's not a list of numbers... that's a list of bytes!

No, it's a list of numbers:

  >>> type(list(b'abc')[0])
  <class 'int'>

I think the GP mis-typed his last example. First, he showed that ''.join('abc') takes a string, busts it up, then concatenates it back to a string. Then, with ''.join(b'abc'), he appears to want to bust up a byte string and concatenate it back to a text string. But I suspect he meant to type this:

  >>> b''.join(b'abc')

That is, bust up a byte string and concatenate back to what you started with: a byte string. But that doesn't work: when you bust up a byte string you get a list of ints, and you cannot concatenate them back to a byte string (at least not elegantly).

itsadok 1 day ago

> The converse problem is having to write iterator versions of map, filter, and other eagerly-evaluated builtins

Well, in Python 2 you just use imap instead of map. That way you have both options, and you can be explicit rather than implicit.

> That's not a list of numbers... that's a list of bytes!

The point being made here is not that some things are not possible in Python 3, but rather that things that are natural in Python 2 are ugly in 3. I believe you're proving the point here. The idea that b'a'[0] == 97 in such a fundamental way that I might get one when I expected the other may be fine in C, but I hold Python to a higher standard.

maxerickson 1 day ago
  >>> bytes(list(b'abc'))
  b'abc'
  >>>

That is, the way to turn a list of ints into a byte string is to pass it to the bytes object.

(This narrowly addresses that concern, I'd readily concede that the new API is going to have situations where it is worse)

tungwaiyip 14 hours ago

Thank you. This is good to know. I was rather frustrated to find binary data handling being changed with no easy translation in Python 3.

Here is another annoyance:

    In [207]: 'abc'[0] + 'def'
    Out[207]: 'adef'
    In [208]: b'abc'[0] + b'def'
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-208-458c625ec231> in <module>()
    ----> 1 b'abc'[0] + b'def'
    TypeError: unsupported operand type(s) for +: 'int' and 'bytes'

maxerickson 7 hours ago

I don't have enough experience with either version to debate the merits of the choice, but the way forward with python 3 is to think of bytes objects as more like special lists of ints, where if you want a slice (instead of a single element) you have to ask for it:

    >>> [1,2,3][0]
    1
    >>> [1,2,3][0:1]
    [1]
    >>> b'abc'[0]
    97
    >>> b'abc'[0:1]
    b'a'
    >>> 

So the construction you want is just:

    >>> b'abc'[0:1]+b'def'
    b'adef'

Which is obviously worse if you are doing it a bunch of times, but it is at least coherent with the view that bytes are just collections of ints (and there are situations where indexing operations returning an int is going to be more useful).

tungwaiyip 4 hours ago

In Java, String and char are two separate types. In Python, there is no separate char type; a character is simply a string of length 1. I do not have a great theory to show which design is better. I can only say the Python design worked great for me in the past (for both text and binary strings), and I suspect it is the more user-friendly design of the two.

So in Python 3 the design of the binary string has changed. Unlike the old string, a single byte and a binary string of length 1 are not the same thing. Working code is broken, practices have to be changed, and it often involves more complicated code (like [0] becoming [0:1]). All this happens with no apparent benefit other than being more "coherent" in the eyes of some people. This is the frustration I see after using Python 3 for some time.

keyme 2 days ago

If I recall, writing no boilerplate code was a big deal in python once...

And while 2 lines are not worth my rant, writing those 2 lines again and again all the time is.

baq 1 day ago

i'd argue this is not boilerplate, more like a shortcut for your particular use case:

    import codecs
    enc = lambda x: codecs.encode(x, 'hex')
    

i have a program in python 2 that uses this approach, because i have a lot of decoding from utf and encoding to a different charset to do. python 3 is absolutely the same for me.
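(Editor's sketch of how Python 2's str.encode('hex') translates to Python 3, via the codecs shortcut above or the binascii / bytes.hex alternatives.)

    import binascii
    import codecs

    data = b"\xde\xad\xbe\xef"
    print(codecs.encode(data, "hex"))   # b'deadbeef' (the shortcut above, applied to bytes)
    print(binascii.hexlify(data))       # b'deadbeef'
    print(data.hex())                   # 'deadbeef' (bytes.hex, added in Python 3.5)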

gkya 1 day ago

> Why the hell should I worry about text encoding before sending a string into a TCP socket...

A string represents a snippet of human-readable text and is not merely an array of bytes in a sane world. Thus it is fine & sane to have to encode a string before sticking it into a socket, as sockets are used to transfer bytes from point a to b, not text.

dded 1 day ago

Not arguing that you're wrong, but Unix/Linux is not a sane world by your definition. Whether we like it or not (I do like it), this is the world many of us live in. Python3 adds a burden in this world where none existed in Python2. In exchange, there is good Unicode support, but not everyone uses that. I can't help but wonder if good Unicode support could have been added in a way that preserved Python2 convenience with Unix strings.

(Please note that I'm not making any statement as to what's appropriate to send down a TCP socket.)

agentultra 1 day ago

ASCII by default is only an accident of history. It's going to be a slow, painful process but all human-readable text is going to be Unicode at some point. For historical reasons you'll still have to encode a vector of bytes full of character information to send it down the pipe but there's no reason why we shouldn't be explicit about it.

The pain is painful [in Python 3] primarily for library authors and only at the extremities. If you author your libraries properly your users won't even notice the difference. And in the end as more protocols and operating systems adopt better encodings for Unicode support that pain will fade (I'm looking at you, surrogateescape).
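(Editor's sketch of the surrogateescape mechanism mentioned above: undecodable bytes are smuggled through str as lone surrogates and restored unchanged on encode.)

    raw = b"ok \xff bytes"                            # not valid UTF-8
    text = raw.decode("utf-8", "surrogateescape")
    print(text)                                       # 'ok \udcff bytes'
    assert text.encode("utf-8", "surrogateescape") == raw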

It's better to be ahead of the curve on this transition so that users of the language and our libraries won't get stuck with it. Python 2 made users have to think (or forget) about Unicode (and get it wrong every time... the sheer amount of work I've put into fixing codebases that mixed bytes and unicode objects without thinking about it made me a lot of money, but cost me a few years of my life, I'm sure).

dded 1 day ago

I was careful to say "Unix strings", not "ASCII". A Unix string contains no nul byte, but that's about the only rule. It's certainly not necessarily human-readable.

I don't think a programming language can take the position that an OS needs to "adopt better encodings". Python must live in the environment that the OS actually provides. It's probably a vain hope that Unix strings will vanish in anything less than decades (if ever), given the ubiquity of Unix-like systems and their 40 years of history.

I understand that Python2 does not handle Unicode well. I point out that Python3 does not handle Unix strings well. It would be good to have both.

gkya 1 day ago

> I was careful to say "Unix strings"

This is the first time I have encountered the idiom "Unix strings". I'll map it to "array of bytes" in my table of idioms.

> I don't think a programming language can take the position that an OS needs to "adopt better encodings".

I do think that programming languages should take a position on things, including but not limited to how data is represented and interpreted. A language is expected to provide some abstractions, and whether a string is an array of bytes or an array of characters is a consideration for the language designer, who will end up designing a language that takes one or another of the available sides.

Python has taken the side of the language user: it enabled Unicode names, defaulted to Unicode strings, defaulted to classes being subclasses of the 'object' class... Unix has taken the side of the machine (which was the side to take at the time of Unix's inception).

> [...] probably a vain hope that Unix strings will vanish [...]

If only we wait for them to vanish, doing nothing to improve.

> Python must live in the environment that the OS actually provides.

Yes, Python must indeed live in the OS' environment. Regardless, one need not be a farmer because they live among all farmers, need they?

dded 1 day ago

> This is the first time I encounter the idiom Unix strings

The usual idiom is C-strings, but I wanted to emphasize the OS, not the language C.

>> [...] probably a vain hope that Unix strings will vanish [...]

> If only we wait for them to vanish, doing nothing to improve.

The article is about the lack of Python3 adoption. In my case, Python3's poor handling of Unix/C strings is friction. It sounds like you believe that Unix/C strings can be made to go away in the near future. I do not believe this. (I'm not even certain that it's a good idea.)

gkya 18 hours ago

I do not insist that C strings must die; I insist that C strings are indeed arrays of bytes, and we cannot use them to represent text correctly at present. I fully support strings being Unicode-by-default in Python, as most people will put text between double quotes, not a bunch of bytes represented by textual characters.

I do not expect C or Unix interpretations of strings to change, but I believe that they must be considered low-level, and that the higher-level language user should have to explicitly request that a piece of data be interpreted in that fashion.

My first name is "Göktuğ". Honestly, which one of the following do you think is more desirable for me?

  Python 2.7.4 (default, Sep 26 2013, 03:20:26) 
  >>> "Göktuğ"
  'G\xc3\xb6ktu\xc4\x9f'

or

  Python 3.3.1 (default, Sep 25 2013, 19:29:01) 
  >>> "Göktuğ"
  'Göktuğ'

dded 16 hours ago

I'm not arguing against you. I just don't write any code that has to deal with people's names, so that's just not a problem that I face. I fully acknowledge that lack of Unicode is a big problem of Python2, but it's not my problem.

A Unix filename, on the other hand, might be any sort of C string. This sort of thing is all over Unix, not just filenames. (When I first ever installed Python3 at work back when 3.0 (3.1?) came out, one of the self tests failed when it tried to read an unkosher string in our /etc/passwd file.) When I code with Python2, or Perl, or C, or Emacs Lisp, I don't need to worry about these C strings. They just work.
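(Editor's note: for what it's worth, Python 3 does have some machinery for exactly these any-bytes-but-NUL values; a sketch.)

    # Pass bytes to os functions and get raw bytes back, or let the str path use
    # the filesystem encoding with surrogateescape so undecodable bytes round-trip.
    import os

    for name in os.listdir(b"."):        # bytes in -> bytes out, no decoding at all
        print(name)

    weird = b"caf\xe9"                   # Latin-1 bytes, invalid as UTF-8
    as_str = os.fsdecode(weird)          # 'caf\udce9' on a UTF-8 system
    assert os.fsencode(as_str) == weird  # and back, losslessly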

My inquiry, somewhere up this thread, is whether or not it would be possible to solve both problems. (Perhaps by defaulting to utf-8 instead of ASCII. I don't know, I'm not a language designer.)

dded 15 hours ago

> I insist that C strings are indeed arrays of bytes, and we cannot use them to represent text correctly at present

OK, maybe I do see one small point to argue. A C string, such as one that might be used in Unix, is not necessarily text. But text, represented as utf-8, is a C string.

It seems like there's something to leverage here, at least for those points at which Python3 interacts with the OS.


--

nabla9 7 hours ago

UTF-8 is usually good enough on disk.

I would like to have at least two options in memory: UTF-8 and a vector of displayed characters (there are many character combinations in use in existing modern languages that have no single-character representation in UTF-<anything>).

Guvante 7 hours ago

Do you need a vector of displayed characters?

Usually all you care about is the rendered size, which your rendering engine should be able to tell you. No need to be able to pick out those characters in most situations.

nabla9 6 hours ago

Yes. If I want to work with language and do some stringology, that's what I want. I might want to swap some characters, find the length of words, etc. Having a vector of characters (characters as humans consider them) is valuable.

pornel 3 hours ago

> To have vector of characters (as what humans consider characters) is valuable.

That might be an awful can of worms. Are Arabic vowels characters? Is the "ij" letter in Dutch a single character? Would you separate Korean text into letters or treat each block of letters as a character?
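(Editor's illustration, in Python, of the ambiguity both comments are circling: normalization resolves some cases, but a user-perceived character is often several code points no matter what.)

    import unicodedata

    s = "e\u0301"                                   # 'e' + COMBINING ACUTE ACCENT
    print(s, len(s))                                # é 2
    print(len(unicodedata.normalize("NFC", s)))     # 1: this sequence happens to compose

    flag = "\U0001F1F3\U0001F1F4"                   # two regional indicators, one "character"
    print(flag, len(flag))                          # prints 2, and NFC does not change that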


---

optimiz3 7 hours ago

Most of the post talks about how Windows made a poor design decision in choosing 16bit characters.

No debate there.

However, advocating "just make windows use UTF8" ignores the monumental engineering challenge and legacy back-compat issues.

In Windows most APIs have FunctionA and FunctionW versions, with FunctionA meaning legacy ASCII/ANSI and FunctionW meaning Unicode. The only way to really fix this without breaking lots of apps in subtle ways would be to add a 3rd version that was truly UTF-8.

Likely it would also only be available to Windows 9 compatible apps if such a feature shipped.

No dev wanting to make money is going to ship software that only targets Windows 9, so the entire ask is tough to sell.

Still no debate on the theoretical merits of UTF-8 though.

gamacodre 3 hours ago

We "solved" (worked around? hacked?) this by creating a set of FunctionU? macros and in some cases stubs that wrap all of the Windows entry points we use with incoming and outgoing converters. It's ugly under the hood and a bit slower than it needs to be, but the payoff of the app being able to consistently "think" in UTF-8 has been worth it.

Of course, we had to ditch resource-based string storage anyway for other cross-platform reasons, and were never particularly invested in the "Windows way" of doing things, so it wasn't a big shock to our developers when we made this change.

angersock 7 hours ago

Nothing worth doing is easy.

Anyways, the FunctionA/FunctionW distinction is usually hidden behind a macro (for better or worse). This could simply be yet another compiler option.


---

looks like ppl are down with UTF-8 as a universal default:

https://news.ycombinator.com/item?id=7070944

utf8everywhere.org

http://research.swtch.com/utf8

---

angersock 9 hours ago

What is currently the best way of dealing with UTF-8 strings in a cross-platform manner? It sounds like widechars and std::string just won't cut it.

lmm 3 hours ago

A higher level language, honestly. Perl, Ruby, Python3 and Haskell have excellent crossplatform utf8 support, and I'd be amazed if OCaml didn't. But if you want to write C++ code that works on windows, you're in for some pain.


---

simonster 2 days ago

> Another example is the byte addressing of UTF-8 strings, which may give an error if you try to index strings in the middle of a UTF-8 sequence [1]. s = "\u2200 x \u2203 y"; s[2] is an error, instead of returning the second character of the string. I find this a little awkward.

Yes, it's a little awkward, but to understand why this tradeoff was made, think about how you'd get the nth character in a UTF-8 string. There is a tradeoff between intuitive O(n) string indexing by characters and O(1) string indexing by bytes.
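(Editor's sketch in Python terms, since that is the language used elsewhere in these notes: byte indexing into UTF-8 is O(1) but can land mid-sequence, while finding the nth code point needs a scan.)

    s = "\u2200 x \u2203 y"            # the "∀ x ∃ y" example from the quote
    raw = s.encode("utf-8")

    print(raw[2])                      # 128: a continuation byte of "∀", not a character
    print(s[1])                        # ' ': Python's str indexes by code point instead

    def nth_codepoint_offset(data, n):
        count = -1
        for i, b in enumerate(data):
            if b & 0xC0 != 0x80:       # not a continuation byte: a new code point starts here
                count += 1
                if count == n:
                    return i
        raise IndexError(n)

    print(nth_codepoint_offset(raw, 1))   # 3: the space after the 3-byte "∀"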

The way out that some programming languages have chosen is to store your strings as UTF-16, and use O(1) indexing by two-byte sequence. That's not a great solution, because 1) it takes twice as much memory to store an ASCII string and 2) if someone gives you a string that contains a Unicode character that can't be expressed in UCS-2, like 🐣, your code will either be unable to handle it at all or do the wrong thing, and you are unlikely to know that until it happens.

The other way out is to store all of your strings as UTF-32/UCS-4. I'm not sure any programming language does this, because using 4x as much memory for ASCII strings and making string manipulation significantly slower as a result (particularly for medium-sized strings that would have fit in L1 cache as UTF-8 but can't as UCS-4) is not really a great design decision.

Instead of O(n) string indexing by characters, Julia has fast string indexing by bytes with chr2ind and nextind functions to get byte indexes by character index, and iterating over strings gives 4-byte characters. Is this the appropriate tradeoff? That depends on your taste. But I don't think that additional computer science knowledge would have made this problem any easier.

StefanKarpinski 2 days ago

It's also essentially the same approach that has been taken by Go and Rust, so we're in pretty decent company. Rob Pike and Ken Thompson might know a little bit about UTF-8 ;-)

exDM69 2 days ago

The problem I have with these design choices is that I predict lots of subtle off-by-one bugs and crashes because of non-ASCII inputs in the future of Julia. I hope that I am wrong :)

> Yes, it's a little awkward, but to understand why this tradeoff was made, think about how you'd get the nth character in a UTF-8 string. There is a tradeoff between intuitive O(n) string indexing by characters and O(1) string indexing by bytes.

I understand the problem of UTF-8 character vs. byte addressing and O(n) vs. O(1) and I have thought about the problem long and hard. And I don't claim to have a "correct" solution, this is a tricky tradeoff one way or the other.

I think that Julia "does the right thing" but perhaps exposes it to the programmer in a bit funny manner that is prone to runtime errors.

> The way out that some programming languages have chosen is to store your strings as UTF-16, and use O(1) indexing by two-byte sequence.

Using UTF-16 is a horrible idea in many ways, it doesn't solve the variable width encoding problem of UTF-8 but still consumes twice the memory.

> The other way out is to store all of your strings as UTF-32/UCS-4. I'm not sure any programming language does this, because using 4x as much memory for ASCII strings and making string manipulation significantly slower as a result (particularly for medium-sized strings that would have fit in L1 cache as UTF-8 but can't as UCS-4) is not really a great design decision.

This solves the variable width encoding issue at the cost of 4x memory use. Your concern about performance and cache performance is a valid one.

However, I would like to see a comparison of some real world use case how this performs. There will be a performance hit, that is for sure but how big is it in practice?

In my opinion, the string type in a language should be targeted at short strings (typical strings are around 32 characters or so; long strings are some hundreds of characters) and have practical operations for that. For long strings (kilobytes to megabytes) of text, another method (some kind of bytestring or "text" type) should be used. For a short string, a 4x memory use doesn't sound that bad, but your point about caches is still valid.

> Instead of O(n) string indexing by characters, Julia has fast string indexing by bytes with chr2ind and nextind functions to get byte indexes by character index, and iterating over strings gives 4-byte characters. Is this the appropriate tradeoff? That depends on your taste.

This is obviously the right thing to do when you store strings in UTF-8.

My biggest concern is that there will be programs that crash when given non-ascii inputs. The biggest change I would have made is that str[n] should not throw a runtime error as long as n is within bounds.

Some options I can think of are:

1) str[n] returns the n'th byte
2) str[n] returns the character at the n'th byte, or some not-a-character value
3) get rid of str[n] altogether and replace it with str.bytes()[n] (O(1)) and str.characters()[n] (where characters() returns some kind of lazy sequence if possible, O(n))

You're right, this boils down to a matter of taste. And my opinion is that crashing at runtime should always be avoided if it is possible by changing the design.

> But I don't think that additional computer science knowledge would have made this problem any easier.

There is a certain difference in "get things done" vs. "do it right" mentality between people who use computers for science and computer scientists. The right way to go is not in either extreme but some kind of delicate balance between the two.


bayle: i dont quite understand all of the Julia stuff above, but i think s[n] should index by character for strings, not crash

---

"

Unicode

Go looooves UTF-8. It's thrilling that Go takes Unicode seriously at all in a language landscape where Unicode support ranges from tacked-on to entirely absent. Strings are all UTF-8 (unsurprisingly, given the identity of the designers). Source code files themselves are UTF-8. Moreover, the API exposes operations like type conversion in terms of large-granularity strings, as opposed to something like C or Haskell where case conversion is built atop a function that converts individual characters. Also, there is explicit support for 32 bit Unicode code points ("runes"), and converting between runes, UTF-8, and UTF16. There's a lot to like about the promise of the language with respect to Unicode.

But it's not all good. There is no case-insensitive compare (presumably, developers are expected to convert case and then compare, which is different).

Since this was written, Go added an EqualFold function, which reports whether strings are equal under Unicode case-folding. This seems like a bizarre addition: Unicode-naïve developers looking for a case insensitive compare are unlikely to recognize EqualFold, while Unicode-savvy developers may wonder which of the many folding algorithms you actually get. It is also unsuitable for folding tasks like a case-insensitive sort or hash table.

Furthermore, EqualFold doesn't implement a full Unicode case insensitive compare. You can run the following code at golang.org; it ought to output true, but instead outputs false.

package main import "fmt" import "strings" func main() { fmt.Println(strings.EqualFold?("ss", "ß")) }

Bad Unicode support remains an issue in Go.

Operations like substring searching return indexes instead of ranges, which makes it difficult to handle canonically equivalent character sequences. Likewise, string comparison is based on literal byte comparisons: there is no obvious way to handle the precomposed "San José" as the same string as the decomposed "San José". These are distressing omissions.

To give a concrete example, do a case-insensitive search for "Berliner Weisse" on this page in a modern Unicode-savvy browser (sorry Firefox users), and it will correctly find the alternate spelling "Berliner Weiße", a string with a different number of characters. The Go strings package could not support this.

My enthusiasm for its Unicode support was further dampened when I exercised some of the operations it does support. For example, it doesn't properly handle the case conversions of Greek sigma (as in the name "Odysseus") or German eszett:

    package main

    import (
        "os"
        . "strings"
    )

    func main() {
        os.Stdout.WriteString(ToLower("ὈΔΥΣΣΕΎΣ\n"))
        os.Stdout.WriteString(ToUpper("Weiße Elster\n"))
    }

This outputs "ὀδυσσεύσ" and "WEIßE ELSTER", instead of the correct "ὀδυσσεύς" and "WEISSE ELSTER."

In fact, reading the source code it's clear that string case conversions are currently implemented in terms of individual character case conversion. For the same reason, title case is broken even for Roman characters: strings.ToTitle("ridiculous fish") results in "RIDICULOUS FISH" instead of the correct "Ridiculous Fish." D'oh.

Go has addressed this by documenting this weirdo existing behavior and then adding a Title function that does proper title case mapping. So Title does title case mapping on a string, while ToTitle does title case mapping on individual characters. Pretty confusing.

Unicode in Go might be summed up as good types underlying a bad API. This sounds like a reparable problem: start with a minimal incomplete string package, and fix it later. But we know from Python the confusion that results from that approach. It would be better to have a complete Unicode-savvy interface from the start, even if its implementation lags somewhat. " -- http://ridiculousfish.com/blog/posts/go_bloviations.html#go_unicode
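(Editor's aside in Python rather than Go, since Python is the language of the rest of these notes: the two behaviours the quoted post asks for, full case folding and canonical equivalence, for comparison.)

    import unicodedata

    print("ss" == "ß".casefold())                   # True (str.casefold, Python 3.3+)
    print("Weiße Elster".upper())                   # 'WEISSE ELSTER'

    pre = "San Jos\u00e9"                           # precomposed é
    dec = "San Jose\u0301"                          # e + combining acute
    print(pre == dec)                               # False: different code point sequences
    print(unicodedata.normalize("NFC", pre) ==
          unicodedata.normalize("NFC", dec))        # True after normalization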


" Moreover, there is the question of how tightly Guile and Emacs should be coupled. For one thing, the two projects currently use different internal string representations, which means that text must be decoded and encoded every time it passes in or out of the Guile interpreter. That inefficiency is certainly not ideal, but as Kastrup noted, attempting to unify the string representations is risky. Since Emacs is primarily a text editor, historically it has been forgiving about incorrectly encoded characters, in the interest of letting users get work done—it will happily convert invalid sequences into raw bytes for display purposes, then write them back as-is when saving a file.

But Guile has other use cases to worry about, such as executing programs which ought to raise an error when an invalid character sequence is encountered. Guile developer Mark H. Weaver cited passing strings into an SQL query as an example situation in which preserving "raw byte" code points could be exploited detrimentally. Weaver also expressed a desire to change Guile's internal string representation to UTF-8, as Emacs uses, but listed several unresolved sticking points that he said warranted further thought before proceeding. " -- http://lwn.net/SubscriberLink/615220/45105d9668fe1eb1/

---

"

The future of Emacs, Guile, and Emacs Lisp

Posted Oct 9, 2014 22:34 UTC (Thu) by smurf (subscriber, #17840) [Link] UTF-8 has one disadvantage: It's slightly more complex to find the n'th-next (or previous) character, which is important to the speed of pattern matching in some cases.

However, it has the distinct advantage that your large ASCII text does not suddenly need eight times the storage space just because you insert a character with a smiling kitty face.

The future of Emacs, Guile, and Emacs Lisp

Posted Oct 9, 2014 22:49 UTC (Thu) by mjg59 (subscriber, #23239) [Link] Combining characters mean you're going to take a hit with Unicode whatever the representation.

The future of Emacs, Guile, and Emacs Lisp

Posted Oct 10, 2014 16:08 UTC (Fri) by lambda (subscriber, #40735) [Link]

Except, when pattern matching UTF-8, you can generally just match on the bytes (code units) directly, rather than on the characters (codepoints); the algorithms that need to skip ahead by a fixed n characters are generally the exact string matching algorithms like Boyer-Moore and Knuth-Morris-Pratt. There's no reason to require that those be run on the codepoints instead of on the bytes.

If you're doing regular expression matching with Unicode data, even if you use UTF-32, you will need to consume variable length strings as single characters, as you can have decomposed characters that need to match as a single character.

People always bring up lack of constant codepoint indexing when UTF-8 is mentioned, but I have never seen an example in which you actually need to index by codepoint, that doesn't either break in the face of other issues like combining sequences, or can't be solved by just using code unit indexing.

The future of Emacs, Guile, and Emacs Lisp

Posted Oct 12, 2014 6:12 UTC (Sun) by k8to (subscriber, #15413) [Link] This view dates back to a time when UCS-2 was fixed size (whatever its name was then) or then when the predecessor for UTF32 was fixed size. As you point out, both of those eras passed.

It's a little more tedious to CUT a UTF8 string safely based on a size computed in bytes than in some other encodings, but not much more, and that's very rarely a fast path. " -- http://lwn.net/SubscriberLink/615220/45105d9668fe1eb1/
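(Editor's sketch of lambda's point above: because UTF-8 is self-synchronizing, exact substring search can run on the raw bytes and still only match at code point boundaries.)

    haystack = "Berliner Weiße, Berliner Weisse".encode("utf-8")
    needle = "Weiße".encode("utf-8")
    print(haystack.find(needle))    # 9: byte offset of the match, found without decoding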

---

i guess we want immutable, Pascal-style strings. That's what Java has; that's what Go has; Python has something similar (Python has a more optimized thingee that chooses between multiple unicode representations based on what is more efficient for this particular string).
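A sketch of that "optimized thingee", i.e. CPython's PEP 393 flexible representation: the per-character width (1, 2, or 4 bytes) is picked per string, so the exact sizes below vary but the ordering holds.

    import sys

    print(sys.getsizeof("a" * 100))             # ~1 byte per char + fixed overhead
    print(sys.getsizeof("\u20ac" * 100))        # ~2 bytes per char (BMP, non-Latin-1)
    print(sys.getsizeof("\U0001F423" * 100))    # ~4 bytes per char (non-BMP)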

---

"

benreic 3 hours ago

Quick link to what I think is the most interesting class in the CLR:

https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/String.cs " -- https://news.ycombinator.com/item?id=8992139

---

simfoo 3 hours ago

From their GetHashCode():

    // We want to ensure we can change our hash function daily.
    // This is perfectly fine as long as you don't persist the
    // value from GetHashCode to disk or count on String A
    // hashing before string B.  Those are bugs in your code.
    hash1 ^= ThisAssembly.DailyBuildNumber;

I'd love to hear the story behind this one :D

daeken 2 hours ago

I don't know the story, but the logic behind it is simple: If you want to guarantee no one depends on GetHashCode staying static between runs of an application, change it all the time.


---

DoggettCK 1 hour ago

That does answer an unanswered question I had on SO about string hashing. If strings are immutable, why isn't the hash code memoized? Seems like it would make HashSet/Dictionary lookups using string keys much faster.
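(Editor's sketch, in Python for consistency with the rest of these notes, of the memoization being asked about: cache the hash on first use, at the cost of one extra field per instance. The class name is made up.)

    class CachedHashText:
        __slots__ = ("value", "_hash")

        def __init__(self, value):
            self.value = value
            self._hash = None            # filled in on the first __hash__ call

        def __hash__(self):
            if self._hash is None:
                self._hash = hash(self.value)
            return self._hash

        def __eq__(self, other):
            return isinstance(other, CachedHashText) and self.value == other.value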

munificent 53 minutes ago

Presumably because it's not worth the memory hit to store the hash.

DoggettCK 44 minutes ago

That was my assumption, too. They do memoize the length, but I'm sure those bytes add up, having run into OutOfMemoryExceptions building huge amounts of strings before.

reddiric 31 minutes ago

Strings know their length in the CLR because they are represented as BSTRs

http://blogs.msdn.com/b/ericlippert/archive/2011/07/19/strin...

This lets them interoperate with OLE Automation.


---