proj-oot-ootStringNotes1

rspeer 1 day ago

link

The thing I'm looking forward to in Python 3.4 is that you should be able to follow the wise advice about how to handle text in the modern era:

"Text is always Unicode. Read it in as UTF-8. Write it out as UTF-8. Everything in between just works."

This was not true up through 3.2, because Unicode in Python <= 3.2 was an abstraction that leaked some very unfortunate implementation details. There was the chance that you were on a "narrow build" of Python, where Unicode characters in memory were fixed to be two bytes long, so you couldn't perform most operations on characters outside the Basic Multilingual Plane. You could kind of fake it sometimes, but it meant you had to be thinking about "okay, how is this text really represented in memory" all the time, and explicitly coding around the fact that two different installations of Python with the same version number have different behavior.

Python 3.3 switched to a flexible string representation that eliminated the need for narrow and wide builds. However, operations in this representation weren't tested well enough for non-BMP characters, so running something like text.lower() on arbitrary text could now give you a SystemError (http://bugs.python.org/issue18183).

That bug is fixed in Python 3.4, which removes the last thing I know of standing in the way of Unicode just working.

reply

--

http://mortoray.com/2013/11/27/the-string-type-is-broken/

https://news.ycombinator.com/item?id=6807524

---

" agentultra 2 days ago

link

...

> Default unicode strings are obscenely annoying to me. Almost all of my code deals with binary data, parsing complex data structures, etc. The only "human readable" strings in my code are logs. Why the hell should I worry about text encoding before sending a string into a TCP socket...

> The fact that the combination of words "encode" and "UTF8" appear in my code, and str.encode('hex') is no longer available, is a very good representation of why I hate Python 3.

I'm afraid I don't understand your complaint. If you're parsing binary data then Python 3 is clearly superior to Python 2:

    >>> "Hello, Gådel".encode("utf-8")
    b'Hello, G\xc3\xa5del'

Seems much more reasonable than:

    >>> "Hello, Gådel".encode("utf-8")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)

Because they're not the same thing. Python 2 would implicitly "promote" a bytestring (the default literal) to a unicode object so long as it contained only ASCII bytes. Of course this gets really tiresome and leads to Python 2's "unicode dance." Armin seems to prefer it to the extra leg-work for correct unicode handling in Python 3 [0]; however, I think the trade-off is worth it and that the pain will fade when the wider world catches up.

http://lucumr.pocoo.org/2013/7/2/the-updated-guide-to-unicode/
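
To spell out the implicit step behind the Python 2 traceback above, here is a hedged sketch written with Python 3 syntax (the byte string is the one from the example):

    # Python 2's str.encode() on a byte string first decodes it as ASCII to get
    # a unicode object, then encodes that. Written out explicitly:
    raw = b'Hello, G\xc3\xa5del'
    raw.decode('ascii').encode('utf-8')   # raises UnicodeDecodeError on byte 0xc3 at position 8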

...

If your Python 3 code is dealing with binary data, you would use byte strings and you would never have to call encode or touch UTF-8 before passing the byte string to a TCP socket.

What you're saying about Unicode scares me. If you haven't already, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) before writing any libraries that I might depend on.

reply

keyme 2 days ago

link

> you would use byte strings and you would never have to call encode or touch UTF-8 before passing the byte string to a TCP socket.

I'll start by adding that it's also incredibly annoying to declare each string to be a byte string, if this wasn't clear from my original rant.

Additionally, your advice is broken. Take a look at this example, with the shelve module (from standard library).

    s = shelve.open('/tmp/a')
    s[b'key'] = 1

Results in:

    AttributeError: 'bytes' object has no attribute 'encode'

So in this case, my byte string can't be used as a key here, apparently. Of course, a string is expected, and this isn't really a string. My use case was trying to use a binary representation of a hash as the key here. What's more natural than that? Could easily do that in Python 2. Not so easy now.

I can find endless examples for this, so your advice about "just using byte strings" is invalid. Conversions are inevitable. And this annoys me.

> What you're saying about Unicode scares me.

Yeah, I know full well what you're scared of. If I'm designing everything from scratch, using Unicode properly is easy. This, however, is not the case when implementing existing protocols, or reading file formats that don't use Unicode. That's where things begin being annoying when your strings are no longer strings.

deathanatos 2 days ago

link

> Additionally, your advice is broken. Take a look at this example, with the shelve module (from standard library).

His advice was sound, and referred to your example of TCP stream data (which is binary). Your example regards the shelve library.

> So in this case, my byte string can't be used as a key here, apparently.

shelve requires strings as keys. This is documented, though not particularly clearly.

> so your advice about "just using byte strings" is invalid.

Let me attempt to rephrase his advice. Use bytestrings where your data is a string of bytes. Use strings where your data is human-readable text. Convert to and from bytestrings when serializing to something like the network or storage.

> Conversions are inevitable.

Absolutely, because bytestrings and (text)strings are two different types.

> And this annoys me.

There is no real alternative though, because there is no way to automatically convert between the two. Python 2 made many assumptions, and these were often invalid and led to bugs. Python 3 does not; in places where it does not have the required information, you must provide it.

> when implementing existing protocols, or reading file formats that don't use Unicode.

I'm afraid it's still the same. A protocol that uses Unicode requires you to write something like "decode('utf-8')" (if UTF-8 is what it uses); one that does not requires "decode('whatever-it-uses-instead')". If it isn't clear what encoding the file format or protocol stores textual data in, then that's a bug in the file format or protocol, not Python. Regardless, Python doesn't know (and can't know) what encoding the file or protocol uses.

"

tungwaiyip 1 day ago

link

Keyme, you and I are among the few that have serious concerns with the design of Python 3. I started to embrace it in a big way 6 months ago, when most libraries I use became available in Python 3. I wish I could say the wait is over, we should all move to Python 3, and everything will be great. Instead I find no compelling advantage. Maybe there will be one when I start to use unicode strings more. Instead I'm really annoyed by the default iterators and the binary string handling. I am afraid it is not a change for the good.

I come from the Java world, where people take a lot of care to implement things as streams. It was initially shocking to see Python read an entire file into memory and turn it into a list or other data structure with no regard for memory usage. Then I learned this works perfectly well when you have a small input; a few MB or so is a piece of cake for a modern computer. It takes all the hassle out of setting up streams in Java. You optimize when you need to. But for 90% of stuff, a materialized list works perfectly well.

Now Python has become more like Java in this respect. I can't do exploratory programming easily without adding list(). Many times I run into problems when I am building a complex data structure like a list of lists, and end up getting a list of iterators. It takes the conciseness out of Python when I am forced to deal with iterators and to materialize the data.

The other big problem is the binary string. Binary string handling is one of the great features of Python. It is so much more friendly to manipulate binary data in Python compared to C or Java. In Python 3, it is pretty much broken. It would be an easy transition if I only needed to add a 'b' prefix to specify a binary string literal. But in fact, the operations on binary strings are so different from those on regular strings that it is just broken.

  In [38]: list('abc')
  Out[38]: ['a', 'b', 'c']
  
  In [37]: list(b'abc')       # string become numbers??
  Out[37]: [97, 98, 99]
  
  In [43]: ''.join('abc')
  Out[43]: 'abc'
  
  In [44]: ''.join(b'abc')    # broken, no easy way to join them back into string
  ---------------------------------------------------------------------------
  TypeError                                 Traceback (most recent call last)
  <ipython-input-44-fcdbf85649d1> in <module>()
  ----> 1 ''.join(b'abc')
  
  TypeError: sequence item 0: expected str instance, int found

reply

keyme 1 day ago

link

Yes! Thank you.

All the other commenters here that are explaining things like using a list() in order to print out an iterator are missing the point entirely.

The issue is "discomfort". Of course you can write code that makes everything work again. This isn't the issue. It's just not "comfortable". This is a major step backwards in a language that is used 50% of the time in an interactive shell (well, at least for some of us).

reply

agentultra 1 day ago

link

The converse problem is having to write iterator versions of map, filter, and other eagerly-evaluated builtins. You can't just write:

    >>> t = iter(map(lambda x: x * x, xs))

Because the map() call is eagerly evaluated. It's much easier to exhaust an iterator in the list constructor and leads to a consistent iteration API throughout the language.

If that makes your life hard then I feel sorry for you, son. I've got 99 problems but a list constructor ain't one.

The Python 3 bytes object is not intended to be the same as the Python 2 str object. They're completely separate concepts. Any comparison is moot.

Think of the bytes object as a dynamic char[] and you'll be less inclined to confusion and anger:

    >>> list(b'abc')
    [97, 98, 99]

That's not a list of numbers... that's a list of bytes!

    >>> "".join(map(lambda byte: chr(byte), b'abc'))
    'abc'

And you get a string!

reply

tungwaiyip 13 hours ago

link

What you are looking for is imap(). In Python 2 there is an entire collection of iterator variants. You can choose to use either the list or the iterator variants.

The problem with Python 3 is that the list versions are removed. You are forced to use iterators all the time. Things become inconvenient and ugly as a result. Bugs are regularly introduced because I forget to apply list().

  >>> "".join(map(lambda byte: chr(byte), b'abc'))

Compared to ''.join('abc'), this is what I call fuck'd. Luckily maxerickson suggested a better method.

reply

dded 15 hours ago

link

> >>> list(b'abc')
> [97, 98, 99]

> That's not a list of numbers... that's a list of bytes!

No, it's a list of numbers:

  >>> type(list(b'abc')[0])
  <class 'int'>

I think the GP mis-typed his last example. First, he showed that ''.join('abc') takes a string, busts it up, then concatenates it back to a string. Then, with ''.join(b'abc'), he appears to want to bust up a byte string and concatenate it back to a text string. But I suspect he meant to type this:

  >>> b''.join(b'abc')

That is, bust up a byte string and concatenate it back to what you started with: a byte string. But that doesn't work: when you bust up a byte string you get a list of ints, and you cannot concatenate them back into a byte string (at least not elegantly).

reply

itsadok 1 day ago

link

> The converse problem is having to write iterator versions of map, filter, and other eagerly-evaluated builtins

Well, in Python 2 you just use imap instead of map. That way you have both options, and you can be explicit rather than implicit.

> That's not a list of numbers... that's a list of bytes!

The point being made here is not that some things are not possible in Python 3, but rather that things that are natural in Python 2 are ugly in 3. I believe you're proving the point here. The idea that b'a'[0] == 97 in such a fundamental way that I might get one when I expected the other may be fine in C, but I hold Python to a higher standard.

reply

maxerickson 1 day ago

link
  >>> bytes(list(b'abc'))
  b'abc'
  >>>

That is, the way to turn a list of ints into a byte string is to pass it to the bytes object.

(This narrowly addresses that concern, I'd readily concede that the new API is going to have situations where it is worse)

reply

tungwaiyip 14 hours ago

link

Thank you. This is good to know. I was rather frustrated to find binary data handling being changed with no easy translation in Python 3.

Here is another annoyance:

    In [207]: 'abc'[0] + 'def'
    Out[207]: 'adef'
    In [208]: b'abc'[0] + b'def'
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-208-458c625ec231> in <module>()
    ----> 1 b'abc'[0] + b'def'
    TypeError: unsupported operand type(s) for +: 'int' and 'bytes'

reply

maxerickson 7 hours ago

link

I don't have enough experience with either version to debate the merits of the choice, but the way forward with python 3 is to think of bytes objects as more like special lists of ints, where if you want a slice (instead of a single element) you have to ask for it:

    >>> [1,2,3][0]
    1
    >>> [1,2,3][0:1]
    [1]
    >>> b'abc'[0]
    97
    >>> b'abc'[0:1]
    b'a'
    >>> 

So the construction you want is just:

    >>> b'abc'[0:1]+b'def'
    b'adef'

Which is obviously worse if you are doing it a bunch of times, but it is at least coherent with the view that bytes are just collections of ints (and there are situations where indexing operations returning an int is going to be more useful).

reply

tungwaiyip 4 hours ago

link

In Java, String and char are two separate types. In Python, there is no separate char type; a character is simply a string of length 1. I do not have a great theory to show which design is better either. I can only say the Python design worked great for me in the past (for both text and binary strings), and I suspect it is the more user-friendly design of the two.

So in Python 3 the design of binary strings has changed. Unlike the old string, a byte and a binary string of length 1 are not the same. Working code is broken, practices have to be changed, and often more complicated code is involved (like [0] becoming [0:1]). All this happens with no apparent benefit other than being more "coherent" in the eyes of some people. This is the frustration I see after using Python 3 for some time.

reply

keyme 2 days ago

link

If I recall, writing no boilerplate code was a big deal in python once...

And while 2 lines are not worth my rant, writing those 2 lines again and again all the time, is.

reply

baq 1 day ago

link

i'd argue this is not boilerplate, more like a shortcut for your particular use case:

    import codecs
    enc = lambda x: codecs.encode(x, 'hex')
    

i have a program in python 2 that uses this approach, because i have a lot of decoding from utf and encoding to a different charset to do. python 3 is absolutely the same for me.

reply
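
For reference, the Python 3 replacements for Python 2's str.encode('hex') (the complaint further up the thread) are all in the stdlib; a quick sketch:

    import binascii, codecs

    data = b'\xde\xad\xbe\xef'
    print(data.hex())                   # 'deadbeef' (bytes.hex(), Python 3.5+)
    print(binascii.hexlify(data))       # b'deadbeef'
    print(codecs.encode(data, 'hex'))   # b'deadbeef' (the 'hex' bytes-to-bytes codec alias is back in 3.4+)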

gkya 1 day ago

link

> Why the hell should I worry about text encoding before sending a string into a TCP socket...

A string represents a snippet of human readable text and is not merely an array of bytes in a sane world. Thus it is fine & sane to have to encode a string before sticking it into a socket, as sockets are used to transfer bytes from point a to b, not text.

reply

dded 1 day ago

link

Not arguing that you're wrong, but Unix/Linux is not a sane world by your definition. Whether we like it or not (I do like it), this is the world many of us live in. Python3 adds a burden in this world where none existed in Python2. In exchange, there is good Unicode support, but not everyone uses that. I can't help but wonder if good Unicode support could have been added in a way that preserved Python2 convenience with Unix strings.

(Please note that I'm not making any statement as to what's appropriate to send down a TCP socket.)

reply

agentultra 1 day ago

link

ASCII by default is only an accident of history. It's going to be a slow, painful process but all human-readable text is going to be Unicode at some point. For historical reasons you'll still have to encode a vector of bytes full of character information to send it down the pipe but there's no reason why we shouldn't be explicit about it.

The pain is painful [in Python 3] primarily for library authors and only at the extremities. If you author your libraries properly your users won't even notice the difference. And in the end as more protocols and operating systems adopt better encodings for Unicode support that pain will fade (I'm looking at you, surrogateescape).

It's better to be ahead of the curve on this transition so that users of the language and our libraries won't get stuck with it. Python 2 made users have to think (or forget) about Unicode (and get it wrong every time... the sheer amount of work I've put into fixing codebases that mixed bytes and unicode objects without thinking about it made me a lot of money but cost me a few years of my life, I'm sure).

reply

dded 1 day ago

link

I was careful to say "Unix strings", not "ASCII". A Unix string contains no nul byte, but that's about the only rule. It's certainly not necessarily human-readable.

I don't think a programming language can take the position that an OS needs to "adopt better encodings". Python must live in the environment that the OS actually provides. It's probably a vain hope that Unix strings will vanish in anything less than decades (if ever), given the ubiquity of Unix-like systems and their 40 years of history.

I understand that Python2 does not handle Unicode well. I point out that Python3 does not handle Unix strings well. It would be good to have both.

reply

gkya 1 day ago

link

> I was careful to say "Unix strings"

This is the first time I have encountered the idiom "Unix strings." I'll map it to "array of bytes" in my table of idioms.

> I don't think a programming language can take the position that an OS needs to "adopt better encodings".

I do think that programming languages should take a position on things, including but not limited to how data is represented and interpreted. A language is expected to provide some abstractions, and whether a string is an array of bytes or an array of characters is a consideration for the language designer, who will end up designing a language that takes one side or the other.

Python has taken the side of the language user: it enabled Unicode names, defaulted to Unicode strings, defaulted to classes being subclasses of the 'object' class... Unix has taken the side of the machine (which was the side to take at the time of Unix's inception).

> [...] probably a vain hope that Unix strings will vanish [...]

If only we wait for them to vanish, doing nothing to improve.

> Python must live in the environment that the OS actually provides.

Yes, Python must indeed live in the OS' environment. Regardless, one need not be a farmer just because they live among farmers, need they?

reply

dded 1 day ago

link

> This is the first time I encounter the idiom Unix strings

The usual idiom is C-strings, but I wanted to emphasize the OS, not the language C.

>> [...] probably a vain hope that Unix strings will vanish [...]

> If only we wait for them to vanish, doing nothing to improve.

The article is about the lack of Python3 adoption. In my case, Python3's poor handling of Unix/C strings is friction. It sounds like you believe that Unix/C strings can be made to go away in the near future. I do not believe this. (I'm not even certain that it's a good idea.)

reply

gkya 18 hours ago

link

I do not insist that C strings must die; I insist that C strings are indeed arrays of bytes, and we cannot use them to represent text correctly at present. I fully support strings being Unicode by default in Python, as most people will put text between double quotes, not a bunch of bytes represented by textual characters.

I do not expect the C or Unix interpretations of strings to change, but I believe that they must be considered low-level, and that the higher-level language user should have to explicitly request that the compiler interpret a piece of data in such a fashion.

My first name is "Göktuğ". Honestly, which one of the following do you think is more desirable for me?

  Python 2.7.4 (default, Sep 26 2013, 03:20:26) 
  >>> "Göktuğ"
  'G\xc3\xb6ktu\xc4\x9f'

or

  Python 3.3.1 (default, Sep 25 2013, 19:29:01) 
  >>> "Göktuğ"
  'Göktuğ'

reply

dded 16 hours ago

link

I'm not arguing against you. I just don't write any code that has to deal with people's names, so that's just not a problem that I face. I fully acknowledge that lack of Unicode is a big problem of Python2, but it's not my problem.

A Unix filename, on the other hand, might be any sort of C string. This sort of thing is all over Unix, not just filenames. (When I first ever installed Python3 at work back when 3.0 (3.1?) came out, one of the self tests failed when it tried to read an unkosher string in our /etc/passwd file.) When I code with Python2, or Perl, or C, or Emacs Lisp, I don't need to worry about these C strings. They just work.

My inquiry, somewhere up this thread, is whether or not it would be possible to solve both problems. (Perhaps by defaulting to utf-8 instead of ASCII. I don't know, I'm not a language designer.)

reply

dded 15 hours ago

link

> I insist that C strings are indeed arrays of bytes, and we cannot use them to represent text correctly at present

OK, maybe I do see one small point to argue. A C string, such as one that might be used in Unix, is not necessarily text. But text, represented as utf-8, is a C string.

It seems like there's something to leverage here, at least for those points at which Python3 interacts with the OS.

reply
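
Worth noting as an aside: PEP 383's surrogateescape error handler is how Python 3 bridges this today for filenames and other OS data, letting arbitrary byte strings round-trip through str. A minimal sketch (the filename is made up; behavior described is for a typical POSIX system):

    import os

    # A Unix filename that is not valid UTF-8.
    raw_name = b"report-\xff.txt"

    # os.fsdecode() decodes with the filesystem encoding and, on POSIX, the
    # surrogateescape handler, so the stray byte survives as a lone surrogate
    # code point instead of raising.
    as_text = os.fsdecode(raw_name)

    # os.fsencode() reverses the mapping exactly, recovering the original bytes.
    assert os.fsencode(as_text) == raw_name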

--

nabla9 7 hours ago

link

UTF-8 is usually good enough on disk.

I would like to have at least two options in memory: UTF-8 and a vector of displayed characters (there are many combinations in use in existing modern languages with no single-character representation in UTF-<anything>).

reply

Guvante 7 hours ago

link

Do you need a vector of displayed characters?

Usually all you care about is the rendered size, which your rendering engine should be able to tell you. No need to be able to pick out those characters in most situations.

reply

nabla9 6 hours ago

link

Yes. If I want to work with language and do some stringology, that's what I want. I might want to swap some characters, find the length of words, etc. To have a vector of characters (characters as humans consider them) is valuable.

reply

pornel 3 hours ago

link

> To have vector of characters (as what humans consider characters) is valuable.

That might be an awful can of worms. Are Arabic vowels characters? "ij" letter in Dutch? Would you separate Korean text into letters or treat each block of letters as a character?

reply
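
Related note: the "vector of what humans consider characters" nabla9 is asking for corresponds roughly to Unicode extended grapheme clusters. A hedged sketch using the third-party regex module (unlike the stdlib re, it supports the \X grapheme-cluster pattern):

    import regex  # third-party: pip install regex

    s = "g\u0308o"                   # 'g' + combining diaeresis + 'o': 3 code points
    print(len(s))                    # 3 -- length in code points
    print(regex.findall(r"\X", s))   # ['g\u0308', 'o'] -- 2 user-perceived characters

As pornel points out below, though, what counts as a "character" is locale- and script-dependent, so even grapheme clusters are only an approximation.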

---

optimiz3 7 hours ago

link

Most of the post talks about how Windows made a poor design decision in choosing 16-bit characters.

No debate there.

However, advocating "just make windows use UTF8" ignores the monumental engineering challenge and legacy back-compat issues.

In Windows most APIs have FunctionA and FunctionW versions, with FunctionA meaning legacy ASCII/ANSI and FunctionW meaning Unicode. You couldn't really fix this without adding a 3rd version that was truly UTF-8 without breaking lots of apps in subtle ways.

Likely it would also only be available to Windows 9 compatible apps if such a feature shipped.

No dev wanting to make money is going to ship software that only targets Windows 9, so the entire ask is tough to sell.

Still no debate on the theoretical merits of UTF-8 though.

reply

gamacodre 3 hours ago

link

We "solved" (worked around? hacked?) this by creating a set of FunctionU? macros and in some cases stubs that wrap all of the Windows entry points we use with incoming and outgoing converters. It's ugly under the hood and a bit slower than it needs to be, but the payoff of the app being able to consistently "think" in UTF-8 has been worth it.

Of course, we had to ditch resource-based string storage anyway for other cross-platform reasons, and were never particularly invested in the "Windows way" of doing things, so it wasn't a big shock to our developers when we made this change.

reply

angersock 7 hours ago

link

Nothing worth doing is easy.

Anyways, the FunctionA/FunctionW pair is usually hidden behind a macro (for better or worse). This could simply be yet another compiler option.

reply

---

looks like ppl are down with UTF-8 as a universal default:

https://news.ycombinator.com/item?id=7070944

utf8everywhere.org

http://research.swtch.com/utf8

---

angersock 9 hours ago

link

What is currently the best way of dealing with UTF-8 strings in a cross-platform manner? It sounds like widechars and std::string just won't cut it.

reply

lmm 3 hours ago

link

A higher-level language, honestly. Perl, Ruby, Python 3 and Haskell have excellent cross-platform UTF-8 support, and I'd be amazed if OCaml didn't. But if you want to write C++ code that works on Windows, you're in for some pain.

reply

---

simonster 2 days ago

link

> Another example is the byte addressing of UTF-8 strings, which may give an error if you try to index strings in the middle of a UTF-8 sequence [1]. s = "\u2200 x \u2203 y"; s[2] is an error, instead of returning the second character of the string. I find this a little awkward.

Yes, it's a little awkward, but to understand why this tradeoff was made, think about how you'd get the nth character in a UTF-8 string. There is a tradeoff between intuitive O(n) string indexing by characters and O(1) string indexing by bytes.

The way out that some programming languages have chosen is to store your strings as UTF-16, and use O(1) indexing by two-byte sequence. That's not a great solution, because 1) it takes twice as much memory to store an ASCII string and 2) if someone gives you a string that contains a Unicode character that can't be expressed in UCS-2, like 🐣, your code will either be unable to handle it at all or do the wrong thing, and you are unlikely to know that until it happens.

The other way out is to store all of your strings as UTF-32/UCS-4. I'm not sure any programming language does this, because using 4x as much memory for ASCII strings and making string manipulation significantly slower as a result (particularly for medium-sized strings that would have fit in L1 cache as UTF-8 but can't as UCS-4) is not really a great design decision.

Instead of O(n) string indexing by characters, Julia has fast string indexing by bytes with chr2ind and nextind functions to get byte indexes by character index, and iterating over strings gives 4-byte characters. Is this the appropriate tradeoff? That depends on your taste. But I don't think that additional computer science knowledge would have made this problem any easier.

reply
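
To see the tradeoff in concrete terms, a small sketch in Python (which indexes strings by code point and hides byte offsets, paying for it with a fancier internal representation); the string is the one from the quote above:

    s = "\u2200 x \u2203 y"          # "∀ x ∃ y"
    b = s.encode("utf-8")

    print(s[1])                      # ' '     -- code-point indexing
    print(b[2:3])                    # b'\x80' -- byte 2 is in the middle of the 3-byte sequence for '∀'
    print(b[:3].decode("utf-8"))     # '∀'     -- decoding only works on whole sequences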

StefanKarpinski 2 days ago

link

It's also essentially the same approach that has been taken by Go and Rust, so we're in pretty decent company. Rob Pike and Ken Thompson might know a little bit about UTF-8 ;-)

reply

exDM69 2 days ago

link

The problem I have with these design choices is that I predict lots of subtle off-by-one bugs and crashes because of non-ASCII inputs in the future of Julia. I hope that I am wrong :)

> Yes, it's a little awkward, but to understand why this tradeoff was made, think about how you'd get the nth character in a UTF-8 string. There is a tradeoff between intuitive O(n) string indexing by characters and O(1) string indexing by bytes.

I understand the problem of UTF-8 character vs. byte addressing and O(n) vs. O(1) and I have thought about the problem long and hard. And I don't claim to have a "correct" solution, this is a tricky tradeoff one way or the other.

I think that Julia "does the right thing" but perhaps exposes it to the programmer in a bit funny manner that is prone to runtime errors.

> The way out that some programming languages have chosen is to store your strings as UTF-16, and use O(1) indexing by two-byte sequence.

Using UTF-16 is a horrible idea in many ways: it doesn't solve the variable-width encoding problem of UTF-8 but still consumes twice the memory.

> The other way out is to store all of your strings as UTF-32/UCS-4. I'm not sure any programming language does this, because using 4x as much memory for ASCII strings and making string manipulation significantly slower as a result (particularly for medium-sized strings that would have fit in L1 cache as UTF-8 but can't as UCS-4) is not really a great design decision.

This solves the variable width encoding issue at the cost of 4x memory use. Your concern about performance and cache performance is a valid one.

However, I would like to see a comparison of how this performs in some real-world use case. There will be a performance hit, that is for sure, but how big is it in practice?

In my opinion, the string type in a language should be targeted at short strings (long ones being some hundreds of characters; typical strings are around 32 or so) and have practical operations for that. For long stretches of text (kilobytes to megabytes), another method (some kind of bytestring or "text" type) should be used. For a short string, 4x memory use doesn't sound that bad, but your point about caches is still valid.

> Instead of O(n) string indexing by characters, Julia has fast string indexing by bytes with chr2ind and nextind functions to get byte indexes by character index, and iterating over strings gives 4-byte characters. Is this the appropriate tradeoff? That depends on your taste.

This is obviously the right thing to do when you store strings in UTF-8.

My biggest concern is that there will be programs that crash when given non-ASCII inputs. The biggest change I would have made is that str[n] should not throw a runtime error as long as n is within bounds.

Some options I can think of are: 1) str[n] returns the n'th byte; 2) str[n] returns the character at the n'th byte, or some not-a-character value; 3) get rid of str[n] altogether and replace it with str.bytes()[n] (O(1)) and str.characters()[n] (where characters() returns some kind of lazy sequence if possible, O(n)).

You're right, this boils down to a matter of taste. And my opinion is that crashing at runtime should always be avoided if it is possible by changing the design.

> But I don't think that additional computer science knowledge would have made this problem any easier.

There is a certain difference in "get things done" vs. "do it right" mentality between people who use computers for science and computer scientists. The right way to go is not in either extreme but some kind of delicate balance between the two.

reply

bayle: I don't quite understand all of the Julia stuff above, but I think s[n] should index by character for strings, not crash

---

"

Unicode

Go looooves UTF-8. It's thrilling that Go takes Unicode seriously at all in a language landscape where Unicode support ranges from tacked-on to entirely absent. Strings are all UTF-8 (unsurprisingly, given the identity of the designers). Source code files themselves are UTF-8. Moreover, the API exposes operations like type conversion in terms of large-granularity strings, as opposed to something like C or Haskell where case conversion is built atop a function that converts individual characters. Also, there is explicit support for 32 bit Unicode code points ("runes"), and converting between runes, UTF-8, and UTF16. There's a lot to like about the promise of the language with respect to Unicode.

But it's not all good. There is no case-insensitive compare (presumably, developers are expected to convert case and then compare, which is different).

Since this was written, Go added an EqualFold function, which reports whether strings are equal under Unicode case-folding. This seems like a bizarre addition: Unicode-naïve developers looking for a case insensitive compare are unlikely to recognize EqualFold, while Unicode-savvy developers may wonder which of the many folding algorithms you actually get. It is also unsuitable for folding tasks like a case-insensitive sort or hash table.

Furthermore, EqualFold doesn't implement a full Unicode case insensitive compare. You can run the following code at golang.org; it ought to output true, but instead outputs false.

    package main

    import "fmt"
    import "strings"

    func main() {
        fmt.Println(strings.EqualFold("ss", "ß"))
    }

Bad Unicode support remains an issue in Go.

Operations like substring searching return indexes instead of ranges, which makes it difficult to handle canonically equivalent character sequences. Likewise, string comparison is based on literal byte comparisons: there is no obvious way to handle the precomposed "San José" as the same string as the decomposed "San José". These are distressing omissions.

To give a concrete example, do a case-insensitive search for "Berliner Weisse" on this page in a modern Unicode-savvy browser (sorry Firefox users), and it will correctly find the alternate spelling "Berliner Weiße", a string with a different number of characters. The Go strings package could not support this.

My enthusiasm for its Unicode support was further dampened when I exercised some of the operations it does support. For example, it doesn't properly handle the case conversions of Greek sigma (as in the name "Odysseus") or German eszett:

    package main

    import (
        "os"
        . "strings"
    )

    func main() {
        os.Stdout.WriteString(ToLower("ὈΔΥΣΣΕΎΣ\n"))
        os.Stdout.WriteString(ToUpper("Weiße Elster\n"))
    }

This outputs "ὀδυσσεύσ" and "WEIßE ELSTER", instead of the correct "ὀδυσσεύς" and "WEISSE ELSTER."

In fact, reading the source code it's clear that string case conversions are currently implemented in terms of individual character case conversion. For the same reason, title case is broken even for Roman characters: strings.ToTitle("ridiculous fish") results in "RIDICULOUS FISH" instead of the correct "Ridiculous Fish." D'oh.

Go has addressed this by documenting this weirdo existing behavior and then adding a Title function that does proper title case mapping. So Title does title case mapping on a string, while ToTitle does title case mapping on individual characters. Pretty confusing.

Unicode in Go might be summed up as good types underlying a bad API. This sounds like a reparable problem: start with a minimal incomplete string package, and fix it later. But we know from Python the confusion that results from that approach. It would be better to have a complete Unicode-savvy interface from the start, even if its implementation lags somewhat. " -- http://ridiculousfish.com/blog/posts/go_bloviations.html#go_unicode
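
For comparison, Python 3.3+ separates full case folding from lowercasing and puts normalization in the stdlib, which covers the eszett and "San José" complaints above (a quick sketch):

    import unicodedata

    # str.casefold() does Unicode full case folding -- what a caseless compare wants.
    print("Weiße".casefold() == "WEISSE".casefold())        # True -- ß folds to ss
    print("ΟΔΥΣΣΕΥΣ".casefold() == "οδυσσευς".casefold())   # True -- final sigma folds too

    # unicodedata.normalize() handles precomposed vs. decomposed forms.
    decomposed = "Jose\u0301"                                # 'e' + combining acute accent
    print(unicodedata.normalize("NFC", decomposed) == "Jos\u00e9")  # True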


" Moreover, there is the question of how tightly Guile and Emacs should be coupled. For one thing, the two projects currently use different internal string representations, which means that text must be decoded and encoded every time it passes in or out of the Guile interpreter. That inefficiency is certainly not ideal, but as Kastrup noted, attempting to unify the string representations is risky. Since Emacs is primarily a text editor, historically it has been forgiving about incorrectly encoded characters, in the interest of letting users get work done—it will happily convert invalid sequences into raw bytes for display purposes, then write them back as-is when saving a file.

But Guile has other use cases to worry about, such as executing programs which ought to raise an error when an invalid character sequence is encountered. Guile developer Mark H. Weaver cited passing strings into an SQL query as an example situation in which preserving "raw byte" code points could be exploited detrimentally. Weaver also expressed a desire to change Guile's internal string representation to UTF-8, as Emacs uses, but listed several unresolved sticking points that he said warranted further thought before proceeding. " -- http://lwn.net/SubscriberLink/615220/45105d9668fe1eb1/

---

"

The future of Emacs, Guile, and Emacs Lisp

Posted Oct 9, 2014 22:34 UTC (Thu) by smurf (subscriber, #17840) [Link] UTF-8 has one disadvantage: It's slightly more complex to find the n'th-next (or previous) character, which is important to the speed of pattern matching in some cases.

However, it has the distinct advantage that your large ASCII text does not suddenly need eight times the storage space just because you insert a character with a smiling kitty face.

The future of Emacs, Guile, and Emacs Lisp

Posted Oct 9, 2014 22:49 UTC (Thu) by mjg59 (subscriber, #23239) [Link] Combining characters mean you're going to take a hit with Unicode whatever the representation.

The future of Emacs, Guile, and Emacs Lisp

Posted Oct 10, 2014 16:08 UTC (Fri) by lambda (subscriber, #40735) [Link]

Except, when pattern matching UTF-8, you can generally just match on the bytes (code units) directly, rather than on the characters (codepoints); the algorithms that need to skip ahead by a fixed n characters are generally the exact string matching algorithms like Boyer-Moore and Knuth-Morris-Pratt. There's no reason to require that those be run on the codepoints instead of on the bytes.

If you're doing regular expression matching with Unicode data, even if you use UTF-32, you will need to consume variable length strings as single characters, as you can have decomposed characters that need to match as a single character.

People always bring up lack of constant codepoint indexing when UTF-8 is mentioned, but I have never seen an example in which you actually need to index by codepoint, that doesn't either break in the face of other issues like combining sequences, or can't be solved by just using code unit indexing.

The future of Emacs, Guile, and Emacs Lisp

Posted Oct 12, 2014 6:12 UTC (Sun) by k8to (subscriber, #15413) [Link] This view dates back to a time when UCS-2 was fixed size (whatever its name was then), or to when the predecessor of UTF-32 was fixed size. As you point out, both of those eras have passed.

It's a little more tedious to CUT a UTF8 string safely based on a size computed in bytes than in some other encodings, but not much more, and that's very rarely a fast path. " -- http://lwn.net/SubscriberLink/615220/45105d9668fe1eb1/

---

i guess we want immutable, Pascal-style strings. That's what Java has; that's what Go has; Python has something similar (Python has a more optimized thingee that chooses between multiple unicode representations based on what is more efficient for this particular string).

---

"

benreic 3 hours ago

link

Quick link to what I think is the most interesting class in the CLR:

https://github.com/dotnet/coreclr/blob/master/src/mscorlib/src/System/String.cs " -- https://news.ycombinator.com/item?id=8992139

---

simfoo 3 hours ago

link

From their GetHashCode():

    We want to ensure we can change our hash function daily.
    This is perfectly fine as long as you don't persist the
    value from GetHashCode to disk or count on String A
    hashing before string B. Those are bugs in your code.

    hash1 ^= ThisAssembly.DailyBuildNumber;

I'd love to hear the story behind this one :D

reply

daeken 2 hours ago

link

I don't know the story, but the logic behind it is simple: If you want to guarantee no one depends on GetHashCode staying static between runs of an application, change it all the time.

reply

---

DoggettCK 1 hour ago

link

That does answer an unanswered question I had on SO about string hashing. If strings are immutable, why isn't the hash code memoized? Seems like it would make HashSet/Dictionary lookups using string keys much faster.

reply

munificent 53 minutes ago

link

Presumably because it's not worth the memory hit to store the hash.

reply

DoggettCK 44 minutes ago

link

That was my assumption, too. They do memoize the length, but I'm sure those bytes add up, having run into OutOfMemoryExceptions building huge amounts of strings before.

reply

reddiric 31 minutes ago

link

Strings know their length in the CLR because they are represented as BSTRs

http://blogs.msdn.com/b/ericlippert/archive/2011/07/19/strin...

This lets them interoperate with OLE Automation.

reply

---

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 is like UTF-8 except that NUL (U+0000) is encoded as the overlong two-byte sequence 0xC0 0x80, so the byte 0 never appears in the bytestream.

https://news.ycombinator.com/item?id=3906755 says that various external programs fail to handle NULL well.

if NULL is not handled well, there can be bugs, even security bugs:

http://hakipedia.com/index.php/Poison_Null_Byte

so maybe we should use Modified UTF-8 rather than true UTF-8, by default
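
A sketch of what that would mean in practice (this is Java's "modified UTF-8" rule for NUL, written out by hand; the function name is mine, and the full Java scheme also has a separate rule for supplementary characters that is not shown here):

    def encode_modified_utf8(s: str) -> bytes:
        out = bytearray()
        for ch in s:
            if ch == "\x00":
                out += b"\xc0\x80"          # NUL never appears as a raw 0x00 byte
            else:
                out += ch.encode("utf-8")   # everything else is plain UTF-8
        return bytes(out)

    print(encode_modified_utf8("a\x00b"))   # b'a\xc0\x80b'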

---

jeez Unicode sounds like such a mess:

https://gojko.net/2017/11/07/five-things-about-unicode.html

" 1. Many Unicode points are not visible

Unicode has several zero-width code points, for example the zero-width joiner (U+200D) and the zero-width non-joiner (U+200C), which are hints for hyphenation tools. They have no visible effect on screen appearance, but they still affect string comparison, which is why the WhatsApp scammers were able to pass undetected for so long. Most of these characters are in the general punctuation block (from U+2000 to U+206F). There's generally no justification for allowing anyone to use code points from that block in identifiers, so they are at least easy to filter. However, there are some other special codes outside that range that are invisible, such as the Mongolian Vowel Separator (U+180E).

In general, it’s dangerous to do simple string comparisons for uniqueness constraints with Unicode. A potential workaround is to limit the character sets allowed for identifiers and any other pieces of data which could be abused by scammers. Unfortunately, that’s not a full solution to the problem.

2. Many code points look very similar

... An amazing abuse of this problem is Mimic, a fun utility that replaces common symbols used in software development, such as colons and semi-colons, with similarly-looking Unicode characters. ...

Fancifully called homograph attacks, these exploits can cause serious security issues. In April 2017, a security researcher was able to register a domain that looked very similar to apple.com and even get an SSL certificate for it, by mixing letters from different character sets. Similar to mixing visible and non-visible characters, there's rarely any justification for allowing mixed-character-set names to be used in identifiers, especially domain names. Most browsers have taken steps to penalise mixed-character-set domain names by displaying them as hex unicode values, so users do not get confused so easily. ... However, that's not a perfect solution either. Some domain names, such as sap.com or chase.com, can easily be constructed completely out of a single block in a non-Latin character set.

The Unicode consortium publishes a list of easily confusable characters, which might be a nice reference to automatically check for potential scams.

3. Normalisation isn’t that normal

Normalisation is very important for identifiers, such as usernames, to help people enter values in different ways but process them consistently. One common way of normalising identifiers is to transform everything into lowercase, making sure that JamesBond is the same as jamesbond.

With so many similar characters and overlapping sets, different languages or unicode processing libraries might apply different normalisation strategies, potentially opening security risks if normalisation is done in several places. In short, don’t assume that lowercase transformations work the same in different parts of your application. Mikael Goldmann from Spotify wrote up a nice incident analysis about this issue in 2013, after one of their users discovered a way to hijack accounts. Attackers could register unicode variants of other people’s usernames (such as ᴮᴵᴳᴮᴵᴿᴰ), which would be translated to the same canonical account name (bigbird). Different layers of the application normalised the word differently, allowing people to register spoof accounts but reset the password of the target account.

4. There is no relationship between screen display length and memory size

... There are lovely symbols such as Bismallah Ar-Rahman Ar-Raheem (U+FDFD), a single character longer than most English words, easily breaking out of assumed visual enclosures in web sites. ...

5. Unicode is more than just passive data

Some code points are designed to impact how the printable characters get displayed, meaning that users can copy and paste more than just data — they can enter processing instructions as well. A common prank is to switch text direction using the right-to-left override (U+202E). For example, make Google Maps look for Ninjas. The query string actually flips the direction of the search word, and though the page displays ‘ninjas’ in the search field, it actually searched for ‘sajnin’.

...

Another particularly problematic type of processing instructions for display are variation selectors. In order to avoid creating a separate code for each colour variant of each emoji, Unicode allows mixing basic symbols with colours using a variation selector. A white flag, variation selector and rainbow would normally produce a rainbow-coloured flag. But not all variations are valid. In January 2017 a bug in iOS unicode processing allowed pranksters to remotely crash iPhones by just sending a specially crafted message. The message contained a white flag, a variation selector, and a zero. iOS CoreText went into panic mode trying to pick the right variant and crashed the OS. The trick worked in direct messages, group chats, even with sharing contact cards. The problem affected iPads as well, and even some MacBook computers. There was pretty much nothing the target of the prank could do to prevent the crash.

Similar bugs happen every few years. In 2013, a bug with Arabic character processing surfaced that could crash OSX and iOS. All these bugs were buried deep into OS text handling modules, so typical client application developers would not be able to prevent them at all.

"

---

because UTF-8 is 'backwards compatible' with ASCII, maybe we could accept it in strings in source code, even if it is not semantically understood?

---

arguments in favor of UTF-8 over other Unicode encodings [1]:

" Comparison with single-byte encodings

    UTF-8 can encode any Unicode character, avoiding the need to figure out and set a "code page" or otherwise indicate what character set is in use, and allowing output in multiple scripts at the same time. For many scripts there have been more than one single-byte encoding in usage, so even knowing the script was insufficient information to display it correctly.
    The bytes 0xFE and 0xFF do not appear, so a valid UTF-8 stream never matches the UTF-16 byte order mark and thus cannot be confused with it. The absence of 0xFF (0377) also eliminates the need to escape this byte in Telnet (and FTP control connection).
    UTF-8 encoded text is larger than specialized single-byte encodings except for plain ASCII characters. In the case of scripts which used 8-bit character sets with non-Latin characters encoded in the upper half (such as most Cyrillic and Greek alphabet code pages), characters in UTF-8 will be double the size. For some scripts, such as Thai and Devanagari (which is used by various South Asian languages), characters will triple in size. There are even examples where a single byte turns into a composite character in Unicode and is thus six times larger in UTF-8. This has caused objections in India and other countries.
    It is possible in UTF-8 (or any other multi-byte encoding) to split or truncate a string in the middle of a character. This can result in an invalid string which some software refuses to accept. A good parser should ignore a truncated character at the end, which is easy in UTF-8 but tricky in some other multi-byte encodings.
    If the code points are all the same size, measurements of a fixed number of them is easy. Due to ASCII-era documentation where "character" is used as a synonym for "byte" this is often considered important. However, by measuring string positions using bytes instead of "characters" most algorithms can be easily and efficiently adapted for UTF-8. Searching for a string within a long string can for example be done byte by byte; the self-synchronization property prevents false positives.
    Some software, such as text editors, will refuse to correctly display or interpret UTF-8 unless the text starts with a byte order mark, and will insert such a mark. This has the effect of making it impossible to use UTF-8 with any older software that can handle ASCII-like encodings but cannot handle the byte order mark. This, however, is no problem of UTF-8 itself but one of bad software implementations.

Comparison with other multi-byte encodings

    UTF-8 can encode any Unicode character. Files in different scripts can be displayed correctly without having to choose the correct code page or font. For instance Chinese and Arabic can be supported (in the same text) without special codes inserted or manual settings to switch the encoding.
    UTF-8 is self-synchronizing: character boundaries are easily identified by scanning for well-defined bit patterns in either direction. If bytes are lost due to error or corruption, one can always locate the next valid character and resume processing. If there is a need to shorten a string to fit a specified field, the previous valid character can easily be found. Many multi-byte encodings are much harder to resynchronize.
    Any byte oriented string searching algorithm can be used with UTF-8 data, since the sequence of bytes for a character cannot occur anywhere else. Some older variable-length encodings (such as Shift JIS) did not have this property and thus made string-matching algorithms rather complicated. In Shift JIS the end byte of a character and the first byte of the next character could look like another legal character, something that can't happen in UTF-8.
    Efficient to encode using simple bit operations. UTF-8 does not require slower mathematical operations such as multiplication or division (unlike Shift JIS, GB 2312 and other encodings).
    UTF-8 will take more space than a multi-byte encoding designed for a specific script. East Asian legacy encodings generally used two bytes per character yet take three bytes per character in UTF-8.

Comparison with UTF-16

    Byte encodings and UTF-8 are represented by byte arrays in programs, and often nothing needs to be done to a function when converting from a byte encoding to UTF-8. UTF-16 is represented by 16-bit word arrays, and converting to UTF-16 while maintaining compatibility with existing ASCII-based programs (such as was done with Windows) requires every API and data structure that takes a string to be duplicated, one version accepting byte strings and another version accepting UTF-16.
    Text encoded in UTF-8 will be smaller than the same text encoded in UTF-16 if there are more code points below U+0080 than in the range U+0800..U+FFFF. This is true for all modern European languages.
        Text in (for example) Chinese, Japanese or Devanagari will take more space in UTF-8 if there are more of these characters than there are ASCII characters. This is likely when data mainly consist of pure prose, but is lessened by the degree to which the context uses ASCII whitespace, digits, and punctuation.[nb 1]
        Most of the rich text formats (including HTML) contain a large proportion of ASCII characters for the sake of formatting, thus the size usually will be reduced significantly compared with UTF-16, even when the language mostly uses 3-byte long characters in UTF-8.[nb 2]
    Most communication (e.g. HTML and IP) and storage (e.g. for Unix) was designed for a stream of bytes. A UTF-16 string must use a pair of bytes for each code unit:
        The order of those two bytes becomes an issue and must be specified in the UTF-16 protocol, such as with a byte order mark.
        If an odd number of bytes is missing from UTF-16, the whole rest of the string will be meaningless text. Any bytes missing from UTF-8 will still allow the text to be recovered accurately starting with the next character after the missing bytes.

" ---

Python's new f-strings

eg

"

print(f'{batch:3} {epoch:3} / {total_epochs:3} accuracy: {numpy.mean(accuracies):0.4f}±{numpy.std(accuracies):0.4f} time: {time / len(data_batch):3.2f}')

" -- [2]

note:

---

[3]

" Be Careful with Python's New-Style String Format

...you can access attributes and items of objects

...whoever controls the format string can access potentially internal attributes of objects...

Here is an example from a hypothetical web application setup that would leak the secret key:

    CONFIG = {
        'SECRET_KEY': 'super secret key'
    }

    class Event(object):
        def __init__(self, id, level, message):
            self.id = id
            self.level = level
            self.message = message

    def format_event(format_string, event):
        return format_string.format(event=event)

If the user can inject format_string here they could discover the secret string like this:

    {event.__init__.__globals__[CONFIG][SECRET_KEY]}

"

he goes on to construct and suggest use of a 'safe_format' alternative
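
A minimal sketch of the idea behind such a safe formatter (not Armin's actual implementation): subclass string.Formatter and refuse attribute/item traversal in untrusted format strings.

    from string import Formatter

    class SafeFormatter(Formatter):
        def get_field(self, field_name, args, kwargs):
            # Block '{obj.attr}' and '{obj[key]}' style traversal outright.
            if "." in field_name or "[" in field_name:
                raise ValueError("attribute/item access not allowed: %r" % field_name)
            return super().get_field(field_name, args, kwargs)

    fmt = SafeFormatter()
    print(fmt.format("event {id} at level {level}", id=42, level="warn"))
    # fmt.format("{event.__init__.__globals__}", event=object())  -> raises ValueError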

---

in Python:

name = "Fred" f"He said his name is {name}."

---

however, regarding Python's "f-strings", note that it's easier to type %s than {}
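
For reference, the three formatting styles side by side (a trivial sketch):

    name, value = "pi", 3.14159
    print("%s = %.2f" % (name, value))          # printf-style
    print("{} = {:.2f}".format(name, value))    # str.format
    print(f"{name} = {value:.2f}")              # f-string, Python 3.6+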

---

earenndil 49 days ago [-]

sds? https://github.com/antirez/sds

---

[–]djmattyg007 256 points 3 months ago

My number one gripe with C is how painful string handling is. There are so many ways to shoot yourself in the foot just trying to perform basic string manipulations like splitting based on a specific character.

Does C2 do anything to improve on this situation?

[–]qiwi 73 points 3 months ago

Without adding language level garbage collection or a complex lifetime management system (Rust) I'm not sure what you can do short of adding a std::string like system. And now you want to split strings, so maybe you'll want a vector too, and suddenly you start creating unique iteration macros to make it easier etc.

There are probably 1000 libraries out there trying to fix this in their own way. Here's one I looked at some time ago: https://github.com/faragon/libsrt

[–]dvirsky 37 points 3 months ago

It's worth checking sds, the string handling library from redis, that's completely standalone and very nice to use.

https://github.com/antirez/sds

[–]eresonance 7 points 3 months ago

Second this, I use it when I have to deal with strings.

---

"The built-in String type can now safely hold arbitrary data. Your program won’t fail hours or days into a job because of a single stray byte of invalid Unicode. All string data is preserved while indicating which characters are valid or invalid, allowing your applications to safely and conveniently work with real world data with all of its inevitable imperfections."

[4]

---

Rust says: " Note: String slice range indices must occur at valid UTF-8 character boundaries. If you attempt to create a string slice in the middle of a multibyte character, your program will exit with an error. For the purposes of introducing string slices, we are assuming ASCII only in this section; a more thorough discussion of UTF-8 handling is in the “Strings” section of Chapter 8. " [5]

this sort of thing is another reason (besides the binary bloat) that i want to just stick to ASCII for the core library: there's just a lot more API complexity in unicode handling.

---

" Many (most? all?) languages have an approximation or equivalent of the venerable sprintf, whereby variable input is formatted according to a format string. Rust’s variant of this is the format! macro (which is in turn invoked by println!, panic!, etc.), and (in keeping with one of the broader themes of Rust) it feels like it has learned from much that came before it. It is type-safe (of course) but it is also clean in that the {} format specifier can be used on any type that implements the Display trait. I also love that the {:?} format specifier denotes that the argument’s Debug trait implementation should be invoked to print debug output. More generally, all of the format specifiers map to particular traits, allowing for an elegant approach to an historically grotty problem. There are a bunch of other niceties, and it’s all a concrete example of how Rust uses macros to deliver nice things without sullying syntax or otherwise special-casing. None of the formatting capabilities are unique to Rust, but that’s the point: in this (small) domain (as in many) Rust feels like a distillation of the best work that came before it. " [6]

---

" include_str! ...

  I really like the syntax that Rust converged on: r followed by one or more octothorpes followed by a quote to begin a raw string literal, and a quote followed by a matching number of octothorpes to end the literal, e.g.:
    let str = r##""What a curious feeling!" said Alice"##;

This alone would have allowed me to do what I want, but still a tad gross in that it’s a bunch of JavaScript? living inside a raw literal in a .rs file. Enter include_str!, which allows me to tell the compiler to find the specified file in the filesystem during compilation, and statically drop it into a string variable that I can manipulate:

        ...
        /*
         * Now drop in our in-SVG code.
         */
        let lib = include_str!("statemap-svg.js");
        ...

So nice! Over the years I have wanted this many times over for my C, and it’s another one of those little (but significant!) things that make Rust so refreshing. "

(the commentator notes that in another language, before porting their program to Rust, they saved the string in a resource file and then opened and read that file at runtime; but that's annoying, because then they have to get into the details of where the resource file is installed in the user's installation)

---

some thoughts on string encodings in here, i haven't read it:

https://lobste.rs/s/7hrgbb/how_string_rust

---

~ Corbin 5 hours ago

link flag

To pick one of my favorite examples, I talked to the author of PEP 498 after a presentation that they gave on f-strings, and asked why they did not add destructuring for f-strings, as well as whether they knew about customizeable template literals in ECMAScript, which trace their lineage through quasiliterals in E all the way back to quasiquotation in formal logic. The author knew of all of this history too, but told me that they were unable to convince CPython’s core developers to adopt any of the more advanced language features because they were not seen as useful.

I think that this perspective is the one which might help you understand. Where you see one new feature in PEP 498, I see three missing subfeatures. Where you see itertools as a successful borrowing of many different ideas from many different languages, I see a failure to embrace the arrays and tacit programming of APL and K, and a lack of pattern-matching and custom operators compared to Haskell and SML.

https://lobste.rs/s/dpdmcg/structural_pattern_matching_python_3_10#c_hkemny

---

discussion on low-level string representations: https://www.reddit.com/r/C_Programming/comments/nqkn93/comment/h0c6kt2/?utm_source=reddit&utm_medium=web2x&context=3

---