notes-computer-net-evadingZookosTriangle

See also [1].

There are various attempts to evade Zooko's triangle (which, remember, isn't a theorem, just an informal hypothesis). The ones i am aware of rely on a globally shared mapping between numbers and human-meaningful names.

Related: if you're looking at this for passwords, another option is a passmaze ( https://eprint.iacr.org/2005/434.pdf ).

General notes

An 8-character password that must contain at least one of each of a lower case letter, an upper case letter, a digit, and a punctuation character has about 10^15 possibilities (about 50 bits) [2].
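The figure above can be sanity-checked by inclusion-exclusion. This is only a sketch: the class sizes (26 lower case, 26 upper case, 10 digits, and the 32 ASCII punctuation characters in Python's `string.punctuation`) are my assumption and are not taken from [2]. With those sizes the count comes out on the order of 10^15, a little over 50 bits.

```python
import math
import string
from itertools import combinations

def count_passwords(length, class_sizes):
    """Count strings of `length` characters that use every class at least once.

    Inclusion-exclusion: for each subset of classes that is forbidden,
    count the strings over the remaining alphabet, with alternating sign.
    """
    alphabet = sum(class_sizes)
    count = 0
    for k in range(len(class_sizes) + 1):
        for excluded in combinations(class_sizes, k):
            count += (-1) ** k * (alphabet - sum(excluded)) ** length
    return count

# Assumed character classes: printable ASCII without the space character.
CLASS_SIZES = [
    len(string.ascii_lowercase),  # 26
    len(string.ascii_uppercase),  # 26
    len(string.digits),           # 10
    len(string.punctuation),      # 32
]

n = count_passwords(8, CLASS_SIZES)
bits = math.log2(n)
print(n, round(bits, 1))
```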

An IPv4 address is 32 bits. An IPv6 address is 128 bits.

Mapping numbers to lists of existing natural-language words

The average receptive vocabulary of an adult native English speaker may be around 10000 words [3] [4] (although some give estimates of 40000 [5]). i can't so easily find summary estimates for the active vocabulary size of native speakers; http://www.robwaring.org/papers/phd/ch2.html contains some, mostly ranging from 50% to 90% of passive vocabulary size, with many of the results between 60% and 80%. However, the passive vocabulary size needed to speak English fluently may be as small as 2500 words [6], and i sometimes see references to vocabularies of 3000 and 5000 words [7].

Human short-term memory can hold about five words ("The number of chunks a human can recall immediately after presentation depends on the category of chunks used (e.g., span is around seven for digits, around six for letters, and around five for words)," -- [8]). If there are 8000 words in the table, five words can map to any 64-bit number. If there are 2500 words in the table, five words can map to any 56-bit number, and three words can map to any 32-bit number.
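The arithmetic above is just base conversion: a phrase of n words from a table of W words is an n-digit number in base W. A minimal sketch, in which the function names and the placeholder word list are hypothetical (a real table would hold curated words):

```python
def bits_covered(table_size, num_words):
    """Largest b such that every b-bit number fits in num_words words."""
    return (table_size ** num_words).bit_length() - 1

def encode(number, words, num_words):
    """Write `number` as num_words base-len(words) digits, most significant first."""
    base = len(words)
    if not 0 <= number < base ** num_words:
        raise ValueError("number does not fit in that many words")
    phrase = []
    for _ in range(num_words):
        number, digit = divmod(number, base)
        phrase.append(words[digit])
    return phrase[::-1]

def decode(phrase, words):
    """Inverse of encode."""
    index = {word: i for i, word in enumerate(words)}
    number = 0
    for word in phrase:
        number = number * len(words) + index[word]
    return number

# Placeholder table of 2500 entries, standing in for a real word list.
WORDS = [f"word{i}" for i in range(2500)]

print(bits_covered(8000, 5), bits_covered(2500, 5), bits_covered(2500, 3))
print(encode(0xDEADBEEF, WORDS, 3))
```

With a 2500-word table, three words leave a little headroom beyond 32 bits, so every IPv4 address fits.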

Criteria for the word list may include commonly used words, words with few letters, words with few syllables, words with simple, concrete meanings, words that are easy to pronounce, words that are semantically dissimilar to all other words on the list, words that sound different from all other words on the list, words that are not prefixes of other words on the list.

Some potential word lists:

A disadvantage of this compared to made-up words is that it takes longer to type a string of words than one made-up word.

Mapping numbers to made-up words

Criteria may include words that are short, words that are pronounceable for speakers of one language, words that are pronounceable for speakers of many languages, words with a single, obvious, correct pronunciation, words whose distance from other words in the mapping is large in terms of spelling or syllable choice, words whose distance from other words in the mapping is large in terms of their sound, words that 'sound like' natural language words in one or more languages, words that have a certain meter.

Regarding the criterion 'words that are pronounceable for speakers of many languages', see the comments on phoneme choice by Daniel Clelland, Mlatu, and Sam Atman in the notes from the discussion on changing Urbit's phonemic mapping, below.

Proposals:

Discussion

Seems like it's going to be hard to get memorability for even 64 bits, much less more than 64 bits:

32-bits seems feasible, though; that's only three words in a 2500-word table.

One crazy idea: it might be easier to remember a few words AND a short made-up word, rather than many words, or a long made-up word. So, i might use the Tirosh word list for 3 words (for the first 32 bits) plus either a simplified version of one of Clelland's alternatives (simplified because most applications don't demand that prefixes not be eyeballable, in fact in many cases that would be a benefit) or possibly rufus-mnemo, which i haven't looked at (for the last 32 bits; 6 syllables for Clelland). (or maybe the made-up word should go first, i dunno; in fact, if it were me, i'd probably put the made-up word first).

Notes

notes from discussion on changing urbit's phonemic mapping

https://groups.google.com/forum/#!msg/urbit-dev/zW3rgpX_AxQ/9l3VDKHzDlQJ


 	daniel.clelland 	10/11/13 Re: [urbit] Re: phonetic base I too can't stand seeing the y used the way it is. I would also suggest avoiding the use of the consonants c, q, and x as these can have fairly ambiguous pronunciations.

In fact, if you go through the English alphabet, it is possible to select characters for which the phonetics are reasonably unambiguous:

Unvoiced stops: p t k
Voiced stops: b d g
Nasals: m n
Fricatives: f s h
Voiced fricatives: v z
Liquids: l
Vowels: a e i o u

Omitted: c j q y w r
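(Editor's sketch, not part of the thread.) As a rough check on how much room this inventory leaves, the following counts the syllables it can form. The letter sets are copied from the list above; that one CVC syllable carries one byte, and one CVCVCV word two bytes, is taken from later in the discussion. The rest is plain arithmetic, not anyone's actual proposal.

```python
CONSONANTS = "ptkbdgmnfshvzl"  # stops, nasals, fricatives, and the liquid listed above
VOWELS = "aeiou"

cv_count = len(CONSONANTS) * len(VOWELS)   # open syllables (CV)
cvc_count = cv_count * len(CONSONANTS)     # closed syllables (CVC)
cvcvcv_count = cv_count ** 3               # three open syllables in a row

print("CVC syllables:", cvc_count, "needed for one byte:", 2 ** 8)
print("CVCVCV words:", cvcvcv_count, "needed for two bytes:", 2 ** 16)
```

Both formats have spare capacity, which is what leaves room to discard awkward-sounding combinations.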

For comparison your existing system uses:

Vowels: a e i o u y
Consonants: b c d f g h l m n p r s t v w x z

Omitted: k j q

More research could easily be put into this in order to choose phonemes which remain unambiguous across a larger number of languages, as well as choosing appropriate onsets, vowels, and codas for each alternating syllable in order to preserve rhythm, avoid dissonant consonant clusters, and avoid sounding too soulless and mechanical. At this point it's art.

Also, I'd say the five vowels 'a e i o u' are kind of a good thing to settle on as quite a few languages seem to do the whole five vowels thing, esp. ones that use syllabaries like Japanese, Korean, and the Indic languages. Mandarin I believe has these five plus ü. With those five vowels you've just catered a pretty sizeable chunk of the world population.

---

Purely as a counterexample, years ago I made a pretty crappy PHP thing that turned 28-bit numbers into Carollesque-sounding English nonsense words, e.g. 74,609,592 and 22,998,382 output "glundtooth brovsmoths" (which I think has a whole lot more flavour than something like '~hidret-matped'). Check it out:

http://protonome.com/angkode.php

(I think I chose 28 bits because I was trying to encode a pair of 14-bit lat/long coords with it. Now that I look at this thing again, I realise that my crappy process isn't even reversible, but there you go)


daniel.clelland 10/11/13 Re: [urbit] Re: phonetic base I tell you, I'm glad I found something as crazy as Urbit to work on...

Anyhow, roger that, I'll start by setting some initial requirements:

Things to keep:

1. zod and doz as 0 and 255.
2. CVC format
3. Simple algorithm, speedy calculation

Things to change, in order:

1. Optimise the set of vowels and consonants for maximum compatibility with world languages (I'll probably have a look at Spanish, Hindi and the CJK languages, plus perhaps French, German, Russian and Arabic).
2. Optimise the selection of onsets and codas to select for good consonant clusters between syllables*
3. Optimise the selection of vowels to retain a sense of stress and rhythm (English is already quite good at this)

A solution I'd be happy with would simply be to come up with a set of consonants which can be dropped straight into place in the existing code, but which has a noticeably better flow to it. Will be looking at more complex options, however.

Any objections?

Also, Curtis, could I ask how you generated your current set? Surely they didn't just fall off the back of a Martian truck.


Notes on consonant clusters:

---

mlatu 10/12/13 Re: [urbit] Re: phonetic base may i point you to the consonant and vowel set of lojban? its not perfect, as l and r might be difficult to pronounce but i find the rest quite easy to pronounce. http://lojban.org/publications/level0/brochure/phonapp.html

Am Samstag, 12. Oktober 2013 04:58:13 UTC+2 schrieb daniel.clelland:

https://github.com/dclelland/scratch/blob/master/urbit-phonetics.md

---

sam atman 10/13/13 Re: [urbit] Re: phonetic base Cool!

I'll note that we hit on very similar patterns viz. what letters are usable and what aren't.

The differences are minor, but worth discussing at some point. My scheme is designed to avoid conflation; I'll add more to the page at some point about how that works in Real Life.

Most importantly, your scheme won't do what you want it to. As you noticed, the phonemic scheme is little-endian, so that ~dozzod is 0, ~doznec 256, ~dozbud 512 and so on.

I think with CVCVCV you wanted "CVC" for the bottom 0-255 and then CVCVCV for the two byte numbers. It doesn't work that way on purpose, so that adjacent numbers have the same suffix but different prefix. That's so if you have two destroyers which are adjacent, they look like, say, ~lisbes and ~sogbes, not ~besfoo and ~besbar.

Using CVCVCV means you can slip `e` back in, but there are a few problems. A minor one is that it is pronounced CV/CV/CV in most cases, which means the mouth boundaries aren't the byte boundaries. Who cares? Not I and not a computer.

More importantly, you've traded a pair of trochees for a pair of dactyls. Instead of solid names like Thomas Miller and Dylan Grossman, we now have Aleister Futterbarm and Engelbert Humperdinck.

A submarine has 24 syllables rather than the already weighty 16. A ticket, at 8 syllables, can almost be held in short-term memory: bumping it up to twelve thoroughly exceeds our stack, which can deal with seven-digit phone numbers but cannot, as a rule, handle 16 digit credit cards without explicit memorization.
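(Editor's sketch, not part of the thread.) The syllable counts above follow from how many syllables each scheme spends per 16 bits: two for CVCCVC, three for CVCVCV. The 64-bit ticket size is stated later in the thread; the 128-bit submarine size is inferred from "16 syllables" at one byte per syllable, so treat it as an assumption.

```python
def syllable_count(bits, syllables_per_16_bits):
    """Syllables needed to spell a `bits`-wide number, 16 bits per word."""
    if bits % 16 != 0:
        raise ValueError("expected a whole number of 16-bit words")
    return bits // 16 * syllables_per_16_bits

CVCCVC = 2  # current scheme: two closed syllables per 16 bits
CVCVCV = 3  # proposed scheme: three open syllables per 16 bits

for name, bits in [("ticket", 64), ("submarine", 128)]:
    print(name, syllable_count(bits, CVCCVC), syllable_count(bits, CVCVCV))
```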

Also, waaaay at the bottom where it deserves to be: that sounds nothing like Polynesian, because it has no VV constructions, which are a signature of the language. It looks more like transliterated mBantu, except for the part where everything is exactly three syllables. It certainly flows off the tongue quite merrily, at least for this native English speaker.

cheers, -Sam.

I still like the scheme we have slightly better, for a couple of reasons. One, it has of course fewer syllables, which I think affects memorization cost. Two, CVCCVC sounds more like English, and most early adopters are English speakers rather than... I don't know, your system kind of has a Swahili feel to me.

Frankly, it pisses me off that we can't just use them both. But...

On Sun, Oct 13, 2013 at 7:09 AM, daniel.clelland <daniel....@gmail.com> wrote:

---

 	daniel.clelland 	10/14/13 Re: [urbit] Re: phonetic base I've just completed a CVCCVC-format system, check it out:

https://github.com/dclelland/scratch/blob/master/urbit-phonetics-b.md

Unfortunately, I think that it's that very same systematicism that does this mapping in. The mapping we've got now is much less algorithmically elegant. But since we're optimizing for the result of the mapping rather than the elegance of its construction, I think the best bespoke mapping will beat the best systematic mapping. Is our mapping the best possible? Definitely not. But I think it still beats this one.

We tried to optimize for a. distance between names, and b. sounding reasonably English (or, memorability to English speakers). I think any candidate alternate has to win on both. This one is pretty close on a. but nowhere near on b.

    -- 
    You received this message because you are subscribed to the Google Groups "urbit" group.
    To unsubscribe from this group and stop receiving emails from it, send an email to urbit-dev+...@googlegroups.com.
    To post to this group, send email to urbi...@googlegroups.com.
    For more options, visit https://groups.google.com/groups/opt_out.

---

Curtis Yarvin 10/14/13 Re: [urbit] Re: phonetic base Yes, definitely. Getting a mathematical mapping is not critical. So, a customized mapping is always going to beat a mathematical one - even though you use math to generate the original. Also, "sounding like English" is a definite goal. In practice, it is more important to sound like English than to cater to non-English speakers, although ideally we hit both.

I want to clarify the process by which we'll decide if we accept a change like this. (Please don't use the word "bikeshedding" - I'm sorry, but I know bikeshedding when I see it.)

Urbit may have emperors but it's still a republic - as in _res publica_. Therefore, in cases where the BDFL has a conflict of interest - such as this one (because making the change would be a bunch of boring ass work for me) I propose a simple procedure as follows:

(1) The House votes on teh bill. The House consists of: everyone who's made a commit, one man one vote.

(2) The Senate votes on teh bill. The Senate consists of: all carriers held by individuals, one carrier one vote. The Senate needs to think very strongly about defying the will of the House.

I reserve the right to defy teh House and teh Senate. But, I would need to think *very* strongly about that. For phonemes... no way.

---

 	mlatu 	10/15/13 Re: [urbit] Re: phonetic base that one looks good and is, as john said, surprisingly lojbo in appearance (though i miss the ocasional x :P though i suppose many people have problems with that or might get a sore throat)

is it ok that dar, zuv, kar, nar, lan, gun, ped and deb (only checked the first row of even sinistra) are all in multiple blocks (odd/even Sinistra/Dextra)? (sorry if the answer to the question is in the text describing the process of generating those syllables... was more interested in the result)

Am Dienstag, 15. Oktober 2013 06:39:26 UTC+2 schrieb daniel.clelland:

---

 	curtis 	Feb 25 Re: [urbit] Re: phonetic base Anton,

I think most people are happy with the phonemes at this point. In fact, there is a constituency building for a continuity breach even though we don't need one, because some people have sunk their ships and want their old names back.

It's a little tricky because it would screw up the tickets as well, but if there's one thing I'd like to change it's simply the syllable 'por' - given that a lot of the back syllables start with 'n', we get the likes of 'pornyx'. Which is for some people. But not, you know, others.

I also need to take this opportunity to reiterate that the 'y' is never a schwa. The official pronunciation of 'tasfyn-partyv' rhymes with "that's wine part hive." I can't punish anyone for not using the correct official pronunciation, but I can threaten...

On Mon, Feb 24, 2014 at 10:30 PM, Anton Dyudin <antec...@gmail.com> wrote:

    gruelty,
    Why wouldn't it be reversible? I'm uncertain what function the odds/evens provide without having access to the linked spec, but that's easily disambiguated by looking at the third character; after that, you have remainders modulo 5, 7, and 12, which are mutually prime and consequently uniquely specify one modulo 420 (in this case we're supposed to have a number 0-255). As for reconstructing it, even simply repeatedly adding 7 and then 35 has at worst under 20 iterations.
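(Editor's sketch, not part of the thread.) The reconstruction described above can be written out directly. How the three remainders are read off the letters of a syllable depends on the linked spec, which isn't available here, so this only covers the last step: recovering a number in 0-255 from its remainders modulo 5, 7, and 12 by repeated addition. The function name is hypothetical.

```python
def reconstruct(r5, r7, r12):
    """Recover the unique x in 0..419 with the given remainders mod 5, 7, 12.

    Returns (x, steps), where steps is the number of additions performed.
    """
    x = r7          # already correct modulo 7
    steps = 0
    while x % 5 != r5:
        x += 7      # keeps x mod 7 fixed, cycles through residues mod 5
        steps += 1
    while x % 12 != r12:
        x += 35     # keeps x mod 5 and x mod 7 fixed, cycles mod 12
        steps += 1
    return x, steps

n = 200
x, steps = reconstruct(n % 5, n % 7, n % 12)
print(x, steps)
```

The first loop runs at most 4 times and the second at most 11, which is consistent with the "under 20 iterations" bound in the message.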

---

 	Bruce Schwartz 	Feb 25 Re: [urbit] Re: phonetic base I thought the stuff daniel.clelland was working on was pretty promising. The current phonemes are a bit Martiany for my taste. I'd like something more Earthy.

I've wondered how hard it would be to switch, or to even maintain two mappings at once. If there were no first syllable collisions between two systems it ought to be pretty easy to figure out which mapping is being used while parsing. Or the new mapping could have a different prefix than a single ~.

---

daniel.clelland Feb 25 Re: [urbit] Re: phonetic base I would just like to assert that I still believe my original CVCVCV proposal, while increasing the number of syllables from two to three, makes up for this with phonetic unambiguity as well as colour, character, and flavour.

~gem
~higezo
~vurida-miteho
~domiha-darine-zoduhi-dapima
~hunivu-dapudo-deyede-gaduba--zufetu-tuyovu-toramu-limiva

---

	Curtis Yarvin 	Feb 26 Re: [urbit] Re: phonetic base This is an excellent discussion! Here are my several takes on the question.

One, the phonemic system mapping does not of course bar the possibility that these addresses will be hidden by a human-meaningful namespace of some sort - or any other lookup system that does not require you to remember ~hidret-matped or whatever.

The goal is to take the weight off any lookup system at the next level, by rendering it useful but not necessary. Of course it will always be useful. But the less necessary it is, the better.

Two, I will repeat my objection to Daniel's six-syllable names - which I like a lot and have a vaguely Japanese sound to my ear. And everything Japanese is cool. But. Syllable count is an important aspect of word memory - MRI studies of people reading silently show the vocal areas of their brains being activated. In all languages, poetic forms originally evolved as a way of making Homeric epics and the like something a person could memorize - without a written text at all.

The four-syllable CVCCVC-CVCCVC form we have at present scans naturally as a trochaic (FOObar-MOObaz) double foot. Every time I hear people say an Urbit name aloud, it always comes out as a trochee, even if they have no idea what a trochee is. This is by design. Meter in English works by stressed syllables (which are absent in many languages), and lots and lots of people have brain hardware for translating English into noise and vice versa. The natural meter of an English line is four feet, which corresponds to a 64-bit atom in @p. Ie, your ticket. 64 bits is long enough for any secret that can't be tested offline, so it's very nice to be able to use your English language hardware to remember a secure auto-generated password. (Of course there are other ways to solve this problem, eg, diceware.)

There is also a natural scansion for "vurida-miteho", of course - "vuRIda-miTEho." This is a dual amphibrach. Wikipedia reports: "In English accentual-syllabic poetry, an amphibrach is a stressed syllable surrounded by two unstressed syllables. It is rarely used as the overall meter of a poem, usually appearing only in a small amount of humorous poetry, children's poetry, and experimental poems." This is not a coincidence - it's because an amphibrach sounds kind of funny and unusual.

So despite being the same number of ASCII characters, Daniel's design has a higher level of cognitive load IMHO.

Three, it's absolutely essential that whatever scheme we end up with, we have an official IPA pronunciation guide. In practice, people will ignore this. But that's fine, it's just because people suck. But not all people suck, so it's essential to give them the opportunity not to suck...

---

sam atman Feb 26 Re: [urbit] Re: phonetic base

On Wed, Feb 26, 2014 at 9:22 AM, Curtis Yarvin <curtis...@gmail.com> wrote:

    This is an excellent discussion!  Here are my several takes on the question.
    Three, it's absolutely essential that whatever scheme we end up with, we have an official IPA pronunciation guide.  In practice, people will ignore this.  But that's fine, it's just because people suck.  But not all people suck, so it's essential to give them the opportunity not to suck...

My take on this hasn't shifted much: I think that 'y' is bad news and would like to see it gone. Any other objections are subtle and probably not worth pursuing. I'm not convinced two dactyls are harder or easier to memorize than two spondees, nor is one obviously easier to pronounce. Especially if one avoids 'e' to allow shwa insertion in difficult words. I agree with you about IPA and would go further: the vowels should be their IPA values exactly. In particular, you're using "y" for "i" and "i" for "ɪ" which is... very American.

Using four vowels means the eight to ten pronunciations Americans will lean on won't collide. As it is, I pronounce your personal ship identical to "tasfin-partiv" without active effort. The prefix "syn-" is always pronounced "sin", no exceptions. If you do a dictionary grep for the substring "yn" you'll see what I mean.

Moreover, it would be handy for a lot of reasons if the ship 'type' rendered the name at the last minute. Mostly for 'dumb' reasons like semantic highlighting, but I still like the idea of having a separate mapping to the Hanzi.

cheers, 'rizlus dopsim'

---

	curtis 	Feb 26 Re: [urbit] Re: phonetic base The 'y' *is* bad news. Because people will pronounce it wrong. But - I don't mind having one peculiarity like this in the system. It produces a kind of character - it lets the truly geeky follow the rules exactly and thus differentiate themselves from the less geeky, which geeky people love to do.

As for other renderings - of course, it's just a number, you can do whatever you like with a number. And yet: there are a lot of places where you want to call the print function, and all you have is the number. So there has to be a privileged canonical form.

---

 	nils 	Feb 26 Re: [urbit] Re: phonetic base Curtis Yarvin <cur...@tlon.io> writes:

> The 'y' *is* bad news. Because people will pronounce it wrong. But - I
> don't mind having one peculiarity like this in the system. It produces a
> kind of character - it lets the truly geeky follow the rules exactly and
> thus differentiate themselves from the less geeky, which geeky people love
> to do.

Prediction: Homophone attacks. Like the homoglyph attacks we have today.

Unicode domain names are a curse and need to be executed right after the POST method. If there's any ammo left after executing the POST method.

---