Bayle Shanks's website: proj-oot-ootAssemblyNotes10

perhaps, in addition to ALIASed memory locs, we should have memory locs that go thru getters/setters? That's kind of a generalization of boxing, come to think of it, or mb a special case (you call some program-specific boxed thingee whenever you want to read or write the boxed quantity; a different call for read and for write; the program can do the boxing/unboxing differently depending on which object it is). Also i guess this generalizes aliasing, because if it's boxed you can do aliasing in the boxing. Similarly, you can do the COW stuff this way. I like it.

i guess this really just IS boxing. But it's cool.

Maybe i'm persuaded to make boxing the other addr mode bit.

---

so i removed the following from the assembly spec:

To explicitly indicate addressing mode, use #MODE (where MODE is either an unsigned decimal integer or hex literal) before the operand, for example:

ADD #2 x, #0 2, #0 2

Note that when #MODE is used, you can use any of the previous syntaxes; the result will compiled into a 12-bit operand according to the above addressing-mode syntaxes, but then the actual addressing mode will be set via #MODE.

To use syntax to indicate the 3 low-order bits of the addressing mode but to manually indicate the high-order bit is 1, use '##', eg the following indicates the high-bit-set variants of register, indirect-read-post-increment-write-pre-decrement, indirect-read-post-increment-write-pre-decrement:

ADD ## x, ## POP, ## z--

in other words the previous is equivalent to:

ADD #12 x, #13 POP, #14 z--

or equivalently:

ADD #c x, #d POP, #e z--

ok so i was thinking above that since we can do so much with boxed mode (eg getters/setters, ALIAS, GC, dynamic type tags, COW), maybe i'll make it official. I guess the distinction between 'boxed' address modes and 'custom addr modes' is whether the addr mode itself gets passed to the user-provided custom function, or whether the function only gets told the effective address (if a write) or value (if a read) (hmm, what if there is a write to register 0, ie a DISCARD? does that even get passed? Probably not, right). So boxing is strictly less expressive than custom address modes. Which is fine; since it's already so powerful, and since i can't immediately think of any non-efficiency-related uses for full custom addr modes.

Now, these boxing functions. When we read or write in a boxed address mode, some custom code gets called. Is this custom code determined by:

the currently executing module
the address being accessed
the module that created the address subspace of the address being accessed
the value being accessed (or, the module that created that value)
the type of the value being accessed
the program being executed, once, statically
dynamically

The value being accessed would probably be the most expressive choice. But that would require passing around extra information with each such value. Otoh, the value is already 'boxed' anyways.

The static type of value being accessed is less dynamic but also very expressive. But that would require the implementation to care quite a bit about types, and to effectively support polymorphism. That's the kind of complicated thing that we wanted the Oot runtime to do for us, right?

The address being executed also sounds good, but it would require a big table of which boxing handler goes with which address.

The currently executing module sounds okay, but doesn't help us when data is being passed from one module to the next. It seems pretty useless for some library to be able to define its own boxing if that means that boxed values that it creates must be treated as opaque by others, and boxed values that it receives from others must be treated as opaque by it.

Dynamic sounds like it wouldn't be a ton of use (except for 'bootstrapping' type stuff) and would impede static optimization.

Choosing once per program sounds okay; this would aid static optimization, and it would allow us to fit in much of the other stuff into the per-program handler if we wanted it (which is better than making the Oot Bytecode implementer do it).

i kinda want it to be per subaddress space (non-directly-accessible address space), eg different spaces of the opaque integer addresses could have different boxing functions, but within each subaddress space it's the same.

How would that work? Maybe when you malloc, you pass in the boxing functions. Then either the opaque addresses are actually tuples that point to the boxing functions too, or the boxed objects in the space have fields that point to the boxing functions (actually, the latter idea is back to per-value, not per-address space).

This is sounding a lot like classes. 'Passing in boxing functions at malloc time' sounds a lot like 'constructing a new instance (which includes allocating for it) by specifying which class it is'. If we're doing something like that, we want to be able to determine the class of a given value at compile time, not only at runtime, because we want to be able to compile this stuff to eg Java, and because we want everything to have the efficiency of statically typed code at this low level, because we want the Oot implementation itself to be statically typed.

(later) how about the module that created the address subspace: perhaps this could be forced to be known at compile-time? The reason is that i was thinking about how we want to be able to 'monkeypatch' the primitive instructions on a per-module basis -- if that extended to subroutines called then it's kinda dynamic anyways but... mb not extend it to subroutines called for exactly that reason. does 'module that created the address subspace' really help that much over 'address subspace' anyways? it's not like this would make the compiler able to decide at compile-time how to specialize each callsite another argument in favor of 'address subspace' is that, since our 'pointer arithmetic' in address subspaces is logical rather than physical (+1 gets to the next field in the struct, not to the next byte), the implementation already has to do some dynamic indirection in order to do anything. for this reason, perhaps introduce a notion of 'address subspace type' and require that for each instruction, the 'address subspace type' of each operation for that instrution must be known at compile-time? the annotations specifying this sounds like a lot of additional data to lug around

Attaching the boxing functions per-instance sounds more like prototypes, or like late-bound dynamic typing OOP.

So right now it seems like the two best choices might be:

per-program
per-type

Per-program is simpler and less expressive, and per-type is more expressive. I guess the question is, how much harder to implement would per-type really be. I was already thinking about giving types to everything, but i hadn't really decided if stack maps etc would actually be mandatory, or if the VM does verification upon load (like the JVM). And how much would it slow down things, and make it harder to implement, for a naive interpreter implementation to be always looking up the type of everything? Wouldn't they have to be walking through the stack maps upon each instruction, or at least be maintaining some sort of parallel stack with type tags? Well, actually, it's no that hard, just keep a type tag with everything. Hey, it's already boxed, right? But now what about compilers. Yes, they have to keep up with the stack maps and find the type of everything, at least if they're compiling to a static target where they can't just make the target dynamically consult the type tags. But if they have a static target, then they're looking at the types anyways, right?

What about my idea that at the Oot Bytecode (i should start saying OVM, Oot VM) level, things should not be polymorphic? Because we don't want the inefficiency of runtime polymorphic dispatch, we don't want the inefficiency of recompiling everything multiple times, once for each type, we want the benefits of monomorphic inline caches. Well, i already said i want generics, so we can't have all that all the time.

And i'm still worried about compiling to non-polymorphic static languages like C. Well, the fact is that Oot Core will be polymorphic, so stuff in C is gonna get clunky anyways.

mb see: http://beust.com/weblog/2011/07/29/erasure-vs-reification/ http://stackoverflow.com/questions/879855/what-are-reified-generics-how-do-they-solve-type-erasure-problems-and-why-cant http://stackoverflow.com/questions/1927789/why-should-i-care-that-java-doesnt-have-reified-generics http://programmers.stackexchange.com/questions/176665/generics-and-type-erasure http://stackoverflow.com/questions/355060/c-sharp-vs-java-generics http://stackoverflow.com/questions/31693/what-are-the-differences-between-generics-in-c-sharp-and-java-and-templates-i https://news.ycombinator.com/item?id=8381870

up vote 14 down vote

This is an old question, there are a ton of answers, but I think that the existing answers are off the mark.

"reified" just means real and usually just means the opposite of type erasure.

The big problem related to Java Generics:

    This horrible boxing requirement and disconnect between primitives and reference types. This isn't directly related to reification or type erasure. C#/Scala fix this.
    No self types. JavaFX 8 had to remove "builders" for this reason. Absolutely nothing to do with type erasure. Scala fixes this, not sure about C#.
    No declaration side type variance. C# 4.0/Scala have this. Absolutely nothing to do with type erasure.
    Can't overload void method(List<A> l) and method(List<B> l). This is due to type erasure but is extremely petty.
    No support for runtime type reflection. This is the heart of type erasure. If you like super advanced compilers that verify and prove as much of your program logic at compile time, you should use reflection as little as possible and this type of type erasure shouldn't bother you. If you like more patchy, scripty, dynamic type programming and don't care so much about a compiler proving as much of your logic correct as possible, then you want better reflection and fixing type erasure is important.

" Mostly I find it hard with serialization cases. You often would like to be able to sniff out the class types of generic things getting serialized but you are stopped short because of type erasure. It makes it hard to do something like this deserialize(thingy, List<Integer>.class) – Cogman Aug 6 '14 at 23:10 "

http://beust.com/weblog/2011/07/29/erasure-vs-reification/ :

gotchas with type erasure:

type, replaced it with Object (since that’s exactly what’s happening behind the scenes):

Overloading public class Test<K, V> { public void f(K k) { }

  public void f(V v) {
  }} T.java:2: name clash: f(K) and f(V) have the same erasure public void f(K k) { ^ T.java:5: name clash: f(V) and f(K) have the same erasure public void f(V v) {

The workaround here is simple: rename your methods. Introspection public class Test { public <T> void f() { Object t; if (t instanceof List<T>) { ... } } } Test.java:6: illegal generic type for instanceof if (t instanceof List<T>) {}

There is no easy workaround for this limitation, you will probably want to be more specific about the generic type (e.g. adding an upper bound) or ask yourself if you really need to know the generic type T or if the knowledge that t is an object of type List is sufficient. Instantiation public class Test { public <T> void f() { T t = new T(); } } view source print ? Test.java:3: unexpected type found : type parameter T required: class T t = new T();

Generics is a complicated language feature. It becomes even more complicated when added to an existing language that already has subtyping. These two features don’t play very well together in the general case, and great care has to be taken when adding them to a language. Adding them to a virtual machine is simple if that machine only has to serve one language – and that language uses the same generics. But generics isn’t done. It isn’t completely understood how to handle correctly and new breakthroughs are happening (Scala is a good example of this). At this point, generics can’t be considered “done right”. There isn’t only one type of generics – they vary in implementation strategies, feature and corner cases.

What this all means is that if you want to add reified generics to the JVM, you should be very certain that that implementation can encompass both all static languages that want to do innovation in their own version of generics, and all dynamic languages that want to create a good implementation and a nice interfacing facility with Java libraries. Because if you add reified generics that doesn’t fulfill these criteria, you will stifle innovation and make it that much harder to use the JVM as a multi language VM.

I’m increasingly coming to the conclusion that multi language VM’s benefit from being as dynamic as possible. Runtime properties can be extracted to get performance, while static properties can be used to prove interesting things about the static pieces of the language.

Just let generics be a compile time feature. If you don’t there are two alternatives – you are an egoist that only care about the needs of your own language, or you think you have a generic type system that can express all other generic type systems. I know which one I think is more likely. '

" value types would be awesome. F# is a good example of such a language (scala and clojure aren't)

jdmichal 664 days ago [-]

> a compromise would perhaps be to allow at least reified primitive generics so that List<int> would be a proper runtime type whereas List<Object> would be the runtime version of all reference types in List<T>.

This is actually pretty much exactly what the .NET runtime does. All value types get a separately-JIT'd version of the type, while all reference types share the same Object-based version. See section on implementation here:

http://msdn.microsoft.com/en-us/library/ms379564(v=vs.80).as...

pron 664 days ago [-]

The work is already underway in Project Valhalla: http://openjdk.java.net/projects/valhalla/

Once it's been decided Java should get value types, generic reification became a more urgent necessity. "

alkonaut 664 days ago [-]

Mutable structs in c# are rare, rarely useful and often dangerous, but in some perf critical scenarios that hack is needed. Most notably the standard List<T>.Enumerator is a mutable struct because othrrwise an object would have to be created on the heap for the sole purpose of iterating a list.

Not sure whether this was actually one of the reasons for allowing mutable structs to begin with, the use case is critical enough that it might well have been. "

"Java uses the notion of type erasure to implement generics. In short the underlying compiled classes are not actually generic. They compile down to Object and casts. In effect Java generics are a compile time artifact and can easily be subverted at runtime.

C# on the other hand, by virtue of the CLR, implement generics all they way down to the byte code. The CLR took several breaking changes in order to support generics in 2.0. The benefits are performance improvements, deep type safety verification and reflection. "

Java: " When a generic type is instantiated, the compiler translates those types by a technique called type erasure — a process where the compiler removes all information related to type parameters and type arguments within a class or method. Type erasure enables Java applications that use generics to maintain binary compatibility with Java libraries and applications that were created before generics. "

C# " This design choice is leveraged to provide additional functionality, such as allowing reflection with preservation of generic types, as well as alleviating some of the limitations of erasure (such as being unable to create generic arrays). This also means that there is no performance hit from runtime casts and normally expensive boxing conversions. "

" Something that could be seen as a disadvantage of reifiable types (at least in C#) is the fact that they cause code explosion. For instance List<int> is one class, and a List<double> is another totally different, as it is a List<string> and a List<MyType?>. So classes have to be defined at runtime, causing an explosion of classes and consuming valuable resources while they are being generated. "

sounds like most everyone is saying that C#'s non-type-erasure/'reified-generics' system is better and that Java only did type erasure for backwards compatibility (so that programs with the new generics would still run on older JVMs). 'Reified generics' seem to mean that the VM is aware of the types somehow.

but:

Another advantage of erased generics, is that different languages that compile to the JVM employ different strategies for generics, e.g. Scala's definition-site versus Java's use-site covariance. Also Scala's higher-kinded generics would have been more difficult to support on .Net's reified types, because essentially Scala on .Net ignored C#'s incompatible reification format. If we had reified generics in the JVM, most likely those reified generics wouldn't be suitable for the features we really like about Scala, and we'd be stuck with something suboptimal. Quoting from Ola Bini's blog,

    What this all means is that if you want to add reified generics to the JVM, you should be very certain that that implementation can encompass both all static languages that want to do innovation in their own version of generics, and all dynamic languages that want to create a good implementation and a nice interfacing facility with Java libraries. Because if you add reified generics that doesn’t fulfill these criteria, you will stifle innovation and make it that much harder to use the JVM as a multi language VM.

Personally I don't consider the necessity of using a TypeTag? context bound in Scala for conflicting overloaded methods as a disadvantage, because it moves the overhead and inflexibility of reification from a global (whole program and all possible languages) to a use-site per language issue that is only the case less frequently.

-- http://programmers.stackexchange.com/a/212038/167249

"Writing programs in the presence of erasure *forces* you to avoid excessive coupling to runtime type knowledge, which is required if you actually want to write reusable code. " -- https://groups.google.com/forum/#!msg/scala-language/PV4q6O1qIh8/cy0vVFTMdr4J

ok so we probably want some sort of I/O in our 'primitive instructions'.

(1) because let's say you are writing a new Oot Bytecode implementation right now, the first thing you want to do is NOP, the next is to add 1+1. But after adding 1+1 you, the implementor, want to see if the result was right. So you can printout a dump of memory at the end of the program, but pretty soon you're going to want a PRINT command, or at least a LOG command.

2) In addition the idea was that a naive implementation would implement ONLY the primitives, but how are you going to have an Oot compiler or interpreter with no way to read the source code files or to write out object files? Again, you could just have your implementation have magic memory locations that it reads and writes to, but how will the standard Oot implementation know which those are? They have to be standardized, at which point you may as well have I/O instructions.

3) My belief is that the Turing machine abstraction falls a little short and what computers really are today are INTERACTIVE Turing-like machines. So we should put this interactivity into our primitive computational model.

4) We really want a LOG in there pretty early too.

so ppl say IOCP >? kqueue >> epoll >> poll > select. So should we implement something like IOCP or kqueue?

ppl say Erlang has better async I/O than Node [1] (that guy also seems to like Lua and Scheme, but says "You can make event-driven asynchronous systems pretty smoothly in languages with first class coroutines/continuations (like Lua and Scheme), but most libraries aren't written with that use case in mind")

On the one hand, one of the main features of Oot is supposed to be async I/O, and horizontal scalability/concurrency, so this stuff should be baked in pretty deep. On the other hand, the primitive instruction set does not have to be performant at all; performant implementations will override the stdlib instructions which is what everything else will actually use.

My picture of 'completion ports' is sort of like you add another layer of indirection. You create a new 'network' between you and the 'completionPort' entity. You then do blocking message sends and receives between you and Mr. completionPort. Mr. completionPort has DMA access to your memory; so when you want to send something to someone else, you put it in a buffer, then you send a message to Mr. completionPort telling him to read the buffer and send it to the counterparty at his leisure. Then when he gets around to it, he queues up a message for you telling you that he's finished. Eventually you check your mailbox and see that he's done, and (only) then you can reclaim the buffer and use it for something else.

I read http://joearms.github.io/2013/04/02/Red-and-Green-Callbacks.html and the point seems to be that the Erlang runtime handles the hard work of scheduling and doing async I/O but exposes to the Erlang program a logical view in which there are many separate threads doing blocking I/O; more discussion in proj-oot-ootNotes20; the MegaPipe? library speaks of this sort of thing approvingly.

Also, some of the stuff about kqueue is stuff that we can't do with primitives; eg have an array of mutations to the interest set rather than just sending one at a time; the primitives don't have arrays yet, that'll come later in the stdlib.

So perhaps all we really need is in fact SEND and RECV, or, OUT and IN.

But even these seem a little redundant; especially if we bring getters and setters into the Oot Bytecode realm, but even if we don't (because the implementation could still treat some "I/O" non-directly-accesible memory locations as effectively having getters and setters), then isn't the idea that ordinary assembly reading and writing to memory locations can do any 'read' and 'write' semantics that you need?

And as i noted above, IOCP actually does do DMA on your memory space; you allocate the buffer in your memory space but it is reading from or writing to that buffer while you are off doing something else. Which suggests, hey, why have SEND and RECV at all as a primitive, just have some area of (not directly accessible) virtual memory that is DMA. Then you have OPEN/CLOSE type commands to find the DMA memory, but that's it.

I recall that Shen's Kl (used to be Klambda?) had some basic file I/O, what does it do here?

http://shenlanguage.org/Documentation/shendoc.htm#Kl

looks like it has 4 fns: open, close, read-byte, write-byte. Sounds good.

i guess one thing for us to think about is message packets vs. streams. Do we write out a whole message, and then send it, or does each little bit of the message get sent as we write it?

also Go is said to have a good thing going with its channel abstraction. As i recall, channels can be sync or async, where sync is equal to async with a 0-length buffer.

also, one cool thing about one of those formal concurrent 'calculi' was that process IDs and channel IDs themselves (and what about channel port IDs, eg each side of the connection?) could be sent through channels.

(and what about unidirectional vs bidirectional channels?)

and what about FLUSH? Or SYNC? Or something to say when an outgoing buffer is in a consistent state (wouldn't that just be SEND?)?

guess i have to think about this a little more. Right now i've left IN, OUT, OPEN, CLOSE, LOG in there as primitive ops. We probably also need a SELECT or POLL or RECV or Erlang's pattern-matching receive or WAIT, eg something which blocks until it gets woken up by a message on some channel or matching some pattern, so i added that.

---

when implementing OVM, instead of discarding writes to a constant-mode destination operand, an implementation might just create a clone() of the constant in a temporary memory location and direct the writes there. That lets you just run the instruction as normal rather than snooping on each write.

But implementations CAN choose to actually discard, so instructions shouldn't assume that if they write to a memory location given as an input that that memory location will henceforth return the write; the write may have been discarded.

so instructions don't know which of the following will happen:

maybe they write to a memory location, and see their writes, but then the caller ignores them later
maybe they write to a memory location and don't even see their writes

this might not be worth it, as this prevents instructions from using the output to hold intermediate values that are also used elsewhere in the computation, which will sometimes be more efficient -- although on the other hand discarding these writes may be more efficient in other circumstances.

---

one idea for how to handle Views w/ boxing would be to let the output operand have full control; if boxed, the output operand provides the wrapper that can look at the other operands, and their addressing modes, and run the instruction.

of course some things have no or multiple output operands. We'll just say that operand1 is the one in control.

what if the output operand is unboxed but an input operand is boxed? Then we call the input operand's boxing function directly with a 'get'.

also, what if the output operand is the only one that's boxed? Mb then we call its boxing function directly with a 'set'.

Which constrains these boxing fns to only be boxing fns, not to actually be full custom addr modes.

the reasoning for letting it run the instruction in a wrapper is that if you are handling Views, some instructions might have different 'View signatures' eg whether or not the metadata in the box should be transmitted from input to output, and if it should, from which input. If we want such things to be extensible rather than primitive in OVM then the boxing code needs to be able to know the identity of the instruction. In which case, since it knows what the instruction is and what its inputs are and since it is executable code itself with total freedom to create the output, it could choose to run the instruction in a wrapper whether or not we want it to (although actually we don't HAVE to give it inputs which were unboxed).

however, just to make things a little simpler, we might choose to overlook that and run the instruction ourself and pass it the unboxed result.

also, instead of letting it directly write out the output, we could ask it to pass us the boxed output and we could write it into memory ourselves. In which case it doesn't even have to know the address it is going to be written to.

So the proposal is:

3 boxing fns:

get(): boxed value -> unboxed set1() if no boxed inputs: unboxed result -> boxed result set2(): unboxed result, operand2_isBoxed, operand3_isBoxed, operand2, operand3, instruction -> boxed result

---

should we make Oot Assembly isomorphic instead of homomorphic? Currently there are some 'synonyms' such as PUSH and --DS, NIL and constant 0, CURRENTMODULE and constant 1. One problem with making it truly isomorphic is that this would require eliminating small programmer conveniences such as whitespace and preprocessor constants. Note that the Webassembly textual format design doc claims that the text format is isomorphic to the binary one, but then it also says that "Multiple textual files can assemble to the same binary file, for example whitespace isn't relevant...", which suggests that it is only homomorphic.

so let's not make OA isomorphic

---

we absolutely need to include a module version in the module, so that two different versions of the same module can be loaded and used at the same time. This suggests that we include a string name, too.

---

so in order to actually implement OotB?, i'd have to actually define the file format, so i've been taking a quick look at what other languages use, and at stuff like ELF and PE. My initial impression is that these things often end up being 'overdesigned', that is, having a header with a lot of fields for generality/configurability, which end up never being used. Lua seems like one of the best in this regard (not having too much stuff that isnt used), but i think even Lua has some stuff about the bitwidth of things that is not actually used too often (although it is usable):

" A Lua 5.1 binary chunk header is always 12 bytes in size. Since the characteristics of a Lua virtual machine is hard-coded, the Lua undump code checks all 12 of the header bytes to determine whether the binary chunk is fit for consumption or not. All 12 header bytes of the binary chunk must exactly match the header bytes of the platform, otherwise Lua 5.1 will refuse to load the chunk. The header is also not affecte d by endianness; the same code can be used to load the main header of little-endian or big-endia n binary chunks. The data type of lua_Number is determined by the size of lua_Number byte and the integral flag together. In theory, a Lua binary chunk is portable; in real life, th ere is no need for the undump code to support such a feature. If you need undump to load all kind s of binary chunks, you are probably doing something wrong. If however you somehow ne ed this feature, you can try ChunkSpy’s? rewrite option, which allows you to convert a b inary chunk from one profile to another. Anyway, most of the time there is little need to ser iously scrutinize the header, because since Lua source code is usually available, a chunk can be readily c ompiled into the native binary chunk format. "

http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf

My sense is that what happens is that a platform is initially design to be portable (eg endian-agnostic) because it doesnt know where it will take off, but then as a platform's adoption takes off, it solidifies around one primary subplatform (eg x86, which is little endian), after which point the implementation and tooling needed to support the initial portability stops being updated/produced (eg most of the tooling stops supporting any value for the endian flag except LE), after which point the original flexibility is lost and there end up being vestigial stuff in the file format.

---

"We Need Hardware Traps for Integer Overflow" -- [2]

---

" ...

and 1-operand ISAs require far more instructions for data movement, while 3-operand ISAs may waste encoding space if much of the time, one source operand does not need to be preserved. 2 operands is a natural compromise, and this is what e.g. ARM Thumb does.

This is why I find the description of "compressed RISC-V" linked in the article ( http://www.eecs.berkeley.edu/~waterman/papers/ms-thesis.pdf ) interesting - benchmark analysis shows that 8 registers are used 60% of the time, and 2-operand instructions are encountered 36/31% statically/dynamically. These characteristics are not so far from those of an ISA that has remained one of the most performant for over 2 decades: x86. It's a denser ISA than regular RISCs, and requires more complex decoding, but not as complex as e.g. VAX. " [3]

"RISC-V papers found x86 to not be very dense. Average instruction is 4 bytes, intruction count is very low. 2 operand instructions are a little less flexible than 3" -- [4]

---

"It's important to remember C was designed for the processors of the time, whereas processors of today (RISC-V included) are arguebly primarily machines to run C. C has brought a lot of good but also a lot of bad that we are still dealing with: unchecked integer overflows, buffer under- and overflow, and more general memory corruption. No ISA since the SPARC even tries to offer support for non-C semantics. "

"unchecked integer overflows...This is my one big beef with the RISC-V ISA"

---

" RISC-V is carefully designed compromise; it scales down to extremely cheap cores and up to superscalar. Like Alpha before it, extreme attention has been paid to avoid features/choices that would be bad for OoOE? implementations. Some examples:

rs1, rs2, rd fields are always in the same location and all register sources and destinations are explicit (makes decoding faster and you can start fetching/renaming without having decoded)
there are no branch delay slots
instructions produce at most a single result
no condition codes etc (dependencies are explicit)
the sign-bit for all immediate fields is in a fixed location (cheaper sign-extension) "

---

"The key feature of this architecture ((Hwacha)) is the vector length register (VLR), which represents the number of vector elements that will be processed by the vector instructions, up to the hardware vector length (HVL). " -- [5]

(already looked at that, no need to reread)

---

"the 64-bit extension of ARM is a completely new and different ISA called AArch64 which incidentally is a lot more RISC-like than the original ARM" [6]

---

" Table 10.1. PSTATE fields PSTATE fields Description NZCV Condition flags Q Cumulative saturation bit DAIF Exception mask bits SPSel SP selection (EL0 or ELn), not applicable to EL0 E Data endianness (AArch32 only) IL Illegal flag SS Software stepping bit

...

The exception bit mask bits (DAIF) allow the exception events to be masked. The exception is not taken when the bit is set.

    Debug exceptions mask.A

    SError interrupt Process state mask, for example, asynchronous External Abort.I

    IRQ interrupt Process state mask.F

    FIQ interrupt Process state mask.

The SPSel field selects whether the current Exception level Stack Pointer or SP_EL0 should be used. This can be done at any Exception level, except EL0. This is discussed later in the chapter.

The IL field, when set, causes execution of the next instruction to trigger an exception. It is used in illegal execution returns, for example, trying to return to EL2 as AArch64 when it is configured for AArch32.

...

The Software Stepping (SS) bit is covered in Chapter 18 Debug. It is used by debuggers to execute a single instruction and then take a debug exception on the following instruction.

...

The ELR_ELn register is used to store the return address from an exception. The value in this register (actually several registers, as we have seen) is automatically written upon entry to an exception and is written to the PC as one of the effects of executing the ERET instruction used to return from exceptions. " -- https://developer.arm.com/docs/den0024/latest/10-aarch64-exception-handling/101-exception-handling-registers ... ELR_ELn contains the return address which is preferred for the specific exception type. For some exceptions, this is the address of the next instruction after the one which generated the exception. For example, when an SVC (system call) instruction is executed, we simply wish to return to the following instruction in the application. In other cases, we may wish to re-execute the instruction that generated the exception.

...

In addition to the SPSR and ELR registers, each Exception level has its own dedicated Stack Pointer register. These are named SP_EL0, SP_EL1, SP_EL2 and SP_EL3.

" -- https://developer.arm.com/docs/den0024/latest/10-aarch64-exception-handling/101-exception-handling-registers

In AArch64, exceptions may be either synchronous, or asynchronous...Sources of asynchronous exceptions are IRQ (normal priority interrupt), ... "

"When taking an exception, the processor state is stored in the relevant Saved Program Status Register (SPSR), in a similar way to the CPSR in ARMv7. The SPSR holds the value of PSTATE before taking an exception and is used to restore the value of PSTATE when executing an exception return. " [7]

"Cumulative saturation bit. This bit is set to 1 to indicate that an Advanced SIMD integer operation has saturated since a 0 was last written to this bit."

---

modeless 732 days ago [-]

Looks cool! Disappointed that there's no option to trap on integer overflow. Languages don't support it because processors don't support it, and processors don't support it because languages don't require it; a vicious cycle that someone needs to break.

...

renox 732 days ago [-]

> The dirty secret is: nobody cares because the ISA doesn't matter.

I disagree!! While normal operations don't matter, think about features like 'trap on integer overflow', if it was widespread in the popular ISAs we would have language which would use this semantic and as a result less bugs.

bsder 731 days ago [-]

> While normal operations don't matter, think about features like 'trap on integer overflow', if it was widespread in the popular ISAs we would have language which would use this semantic and as a result less bugs.

You need to study history, son. :)

All of the ISA's from the 70's and early 80's HAD an overflow feature. It was wiped out when we jumped to 32-bit architectures because overflow was so much less common.

...

 renox 728 days ago [-]

> All of the ISA's from the 70's and early 80's HAD an overflow feature.

All? The C was developed for the PDP11 which doesn't seem to have 'trap on overflow integer operations' like the MIPS has.

bsder 726 days ago [-]

Yeah, all. Or so close as to be indistinguishable from all.

If I'm really being pedantic, overflow detection was probably more prevalent than two's complement arithmetic at one point.

Directly from the PDP11 ISA:

BVC/BVS -- branch on overflow set/clear

Y'all need to go refresh your memory/study some history before arguing about this more. :)

renox 722 days ago [-]

And you'd better read before replying: I said trap on integer overflow, not branch on integer overflow. The former gives you overflow check 'for free', the latter reduce the code density, which impact the instruction cache, which can reduce the performance. And in the early days, performance was above everything else, CPU being so slow..

---

gioele 732 days ago [-]

> I never think "Gee, I wish I had a better ISA".

After programming with AltiVec?, going back to MMX/SSE made me wish I had a better ISA.

---

http://nedbatchelder.com/blog/200804/the_structure_of_pyc_files.html

---

" magic b3f20d0a moddate 8a9efc47 (Wed Apr 09 06:46:34 2008) code argcount 0 nlocals 0 stacksize 2 flags 0040 code 6404005c02005a00005a0100650000700700016501006f0d000164020047 65000047486e01000164030053 1 0 LOAD_CONST 4 ((1, 0)) 3 UNPACK_SEQUENCE 2 6 STORE_NAME 0 (a) 9 STORE_NAME 1 (b)

      2          12 LOAD_NAME                0 (a)
                 15 JUMP_IF_TRUE             7 (to 25)
                 18 POP_TOP
                 19 LOAD_NAME                1 (b)
                 22 JUMP_IF_FALSE           13 (to 38)
            >>   25 POP_TOP

      3          26 LOAD_CONST               2 ('Hello')
                 29 PRINT_ITEM
                 30 LOAD_NAME                0 (a)
                 33 PRINT_ITEM
                 34 PRINT_NEWLINE
                 35 JUMP_FORWARD             1 (to 39)
            >>   38 POP_TOP
            >>   39 LOAD_CONST               3 (None)
                 42 RETURN_VALUE
       consts
          1
          0
          'Hello'
          None
          (1, 0)
       names ('a', 'b')
       varnames ()
       freevars ()
       cellvars ()
       filename 'C:\\ned\\sample.py'
       name '<module>'
       firstlineno 1
       lnotab 0c010e01" -- http://nedbatchelder.com/blog/200804/the_structure_of_pyc_files.html

the flags are defined in http://svn.python.org/projects/python/tags/r31/Include/code.h :

/* Masks for co_flags above */
#define CO_OPTIMIZED	0x0001
#define CO_NEWLOCALS	0x0002
#define CO_VARARGS	0x0004
#define CO_VARKEYWORDS	0x0008
#define CO_NESTED       0x0010
#define CO_GENERATOR    0x0020
/* The CO_NOFREE flag is set if there are no free or cell variables.
   This information is redundant, but it allows a single flag test
   to determine whether there is any extra work to be done when the
   call frame it setup.
*/
#define CO_NOFREE       0x0040

dunno what those all mean though..

---

should we add the following field to the header, right after 'public key'?:

32 bytes: cryptographic 'signature' of this module; that is, SHA-256 checksum of this module (except with this entry replaced by 0), encrypted by the RSA private key corresponding to the public key in the previous entry (todo uh dont we have to add padding? i guess it's more complicated than what i wrote here.. what's the relevant standard? RSASSA-PKCS1-v1_5 PKCS #1 version 1.5 or the newer RSASSA-PSS? see http://crypto.stackexchange.com/questions/3850/is-rsassa-pkcs1-v1-5-a-good-signature-scheme-for-new-systems , see https://www.ietf.org/rfc/rfc3447.txt section 'encodings'; i don't even know how many bytes are needed here; http://crypto.stackexchange.com/a/3508 says 256 and http://stackoverflow.com/a/2118336/171761 says 128; that's actually getting larger than i'd like, mb put this into the package? see also https://tools.ietf.org/html/rfc6485 ), or if the previous entry is 0, then this is the unsigned SHA-256 checksum of the rest of this module, or if this entry is 0, then this module is not signed and no checksum has been computed

https://www.cryptsoft.com/pkcs11doc/v220/group__SEC__12__1__15__PKCS____1__RSA__PSS__SIGNATURE__WITH__SHA__1____SHA__256____SHA__384__OR__SHA__512.html says the output size is the RSA modulus length?

and then ethereum seems to use something else again (ecdsa?): http://ethereum.stackexchange.com/questions/710/how-can-i-verify-a-cryptographic-signature-that-was-produced-by-an-ethereum-addr

which is listed in https://en.wikipedia.org/wiki/Cryptography_standards#Digital_signature_standards

so some contenders are:

PKCS #1 variants: RSASSA-PSS RSASSA-PKCS1-v1_5

ecdsa

see http://crypto.stackexchange.com/questions/930/how-do-other-non-rsa-algorithms-compare-to-the-pkcs-1-standard

http://crypto.stackexchange.com/a/938 suggests that RSA is fastest for verification, comparing 2048 bit RSA (so 256 byte signature) and 224 bit ecdsa

http://crypto.stackexchange.com/questions/12299/ecc-key-size-and-signature-size

http://crypto.stackexchange.com/questions/25818/the-difference-in-size-between-ecdsa-output-and-hash-size

https://www.eldos.com/forum/read.php?FID=7&TID=2216

https://blog.cloudflare.com/ecdsa-the-digital-signature-algorithm-of-a-better-internet/

http://crypto.stackexchange.com/questions/3216/signatures-rsa-compared-to-ecdsa

https://en.wikipedia.org/wiki/Elliptic_Curve_Digital_Signature_Algorithm#Key_and_signature-size_comparison_to_DSA says that the sizes are about the same for ecdsa and DSA (not RSA)

http://crypto.stackexchange.com/a/3218 says that ECDSA has smaller keys but RSA is faster to verify, which is what http://crypto.stackexchange.com/a/3218 said too. I think verification speed is more important for us, to speed up startup time when the implementation is paranoid enough to verify all the sigs. We can always have an embedded profile later that leaves out the crypto.

https://arxiv.org/ftp/arxiv/papers/1508/1508.00184.pdf

also apparently none of these are post-quantum: https://pqcrypto.org/ http://blogs.cisco.com/security/cisco-next-generation-encryption-and-postquantum-cryptography . Yuck, maybe we need a crypto profile byte in the format? I think i'll still resist that and insist that the crypto profile is specified with OotB? version.

i'm holding off for now because i don't even understand how many bytes it should be; we can put this stuff in the package, i guess?

even if the whole signature is not in here, should we put the low-order bits in here?

---

ok, based on the above, my best guess is that we want to use RSA2048, which yields 256 byte signatures.

---

http://legacy.python.org/dev/peps/pep-0480/ by the TUF guys suggests Ed25519. "A pure-Python implementation [15] of the Ed25519 signature scheme is available. Verification of Ed25519 signatures is fast even when performed in Python.". It appears to be faster than RSA for signature verification and has 32-byte keys and 64-byte signatures, while being immune to cache-timing attacks, branch-prediction side-channel attacks, and having a 2^128 security target (the side-channel immunity is not in the pure-Python implementation though).

OpenSSH? likes it; from their annoucement of deprecation of DSA ("DSS") keys: " Your best option is to generate new keys using newer types such as rsa +or ecdsa or ed25519. RSA keys will give you the greatest portability +with other clients/servers while ed25519 will get you the best security +with OpenSSH? (but requires recent versions of client & server "

---

ok, so now i like Ed25519.

removed this from the file format:

32 bytes: SHA-256 checksum of public key used to sign this file

---

many languages have 3 levels:

modules, which are namespaces and other internaly stuff within a package
a middle thingee, which is a bunch of modules and possibly other resources that are meant to always be together (Cargo crates; .NET assemblies)
packages, which is what the package manager is concerned with (eg Python pip; Rust cargo; .NET NuGet?; Java Maven/Ant/POM/Central Repository)

ideally we'd collapse that into one, to make things simple. However, i don't think we can fully collapse the 'package' level with the others at this time, because:

1) even if there is one canonical package manager integrated with the language, the package manager has many concerns outside of the language proper: interacting with platform package managers and installation conventions running scripts at install time communicating over the network with package repos, downloading package lists and packages crypto stuff involving remote repos, like maintaining lists of repo keys and possibly revocations published by the repo coordinated installation with or plugging into components written in other languages

2) package managers are rapidly evolving and so there is sometimes more than one popular system within a single language (witness Python); so if we bake it in to the bytecode file format, we probably won't get it right

so let's try to at least collapse the first two levels. This means that the Oot Bytecode module file format should contain the metadata needed to execute a pure Oot Bytecode project, even if this project involves multiple modules.

---

here's how Python does source maps:

" firstlineno 1 lnotab 0c010e01

The lnotab bytes are pairs of small integers, so this entry represents:

    0c 01: (12, 1)
    0e 01: (14, 1)

The two numbers in each pair are a bytecode offset delta and a source line number delta. The firstlineno value of 1 means that the bytecode at offset zero is line number 1. Each entry in the lnotab is then a delta to the bytecode offset and a delta to the line number to get to the next line. So bytecode offset 12 is line number 2, and bytecode offset 26 (12+14) is line number 3. The line numbers at the left of the disassembled bytecode are computed this way from firstlineno and lnotab. " -- [8]

---

.pyc files have the filepath to the source file. Do we want that?

http://bugs.python.org/issue1180193

i'm going to say no for now.. as that bug shows, it's better if the implementation is told to run the source file and then finds the cached bytecode file and remembers where the source file was. And if the bytecode is distributed independently then we don't need to show where the source file was on someone else's computer.

---

source maps: https://docs.google.com/document/d/1U1RGAehQwRypUTovF1KRlpiOFze0b-_2gc6fAH0KY0k/edit?pli=1#

note: supports 'dynamic source maps' eg for generated code that isnt in a file somewhere, you can include the source code itself in the sourcemap

---

source maps support finer granularity than just the line numbers given in Python, so i guess use something like those.

the source map standard looks good but since we're binary we can do better.

we can do binary instead of base64
we can probably do even better by using bits to say which fields are present in each segment instead of using a whole byte (';') as a delimiter. Also, since things like lines and columns are delta-encoded, i doubt that VLQ, which has a minimum size of 1 byte, is optimal. We could probably use a variant of VLQ with smaller bitfields.
probably keep the 32-bit limit of source maps
compress with gzip (the source map standard supports this too)

not sure if this complexity is worth it for V1 though. Mb for V1 just use a JSON with the mapping fields directly!

---

Rust Cargo has a .lock file, like that Ruby thing. Also i've read elsewhere that a solution to dependency hell is for an executable to remember which actual library versions it last successfully ran with.

So we should probably add an optional section for that (namely, a list of exact versions of dependencies that are known to work together) in the module file.

---

Rust Crates and Modules have the following attributes:

" 6.3.1 Crate-only attributes

    crate_name - specify the crate's crate name.
    crate_type - see linkage.
    feature - see compiler features.
    no_builtins - disable optimizing certain code patterns to invocations of library functions that are assumed to exist
    no_main - disable emitting the main symbol. Useful when some other object being linked to defines main.
    no_start - disable linking to the native crate, which specifies the "start" language item.
    no_std - disable linking to the std crate.
    plugin - load a list of named crates as compiler plugins, e.g. #![plugin(foo, bar)]. Optional arguments for each plugin, i.e. #![plugin(foo(... args ...))], are provided to the plugin's registrar function. The plugin feature gate is required to use this attribute.
    recursion_limit - Sets the maximum depth for potentially infinitely-recursive compile-time operations like auto-dereference or macro expansion. The default is #![recursion_limit="64"].

6.3.2 Module-only attributes

    no_implicit_prelude - disable injecting use std::prelude::* in this module.
    path - specifies the file to load the module from. #[path="foo.rs"] mod bar; is equivalent to mod bar { /* contents of foo.rs */ }. The path is taken relative to the directory that the current module is in." -- [9]

---

Often in low-level or minimalist systems, i see constraints on fundamental composite data structures.

the common ones:

acyclic (ie you can't make a cyclic data structure)
flat (ie you can have a composite whose items are scalars (eg a list of integers), but not a composite whose items are other composites (eg a list of lists))
homogeneous lists (ie you can't have lists whose items are of different types)

---

capability ideas:

Each 'pointer' (all pointers, or only the boxed ones?) is actually a tuple (address space, pointer/cursor within address space, capability). Here are the capability attributes:

'sealed' bit
if sealed or if 'u' permission: 'seal key' of the seal (can anyone seal to anyone else or is that restricted?)
permissions (applied to everything directly in the accessible address subspace 'inside' this 'pointer'):
- r: read
- w: write
- x: execute
- u: unseal
dict of custom permissions (keys are arbitrary names (like ruby :symbols; compiled into pairs (module from which this symbol was imported, symbol id in that module); values are bools)
accessible bounds within address space (it is an error to dereference a cursor outside of these bounds, but cursors may go outside of them as long as they are not dereferenced)
copy restrictions FOR THE POINTER not the stuff inside: one of
- 0: copying (outside of here) not allowed
- 1: copying always allowed
- a list of pairs ('r'

note: a pointer can always be copied WITHIN ITS DIRECTLY CONTAINING ADDRESS SUBSPACE

'w', pointer2

*); copying is allowed iff (a) there is some entry in the list whose permission is 'r' and whose value is either '*' or the direct parent of the pointer, or the pointer is in a register, and (b) the pointer is being copied to a register, or into some address space given by some pointer2 in the list whose permission was 'w'

note that a 'sealed' pointer gives no capabilities until it is unsealed, except via SealedCall?. This is useful for passing capabilities through untrusted middlemen without allowing those middlemen to use the capabilities. SealedCall? (see below) is useful when you want to give some party A the permission to execute code C, and you want that code C to have some permissions when A calls it, but you don't want to give those permissions to the caller A.

Even if a pointer doesn't give you the capability to do something to its target, you might have that capability from elsewhere for the same target. Subaddress spaces can be placed in an 'accessVia' restriction; if accessVia(X, Y), then any capabilities you have on Y can be applied to X (todo how is this done? capability registers? Data Capability Register and Program Counter Capability Register? Or an abstract 'capability set' composite data structure?).

note that the capability stuff can be narrowed but not widened by anyone with 'w' access to the 'pointer', eg you can remove 'x' access but you can't add 'x' access, you can further restrict copying but you can't widen copying, etc. If you want to pass a pointer through an untrusted middleman but then let the end recipient have more access, seal the pointer with more access, then place it inside a namespace, and put restrictions on that namespace, then send a pointer to that namespace with unseal permission to the end recipient through a trusted route (what if there is no trusted route? could add some way to 'address' the permissions, similar to public key encryption (actual public key encryption could be used when you want to serialize capabilities to disk); but the recipient could already just send you a namespace to which only they have 'r' access and to which you have 'w' access, allowing you to send 'em stuff securely).

there are two special 'capability registers', CAP and XCAP. Loads and stores are governed by CAP and branches and jumps are governed by XCAP. There are three ways to use a capability to load/store/jump to something in a subaddress space; (1) copy a capability which directly points to the desired subaddress space into CAP (for any permission except 'x') or XCAP (for 'x' permission), (2) copy a capability which points to a subaddress space which can Access the desired subaddress space into CAP or XCAP; (3) when you dereference a pointer you gain the capabilities on its subaddress space provided by that pointer (and pointers generated from this pointer inherit these capabilities).

need instructions to:

seal and unseal (can only unseal if you have a capability with 's' permission on a namespace with the item, whose 'seal key' matches the 'seal key' of the seal)
SealedCall?; jump to a location to inside a 'subroutine' namedspace, presumably the namespace has 'x' permission, and passing in a sealed 'data' namespace, under the constraint that the seal key of the subroutine namespace matches the seal key of the 'data' namespace; this gives the subroutine access to the data (according to the permissions on the 'data' namespace) even though you don't have access to it.

note that the fact that the above applies to the stuff inside each capability pointer, as if it were a memory region, implies that these subaddress spaces are more struct-like than list-like. So they are looking a lot like OOP objects.

---

i guess we must have 'capabilities' in primitive Oot, because the concept of following a pointer is implicit in the indirect address mode, and probably some of the auxiliary addr modes, and if you can opt to follow pointers while ignoring the capabilities then you could escape the capability restrictions

even if we went back to LOAD/STORE, the primitive LOAD and STORE commands would have to respect capabilities.

later: but mb we can do it all via subaddress space getters and setters, except for branching/jumping/'x'.

---

TODO Let's say you have a 'fat capability pointer' to some namespace x. Within x is a 'fat capability pointer' to some namespace y. The 'y' pointer includes Write capability, but the 'y' pointer does not. Clearly, you can't do x.y = z, because you don't have Write on x. But can you do x.y.1 = 3? Should we allow x's capabilities to allow or forbid this (invent 'transitive narrowing' and 'transitive expanding' capabilities, to allow x to say "whatever you can't do on me, you can't do on my children either, even if they would permit it" or "whatever you can do on me, you can do on my children too, even if they wouldn't permit it).

hmm, this is tough. Linear address space assembly fat capability pointers are more expressive here, because you could have a r/w pointer (all pointers fat capabilities) to a memory region from 1 to 1000, with a read-only pointer at 3 to 100, and read-only pointer at 4 to 5000; but since you already have a r/w pointer to 1-1000, you can still follow the pointer at 3 to 100 and then use the original pointer to gain write access, but this doesn't apply evenly to all 'children' of 3, eg you really do have only read-only access to 5000.

I suppose you could abstractly model that by having each 'namespace' optionally be in a hierarchy with the others (you might call all these namespaces 'regions'), eg call the original pointer 'x' and call 100 'y' and call 5000 'z', now we have x.3 == y and x.4 == z; now designate 'y' to be 'contained in' 'x'. Because y is 'contained in' x, any privs you have on x also apply to y. Note that x.3 might is pointing to y right now, but later on someone might reassign it x.3 = z, so it's not the pointer x.3 that is 'contained in' x, it's subaddress space y.

With linear memory you can see what's contained in what via arithmetic, but abstractly i guess you'd have to query it with some isContained operation. We can probably change the name from 'contained' to something less geometric and more easily memorable w/r/t security. 'Subordinate to'? 'Administrated by?' 'Accessable via'? 'Accessible to'? 'Accessed by'? 'AccessVia?'? just 'Access'?

Also, under that system, there would be additional value to being able to dereference fat capability pointers without gaining their capabilities (eg you can only deref if you already have access to the target via some other means). (also, how different is that from sealing? i guess sealing refers to THIS pointer, whereas what i am talking about refers to child pointers)

OK, so above i added:

'c' permission
accessVia relation between subaddress spaces

Still need a way to specify which capabilities are being used, or more abstractly, currently being 'held' for accessVia.

todo, we need a data capability register and a PC capability register

---

removed 'use child capabilities', moved to here:

c: use child capabilities

'use child capabilities' means that if there are pointer with capabilities in the subaddress space given by this pointer, you can use those capabilities; if you don't have that permission, then those appear to you as pointers with no capabilities (and if you copy them, the copies will be ordinary pointers with no capabilities, not pointers with merely unusable capabilities). (todo is this too much complication for its usefulness? this means that we can't allow ordinary copying of capabilities, even opaquely)

the reason i removed it is that, as i noted, it's too much complication for a small gain. If you want to pass someone some memory with capability-less pointers, just put capability-less pointers in there. This permission would have enabled those who have 'c' permission to see the capabilities where others see capability-less pointers.

The real problem is that i am thinking about not having capabilities in the primitive machine, and just having them in boxing and in the subaddress space getters and setters (well, and maybe in branching?). This would make implementation of the primitives easier. But it also means that without boxing, someone could still copy any boxed capability. So they could copy them into some other namespace in which they DO have 'c' permission.

---

CAP and XCAP registers seem allowable to me because, although in some sense they are providing hidden inputs to everything and hence are a hidden dependency, all this dependency does is determine whether or not an error is raised, not what the output is. Of course you could do computation using that as a 'weird machine'. But under the assumption that there are no access violations, they aren't affecting the computation.

---

thinking more about 'c: use child capabilities'. I think it WOULD be possible, by making a 'c'-less capability pointer just provide a special view on the address space in which capabilities are replaced by their plain pointers, provided that:

ordinary pointers have NO capabilities, rather than ALL capabilities
you make it so that if you copy a child capability, the copy is just an ordinary, capability-less pointer; OR that it is an ordinary pointer. But this would greatly restrict its use; eg it prevents an unpriviliged allocator from moving a block of stuff including capabilities. You could say that it becomes an ordinary pointer UNLESS the copy is within the same subaddress space (btw this probably shouldn't be a special instruction if you don't want to make eg allocators have to know to use weird instructions). But with an unboxing model in which the unboxing code does NOT get to wrap the entire instruction, just provide an unboxed input to the instruction, how would this check be made?

Similar or greater problems plague the 'copy restrictions FOR THE POINTER not the stuff inside' part. If the subaddress spaces are doing the work of capabilities, then an ordinary CPY with no boxing can copy a capability without checking capabilities.

Both these things only seem to help for:

confused deputies. Eg if you are worried about the correctness of your memory allocator, it would be nice to add an extra layer of protection in case they try to copy a capability to someone who shouldn't have it.
performance. Eg if 'c' just turned things into plain pointers, and if you have no need of deputies who can pass this to someone else and then have the pointers turn back into capabilities for the someone else, then you could emulate 'c' just by actually creating a full copy of the subaddress space in which capabilities were replaced by plain pointers.

Note also that sealing already helps a little with the confused deputies; your confused memory allocator can be trusted with sealed capabilities just fine.

So it seems to me that we should consider dropping these. Also, rwx+sealing seems like an obvious choice for the 'core' of this stuff. The fact that adding this other stuff in is a little complex (i've come back on multiple days rethinking various related details of the other stuff), while this putative core stuff doesn't seem to be causing so much redesign, suggests that the putative core stuff really is more core (although that could just be due to my familiarity of thinking in terms of r/w/x).

Also, regarding capability registers, i think we do need some facility for holding a set of multiple capabilities in the 'register' at once. The reason is, let's say you input a capability into a custom instruction. According to our unboxing model, the capability is unboxed at input and all the custom instruction gets is a plain instruction; the point of this is to allow custom instruction implementations to be written without thinking about boxing. Within the custom instruction, we might call some other subroutine which makes use of some different capability; but that subroutine may still need access to the original input capability, too. But it can't even know about that original one, because it was kept ignorant of that by unboxing. So, the obvious solution is to allow unboxing to wrap the instruction 'a little bit', namely, to remember which capabilities were given to the instruction for the dynamic extent of the execution of that instruction; and to make this composable, this is transparently unioned with any other capabilities that will be acquired down the line.

See the following section for the new model.

---

(copied to ootAssemblyThoughts)

'sealed' bit
if sealed or if 'u' permission: 'seal key' of the seal (can anyone seal to any key or is that restricted?)
permissions (applied to everything directly in the accessible address subspace 'inside' this 'pointer'):
- r: read
- w: write
- x: execute
- u: unseal
dict of custom permissions (keys are arbitrary names (like ruby :symbols; compiled into pairs (module from which this symbol was imported, symbol id in that module); values are bools)
plain pointer to subaddress space

There is a 'capability register' CAP, which holds a CAPABILITY SET, not a single capability. Unboxing is capability-aware in the following sense; upon unboxing a capabilty input, the current value of CAP is saved on an internal stack, and CAP is replaced by CAP unioned with the new capability; upon the end of this instruction, however, the internal stack is popped and the previous CAP is restored.

All inter-subaddress-space reads must check the 'r' capability, all inter-subaddress-space writes must check the 'w' capability, all inter-subaddress-space jumps and branches must check the 'x' capability.

Array bounds are represented by the subaddress space, not by the capability; so narrowing such bounds is done by creating a new subaddress space (which is a view onto a subset of an existing one).

There is a transitive 'Access' relation on subaddress spaces; if one subaddress space X 'has Access to' or 'Access'es another one Y, then if you have a capability on X it applies to Y too (in order to be able to compute transitivity, this implies that the internal representation of each address space stores a list of all other address spaces that it Accesses).

---

so i guess the way that the send/recv primitives could work is:

future = send(channel, msg)
future = recv(channel)

and there could be a convention that upon program start, a certain register holds a pointer to a subaddress space that is an array of channels. And/or, we could have OPEN and CLOSE.

a problem with that is that i was hoping to not bake futures/promises into the primitives, instead building those on top.

we could do it with callbacks:

send(channel, msg, callback)
recv(channel, dest, callback) (note that 'dest' is written to also; should 'dest' be the first argument? channel is mutated though since a msg is consumed)

(note that passing 0 in the callback here means 'dont bother calling back')

or we could have a mmapped 'completion port':

send(channel, msg, completion_addr)
recv(channel, msg, completion_addr)

The program should pre-write a '1' to the completion_addr; completion_addr would have a '0' written to it when the send or recv completed successfully, and a negative number if it fails. (note that here all 3 arguments to recv are mutated).

I supposed out of the three alternatives (future, callback, completion_addr), callback is the most expressive; because the callback could set the completion port, and a future can be built out of callbacks (eg by having the callback update the future's state). Callbacks are a little confusing because you may have to get into exactly when the callback is allowed to execute (eg you have to think about things similar to Javascript microtasks), but i think in our case we should just say "they can execute whenever' because that's the most general.

so i guess callbacks are the way to go.

later, when building futures on top, could look at how Rust did it: http://aturon.github.io/blog/2016/08/11/futures/

i should we should include OPEN and CLOSE in the primitives too.

regarding POLL: what about providing an expression to match the message against, like Erlang? What about listening to multiple channels at once? Or is that higher-level? I guess it is; (a) having a selector expression to match against can be pseudo-emulated by just taking all the messages and filtering them; i say 'pseudo' because that cannot be made as efficient as a platform primitive that lets the platform only request certain messages, and (b) similiarly to multiple channels, you can emulate this by round-robin polling each channel with a timeout although again that cannot be made as efficient as a platform primitive that takes a list of channels and sleeps until one is free. But both of these would require higher-level data structures, which means choosing representations (for the selector expression, or for the list of channels), which we'd prefer to avoid here; and both of them can be built in the inefficient way out of primitives but then overridden and implemented properly by implementations.

---

(probably not a very good) capabilities idea: a way to prevent a process from getting 'x' permission to newly malloc'd memory? Perhaps you have examined code and verified that it is safe to execute, but you cannot verify that it cannot write unsafe code and then 'eval' it. This would allow you to trust that a given piece of code can't transfer control to code it writes. This is difficult as it is a global, rather than memory-region-based permission. And it may not be too important, because it only helps with confused deputies (you sorta trust code but you're worried it may have bugs that an attacker can exploit to cause it to write attacker code into a page and then transfer control to it) or with code verification (you can mostly verify some untrusted code but you can't quite tell if it might transfer control to memory it wrote into).

---

should you need 'seal' permission to seal to an address?

---

ok i decided to simplify things by saying capabilities can only be stored in special address subspaces created by 'capmalloc'. That way you don't have to tag every space in memory to mark if they are a capability or not.

Once doing this, i saw a simpler way to get some of what could be achieved by having permissions to read and write capabilities -- just have a permission for whether you are allowed to call CAPMALLOC. So, say you have enough 'god's-eye view' to know for a fact that some untrusted process X doesn't have access to any capmalloc'd memory. So deny them CAPMALLOC permission when you run them, and now they have no way to pass on their capabilities to other processes (because even if they have some capabilities in their CAP register, there's nowhere they can copy them into).

So, i added the 'c' permission, which is per-process, not per-subaddress space.

Now that we broke the seal on per-process permissions, i went ahead and added the 'allocate executable memory' permission discussed above. And i also added an 'm' capability for whether you can call MALLOC at all.

---

(this comment was moved from another file, it actually predates the last few comments) The capability stuff seems to go well with our assigning different behaviors to different subaddress spaces (eg if boxing is based on address space type). We can just have the range of a capability be co-extensive with a subaddress spaces (eg instead of a 'fat pointer' that specifies array bounds along with the capability, just make the granularity of a capability equal to an address subspace). In fact the addr space pointers could be the capabilities. (but don't you want to be able to narrow it, eg to start with a pointer bounded between 100 and 200, and then narrow its bounds to between 100 and 110? Yes, so we should have an operation that takes an address subspace and creates a new pointer to some subset of it, eg another view onto the same address subspace).

---