Y'know, after the addition of the capabilities instructions and the various other complexities of OotB?, i think OotB? is just getting too complicated. It's no longer an 'implement this on your platform in one afternoon' kind of thing.
So let's subset it yet again. SootB? is a simple stack-machine subset of OotB?.
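Just to pin down what 'implement in one afternoon' should feel like, here's a minimal sketch of a stack-machine interpreter in Python. The opcodes (PUSH, ADD, DUP, JZ, HALT) are hypothetical placeholders, not the actual SootB? instruction set:

    # minimal stack-machine sketch; opcodes here are made up, not SootB?'s
    def run(program):
        stack, pc = [], 0
        while True:
            op = program[pc]
            if op == 'HALT':
                return stack
            elif op == 'PUSH':
                pc += 1
                stack.append(program[pc])      # operand follows the opcode
            elif op == 'ADD':
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
            elif op == 'DUP':
                stack.append(stack[-1])
            elif op == 'JZ':                   # absolute jump if top of stack is zero
                pc += 1
                if stack.pop() == 0:
                    pc = program[pc]
                    continue
            pc += 1

    # e.g. run(['PUSH', 2, 'PUSH', 3, 'ADD', 'HALT']) -> [5]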
---
note that OotB?, rotated, spells "Boot", which is appropriate.
---
we should have a compiler from OotB? to SootB?, not just an interpreter.
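Such a compiler could be quite mechanical. A hedged sketch (it assumes OotB? instructions are roughly three-address tuples like ('add', dest, src1, src2); LOADLOCAL/STORELOCAL are hypothetical stack-op names, not actual SootB? ops):

    # lower register-style ops to stack-style ops, one instruction at a time
    def lower(ootb_ops):
        out = []
        for op, dest, a, b in ootb_ops:
            out.append(('LOADLOCAL', a))    # push first operand
            out.append(('LOADLOCAL', b))    # push second operand
            out.append((op.upper(),))       # e.g. ADD pops two, pushes result
            out.append(('STORELOCAL', dest))
        return out

    # lower([('add', 0, 1, 2)]) ->
    #   [('LOADLOCAL', 1), ('LOADLOCAL', 2), ('ADD',), ('STORELOCAL', 0)]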
---
The need for SootB? calls into question the value of OotB? at all (as more than an implementation detail). If we are already compiling Oot Core into OotB?, and OotB? into SootB?, then why not just compile Oot Core into SootB?? Recall that two of the main motivations for a separate OotB? were (a) to make it dead easy to bootstrap on a new platform, (b) to have a separate layer of services (like GC, greenthreads, etc) underneath Oot Core. (and of course there are also the motivations of execution via a bytecode interpreter runtime, and of having a platform-independent pre-compiled object format for storage of code, but these things are implementation details).
But now we have SootB?. And some non-trivial OotB? services (eg capabilities; forking) are being implemented in SootB?.
If we have SootB? then there's (a) already. As for (b), why not just implement the services themselves in Oot Core and then compile them from Oot Core directly into SootB?? Perhaps this compilation might pass through OotB? as an implementation detail, perhaps not.
This would require Oot Core having a subset and a 'mode' in which the services are not relied upon. That seems doable. Like RPython; exceptions actually act differently in Python and in RPython (in some cases, in RPython, an exception is simply not generated at all if there is no exception handler matching it).
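For example (plain Python, just to illustrate the kind of semantic difference meant here; this is not actual RPython code):

    def lookup(d, k):
        try:
            return d[k]          # handler present: same observable behavior
        except KeyError:         # in both the full language and the subset
            return None

    def lookup_unchecked(d, k):
        return d[k]              # no matching handler anywhere: full Python still
                                 # constructs and propagates a KeyError; a restricted
                                 # 'mode' could be allowed to treat this as a plain
                                 # fatal error instead of building the exception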
OotB? might still exist (as an implementation detail in the compiler from restricted Oot Core to SootB?, and in the module file format), and thinking about it in detail will still inform the semantics of this Oot Core 'mode'. But it may not be something that either Oot users or porters have to think about.
This is disappointing to me because i've developed an affection for OotB?. But at the same time i have been feeling that i've been placing too much importance on what should be implementation details, and engaging in premature optimization. How many addressing modes should OotB? really have? Should it have indirect offset? Should it have indirect, or instead have LOAD/STORE? I dunno, that stuff is an optimization. Also, i feel that having three layers (OotB?, Oot Core, Oot) is already pushing it, and having 4 layers is too much (as an implementation detail that would be fine, just not as part of the main design).
So if i decide that what i've written above is correct, then maybe OotB? is dead. But it will live on in the form of a fairly close mapping between restricted Oot Core, and OotB?, at least initially.
There are three hitches.
First, some of these services (like capabilities and greenthreads) really do seem like they have to run 'beneath' ordinary Oot Core. If the non-primitive 'custom functions' in the OotB? stdlib are to be written in a manner oblivious to capabilities and greenthreads, then these implementations may have to run 'on top of' a lower-level execution environment. Perhaps we could come up with some metaprogrammy stuff in Oot Core to express this?
Second, a huge benefit of OotB? was the incrementality of porting. The porter initially only has to implement the primitives, and then they can add in platform-native implementations of the stdlib custom instructions one by one, and see an increase in execution speed each time they do so. We would want to preserve this if we switch to Restricted Oot Core, perhaps by having an FFI and exposing an interface where the porter can register a native function as replacing a library function; but note that we want to be able to INLINE the custom instructions. In addition, perhaps the implementation of some custom instructions is best done by changing the interpreter in a way more global than can be expressed by giving a single function to be inlined. Again, perhaps we could come up with some metaprogrammy stuff to express this in Oot Core. We don't want to tell the porter that they have to write an entire parser for Restricted Oot Core before they can start replacing the custom instructions with platform-native implementations; that's one of the issues i have with an RPython approach.
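To make the wish concrete, a hedged sketch of the kind of porter-facing interface meant here (all names are hypothetical, and it deliberately ignores the inlining problem just mentioned):

    # registry of 'custom instructions': portable defaults, overridable one by one
    class CustomOps:
        def __init__(self, portable_impls):
            self.impls = dict(portable_impls)   # name -> callable (portable fallbacks)

        def register_native(self, name, fn):
            # porter swaps in a platform-native implementation for one op
            self.impls[name] = fn

        def call(self, name, *args):
            return self.impls[name](*args)

    # a porter starts with only the primitives done, then incrementally does e.g.
    #   ops.register_native('string-concat', my_fast_native_concat)
    # Note that a plain callable registry like this cannot express "replace this op
    # with code the interpreter inlines", which is exactly why something more
    # metaprogrammy would be needed.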
Third, what if we want part of Core Oot to not be very imperative, like eg F-lite or Reduceron Template Language?
---
fantom:
" We built Fantom from the ground up to tackle portability between these VMs. Fantom's source language compiles into fcode - a bytecode representation that can be translated into both Java bytecode and IL easily. This translation is typically done at runtime, which enables you to deploy Fantom modules as a single file and have them run on either VM.
But getting a language to run on both Java and .NET is the easy part - in fact there are many solutions to this problem. The hard part is getting portable APIs. Fantom provides a set of APIs which abstract away the Java and .NET APIs. We actually consider this one of Fantom's primary benefits, because it gives us a chance to develop a suite of system APIs that are elegant and easy to use compared to the Java and .NET counter parts. "
---
i dunno, maybe having Oot Core (and maybe even Root Core) and OotB? and SootB? is okay. They do each seem to serve a function:
GHC has a buncha steps:
Haskell AST -> Core -> STG -> (C--/Cmm or LLVM or native code)
Clearly, Haskell ~= Oot, GHC Core ~= Oot Core, LLVM ~= SootB?. STG is either ~= Root Core or ~= OotB?. Cmm is somewhat like OotB? but not entirely b/c i think Cmm already provides lots of services.
Root Core is like RPython. OotB? is also like Parrot or perhaps Parrot M0.
OotB?/Root Core is like Shen KLambda (Kl)
Could just combine OotB? and SootB?. This is what i was originally thinking, but OotB? is ballooning, esp. with capabilities and the module loading file format.
---
look, if OotB? made interoperability better it would totally be worth it, but who knows if it actually will?
---
again, just go with LLVM instead of SootB?? No, too hard to port: http://llvm.org/docs/WritingAnLLVMBackend.html
WebAssembly?? Maybe, but again, probably too hard: https://webassembly.github.io/
KLambda? Probably too hard: http://shenlanguage.org/Documentation/shendoc.htm#Kl
---
ELF program header types:
" The type may be one of the following. The numbers indicate the value of the keyword.
PT_NULL (0) Indicates an unused program header.
PT_LOAD (1) Indicates that this program header describes a segment to be loaded from the file.
PT_DYNAMIC (2) Indicates a segment where dynamic linking information can be found.
PT_INTERP (3) Indicates a segment where the name of the program interpreter may be found.
PT_NOTE (4) Indicates a segment holding note information.
PT_SHLIB (5) A reserved program header type, defined but not specified by the ELF ABI.
PT_PHDR (6) Indicates a segment where the program headers may be found.
expression An expression giving the numeric type of the program header. This may be used for types not defined above. " -- http://ftp.gnu.org/old-gnu/Manuals/ld-2.9.1/html_node/ld_23.html
" 0=NULL Null segment (optional)
1=LOAD Loadable segment (code, data, bss, etc.)
2=DYNAMIC Dynamic linking information
3=INTERP Path to dynamic linker, e.g. /lib/ld.so
4=NOTE OS-defined
5=SHLIB Reserved. DO NOT USE.
6=PHDR The program header table itself" -- http://geezer.osdevbrasil.net/osd/exec/elf.txt
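To make those tables concrete, a hedged sketch of pulling the program header types out of a 64-bit little-endian ELF file (offsets follow the ELF64 layout; 32-bit and big-endian files, and all error handling, are ignored):

    import struct

    PT_NAMES = {0: 'NULL', 1: 'LOAD', 2: 'DYNAMIC', 3: 'INTERP',
                4: 'NOTE', 5: 'SHLIB', 6: 'PHDR'}

    def program_header_types(path):
        with open(path, 'rb') as f:
            ehdr = f.read(64)                                  # ELF64 header is 64 bytes
            e_phoff = struct.unpack_from('<Q', ehdr, 32)[0]    # offset of program header table
            e_phentsize = struct.unpack_from('<H', ehdr, 54)[0]
            e_phnum = struct.unpack_from('<H', ehdr, 56)[0]
            types = []
            for i in range(e_phnum):
                f.seek(e_phoff + i * e_phentsize)
                p_type = struct.unpack('<I', f.read(4))[0]     # p_type is the first field
                types.append(PT_NAMES.get(p_type, hex(p_type)))
            return types

    # e.g. program_header_types('/bin/ls') might return something like
    # ['PHDR', 'INTERP', 'LOAD', 'LOAD', 'DYNAMIC', 'NOTE', ...]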
---
" I don't mean to imply that there are no new instructions that will ever need to be added! I think popcount is an example that is very painful to handle in software, and the idiom in assembly is probably too big to handle via macro-op fusion. I'm also very excited to eventually have a vector ISA in RISC-V.
Rather, my abstract is fighting against a whole slew of instructions that exist in current ISAs (or that people want to add to RISC-V) that don't need to be there. The load-pair/load-increment types that you see in ARMv8 come to mind. "
cperciva:
Can you recognize and automatically perform macro-op fusion on AES computations? How about SHA256? Ok, let's make it easy -- CRC32?
"saturating adds and multiplies that came with the DSP requirements, or the SIMD extensions that came with graphics"
---
Taniwha:
The thing is the base x86 architecture isn't very CISCy, there's only simple addressing modes (no double indirect modes), no addressing side effects to undo on exceptions (auto increments), only one TLB dip per instruction
(the push instruction is really the only exception)
compared to its compatriots (68k/vax/NSwhatever etc) it was positively RISCy in comparison - a vax instruction could take something like 29 TLB misses, make 29 memory accesses - a 68020 threw a horrible mess of microstate spew on a TLB miss so it could make progress (made unix signals implementation a nightmare, where do you safely put that stuff?)
---
i guess there's no real reason that SootB? would need to be a subset of OotB?. Except that it encourages people to go ahead and extend their shiny new SootB? implementation into a full OotB? implementation. Which is a good reason.
But... there are fewer than 16 core instructions, which means each opcode fits in a 4-bit field. So if we were willing to make LOADK and J consume two instruction fields (i.e. be variable-length), we could have 2 instructions per byte! That's 16 per 64 bits, ~twice as many as our SHORT encoding in OotB?, and less complicated.
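A hedged sketch of that encoding (the opcode numbers are made up, and it assumes the encoder never splits an operand across a word boundary):

    LOADK, J = 1, 2                 # hypothetical opcode numbers

    def decode_word(word):
        # 16 4-bit instruction fields per 64-bit word
        nibbles = [(word >> (4 * i)) & 0xF for i in range(16)]
        out, i = [], 0
        while i < len(nibbles):
            op = nibbles[i]
            if op in (LOADK, J):    # variable-length: operand lives in the next field
                out.append((op, nibbles[i + 1]))
                i += 2
            else:
                out.append((op,))
                i += 1
        return out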
---
And what if we just threw out OotB? and only had SootB?, Root Core, Oot Core, and Oot?
I guess it depends on if OotB? really makes interoperability any easier, or if the implementors can plug into Oot Core and Root Core just as easily.
We'd still need some sort of Root Core and/or Oot Core bytecode serialization, i guess, which may well end up looking a lot like OotB?.. but that would then be an implementation detail.
I feel a little bad about abandoning OotB? after i've grown to like it, but hey, you gotta do what you gotta do. And, as i've said, OotB? may well live on as an implementation detail anyway.