proj-oot-ootAssemblyNotes17

---

l5 has .ethereum

---

" Side note: The .NET framework did break backwards compatibility when moving from 1.0 to 2.0, precisely so that support for generics could be added deep into the runtime, i.e. with support in the IL. "

---

could our LONG format be too complicated? Maybe all we really want is a simple hierarchy with tag-length-value (TLV) nodes?
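To make the idea concrete, here's a minimal sketch of a hierarchical TLV format in Python (not from any existing Oot code; the 1-byte tag and 4-byte big-endian length are arbitrary assumptions):

```python
import struct

def encode_tlv(tag, children_or_value):
    """Encode a node as tag (1 byte), length (4 bytes, big-endian), value.
    Internal nodes hold a list of (tag, value) children; leaves hold bytes."""
    if isinstance(children_or_value, list):
        payload = b"".join(encode_tlv(t, v) for t, v in children_or_value)
    else:
        payload = children_or_value
    return struct.pack(">BI", tag, len(payload)) + payload

def decode_tlv(data):
    """Decode a concatenated sequence of TLV nodes into (tag, bytes) pairs;
    callers recurse on tags they know to be internal nodes."""
    nodes, i = [], 0
    while i < len(data):
        tag, length = struct.unpack_from(">BI", data, i)
        i += 5
        nodes.append((tag, data[i:i + length]))
        i += length
    return nodes
```

The appeal is that the decoder needs no schema to skip over subtrees it doesn't understand, which is exactly the property a simple hierarchy wants.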

---

" RISC-V ISA Overview The RISC-V ISA is defined as a base integer ISA, which must be present in any implementation, plus optional extensions to the base ISA. The base integer ISA is very similar to that of the early RISC processors except with no branch delay slots and with support for optional variable-length instruction encodings. The base is carefully restricted to a minimal set of instructions sufficient to provide a reasonable target for compilers, assemblers, linkers, and operating systems (with additional supervisor-level operations), and so provides a convenient ISA and software toolchain “skeleton” around which more customized processor ISAs can be built ... Each base integer instruction set is characterized by the width of the integer registers and the corresponding size of the user address space. There are two primary base integer variants, RV32I and RV64I, described in Chapters 2 and 4, which provide 32-bit or 64-bit user-level address spaces respectively. Hardware implementations and operating systems might provide only one or both of RV32I and RV64I for user programs. Chapter 3 describes the RV32E subset variant of the RV32I base instruction set, which has been added to support small microcontrollers ... The base RISC-V ISA has fixed-length 32-bit instructions that must be naturally aligned on 32-bit boundaries. However, the standard RISC-V encoding scheme is designed to support ISA extensions with variable-length instructions, where each instruction can be any number of 16-bit instruction parcels in length and parcels are naturally aligned on 16-bit boundaries. ... We chose little-endian byte ordering for the RISC-V memory system because little-endian sys- tems are currently dominant commercially (all x86 systems; iOS, Android, and Windows for ARM). A minor point is that we have also found little-endian memory systems to be more nat- ural for hardware designers ... Chapter 2 RV32I Base Integer Instruction Set, Version 2.0

This chapter describes version 2.0 of the RV32I base integer instruction set. Much of the commentary also applies to the RV64I variant. RV32I was designed to be sufficient to form a compiler target and to support modern operating system environments. The ISA was also designed to reduce the hardware required in a minimal implementation. RV32I contains 47 unique instructions, though a simple implementation might cover the eight SCALL/SBREAK/CSRR* instructions with a single SYSTEM hardware instruction that always traps and might be able to implement the FENCE and FENCE.I instructions as NOPs, reducing hardware instruction count to 38 total. RV32I can emulate almost any other ISA extension (except the A extension, which requires additional hardware support for atomicity).

...

Figure 2.1 shows the user-visible state for the base integer subset. There are 31 general-purpose registers x1 – x31 , which hold integer values. Register x0 is hardwired to the constant 0. There is no hardwired subroutine return address link register, but the standard software calling convention uses register x1 to hold the return address on a call. For RV32, the x registers are 32 bits wide, and for RV64, they are 64 bits wide. This document uses the term XLEN to refer to the current width of an x register in bits (either 32 or 64). There is one additional user-visible register: the program counter pc holds the address of the current instruction. The number of available architectural registers can have large impacts on code size, performance, and energy consumption. Although 16 registers would arguably be sufficient for an integer ISA running compiled code, it is impossible to encode a complete ISA with 16 registers in 16-bit instructions using a 3-address format. Although a 2-address format would be possible, it would increase instruction count and lower efficiency. We wanted to avoid intermediate instruction sizes (such as Xtensa’s 24-bit instructions) to simplify base hardware implementations, and once a 32-bit instruction size was adopted, it was straightforward to support 32 integer registers. A larger number of integer registers also helps performance on high-performance code, where there can be extensive use of loop unrolling, software pipelining, and cache tiling. For these reasons, we chose a conventional size of 32 integer registers for the base ISA. Dynamic register usage tends to be dominated by a few frequently accessed registers, and regfile implementations can be optimized to reduce access energy for the frequently accessed registers [26].
The optional compressed 16-bit instruction format mostly only accesses 8 registers and hence can provide a dense instruction encoding, while additional instruction-set extensions could support a much larger register space (either flat or hierarchical) if desired. For resource-constrained embedded applications, we have defined the RV32E subset, which only has 16 registers (Chapter 3). " -- [1]

---

https://chrissherlock1.gitbooks.io/inside-libreoffice/content/system_abstraction_layer.html

---

The Michelson language (see my notes in [[plbook-plChMiscIntermedLangs?]] or their PDF) is a nice little low-level typed functional stack-based language. Too high level for Boot but maybe good for Ovm.

Except that i think that their regexp-based macro-defined instruction classes (P(A*AI)+R, C[AD]+R, DII+P) are too powerful for a low-level language (although cool for a HLL).

---

[2]:

"

  #define SYS_exit (SYS_BASE + 1)
  #define SYS_read (SYS_BASE + 3)
  #define SYS_write (SYS_BASE + 4)
  #define SYS_open (SYS_BASE + 5)
  #define SYS_close (SYS_BASE + 6)
  #define SYS_getpid (SYS_BASE + 20)
  #define SYS_kill (SYS_BASE + 37)
  #define SYS_gettimeofday (SYS_BASE + 78)
  #define SYS_clone (SYS_BASE + 120)
  #define SYS_rt_sigreturn (SYS_BASE + 173)
  #define SYS_rt_sigaction (SYS_BASE + 174)
  #define SYS_rt_sigprocmask (SYS_BASE + 175)
  #define SYS_sigaltstack (SYS_BASE + 186)
  #define SYS_mmap2 (SYS_BASE + 192)
  #define SYS_futex (SYS_BASE + 240)
  #define SYS_exit_group (SYS_BASE + 248)
  #define SYS_munmap (SYS_BASE + 91)
  #define SYS_madvise (SYS_BASE + 220)
  #define SYS_setitimer (SYS_BASE + 104)
  #define SYS_mincore (SYS_BASE + 219)
  #define SYS_gettid (SYS_BASE + 224)
  #define SYS_tkill (SYS_BASE + 238)
  #define SYS_sched_yield (SYS_BASE + 158)
  #define SYS_select (SYS_BASE + 142) // newselect
  #define SYS_ugetrlimit (SYS_BASE + 191)
  #define SYS_sched_getaffinity (SYS_BASE + 242)
  #define SYS_clock_gettime (SYS_BASE + 263)
  #define SYS_epoll_create (SYS_BASE + 250)
  #define SYS_epoll_ctl (SYS_BASE + 251)
  #define SYS_epoll_wait (SYS_BASE + 252)
  #define SYS_epoll_create1 (SYS_BASE + 357)
  #define SYS_fcntl (SYS_BASE + 55)
  #define SYS_access (SYS_BASE + 33)
  #define SYS_connect (SYS_BASE + 283)
  #define SYS_socket (SYS_BASE + 281) "
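Those syscall numbers are given as offsets from SYS_BASE. As an illustration (not from the quoted source; the dictionary below and the base values are my assumptions), resolving a few of the offsets against a base:

```python
# A few of the offsets quoted above, resolved against a SYS_BASE.
# The base differs per ABI; 0 is used here purely for illustration.
SYSCALL_OFFSETS = {
    "exit": 1, "read": 3, "write": 4, "open": 5, "close": 6,
    "getpid": 20, "gettimeofday": 78, "clone": 120, "futex": 240,
}

def syscall_number(name, sys_base=0):
    """Return the absolute syscall number: SYS_BASE + offset."""
    return sys_base + SYSCALL_OFFSETS[name]
```

With a zero base the absolute number equals the offset; a nonzero base simply shifts the whole table.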

---

naasking 1 day ago [-]

Except you can't sandbox or virtualize the clock because mx_time_get() doesn't require a handle, which makes timing attacks easier.

You also can't sandbox event and channel creation for the same reason. It looks like these can also DoS the kernel. In general, any operation you can perform without a handle tends to be subject to DoS and you can't virtualize it. They're also subject to a different access control policy than the rest of the system which is based around handles.

And it's not really necessary. Just reserve the first few handles in a process table for a clock handle, a channel constructor/factory handle and an event constructor/factory handle, and now these operations can be fully virtualized and they aren't subject to DoS because they can be rate-limited or at least traced back to specific handles which can be revoked.

Without tracing every operation to a handle, you have to pollute your model with more infrastructure to track this information, as with channels and events in Fuchsia.

[1] https://fuchsia.googlesource.com/magenta/+/master/docs/conce...


[3]

" Fuchsia

Sandboxing

This document describes how sandboxing works in Fuchsia.

An empty process has nothing

On Fuchsia, a newly created process has nothing. A newly created process cannot access any kernel objects, cannot allocate memory, and cannot even execute code. Of course, such a process isn't very useful, which is why we typically create processes with some initial resources and capabilities.

Most commonly, a process starts executing some code with an initial stack, some command line arguments, environment variables, a set of initial handles. One of the most important initial handles is the PA_VMAR_ROOT, which the process can use to map additional memory into its address space.

Namespaces are the gateway to the world

Some of the initial handles given to a process are directories that the process mounts into its namespace. These handles let the process discover and communicate with other processes running on the system, including file systems and other servers. See Namespaces for more details.

The namespace given to a process strongly influences how much of the system the process can influence. Therefore, configuring the sandbox in which a process runs amounts to configuring the process's namespace.

Archives and namespaces

In our current implementation, a process runs in a sandbox if its binary is contained in an archive (i.e., a FAR). As the package manager evolves, these details are likely to change.

An application run from an archive is given access to two namespaces by default:

    /svc, which is a bundle of services from the environment in which the application runs.
    /pkg, which is a read-only view of the archive containing the application.

A typical application will interact with a number of services from /svc in order to play some useful role in the system.

The far command-line tool can be used to inspect packages installed on the system. For example, far list --archive=/system/pkgs/root_presenter will list the contents of the root_presenter archive:

$ far list --archive=/system/pkgs/root_presenter
bin/app
data/cursor32.png
meta/sandbox

To access these resources at runtime, a process can use the /pkg namespace. For example, the root_presenter can access cursor32.png using the absolute path /pkg/data/cursor32.png.

Configuring additional namespaces

If a process requires access to additional resources (e.g., device drivers), the package can request access to additional names by including a sandbox metadata file in its package. For example, the following meta/sandbox file requests direct access to the framebuffer driver:

{ "dev": [ "class/framebuffer" ] }

In the current implementation, the AppMgr grants all such requests, but that is likely to change as the system evolves.

Building an archive

To build a package, use the package() macro in gn defined in packages/packages.gni. Specifically, to create a Fuchsia Archive (FAR) for your package (which will trigger sandboxing), set the archive flag to true. See the documentation for the package() macro for details about including resources.

For examples, see [4] and [5].

"

---

do we need to allow pointer comparisons so that eg with a software-implemented stack, you can tell if the stack top is equal to the beginning or end of the stack (so that further popping/pushing would lead to an overflow/underflow)? probably...
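A sketch of why: a software stack in flat memory needs exactly these pointer-equality comparisons to detect full and empty. (Hypothetical Python model, not Boot code; the class and names are made up.)

```python
class BoundedStack:
    """A software stack in a flat byte-addressed 'memory', with push/pop
    guarded by pointer comparisons against the stack's base and limit."""
    def __init__(self, memory, base, limit):
        self.mem = memory          # a bytearray standing in for RAM
        self.base = base           # lowest valid slot address
        self.limit = limit         # one past the highest valid slot
        self.top = base            # empty stack: top == base

    def push(self, byte):
        if self.top == self.limit:     # pointer equality check: full
            raise OverflowError("stack overflow")
        self.mem[self.top] = byte
        self.top += 1

    def pop(self):
        if self.top == self.base:      # pointer equality check: empty
            raise IndexError("stack underflow")
        self.top -= 1
        return self.mem[self.top]
```

Note that only equality comparisons on pointers are needed for the guards here; ordered comparisons (<, >) would additionally catch a top pointer that somehow skipped past the bound.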

---

log entries may need multiple 'topics', instead of a single priority, like in ETH
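A sketch of the difference (plain Python; the names are made up): with a single priority you can only threshold-filter, whereas multiple topics give independent subscription axes, as with Ethereum's LOG topics:

```python
def make_entry(msg, *topics):
    # an entry carries a set of topics rather than one scalar priority
    return {"msg": msg, "topics": frozenset(topics)}

def with_topic(log, topic):
    # consumers subscribe per-topic; one entry can match several filters
    return [e["msg"] for e in log if topic in e["topics"]]

log = [
    make_entry("gc pause", "gc", "perf"),
    make_entry("cache miss spike", "perf"),
    make_entry("heap grown", "gc"),
]
```

The first entry shows the point: it matches both the "gc" and "perf" filters, which a single scalar priority cannot express.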

---

Parallelism and the ARM Instruction Set Architecture by John Goodacre and Andrew N. Sloss:

" To achieve this design, the ARM team changed the RISC rules to include variable-cycle execution for certain instructions, an inline barrel shifter to preprocess one of the input registers, conditional execution, a compressed 16-bit Thumb instruction set, and some enhanced DSP instructions. • Variable cycle execution . Because it is a load- store architecture, the ARM processor must first load data into one of the general-purpose registers before processing it. Given the single- cycle constraint the original RISC design imposed, loading and storing each register indi- vidually would be inefficient. Thus, the ARM ISA instructions specifically load and store mul- tiple registers. These instructions take variable cycles to execute, depending on the number of registers the processor is transferring. This is particularly useful for saving and restoring con- text for a procedure’s prologue and epilogue. This directly improves code density, reduces instruction fetches, and reduces overall power consumption. ... Conditional execution . An ARM instruction executes only when it satisfies a particular con- dition. The condition is placed at the end of the instruction mnemonic and, by default, is set to always execute. This, for example, generates a savings of 12 bytes—42 percent—for the great- est common divisor algorithm implemented with and without conditional execution. • 16-bit Thumb instruction set. The condensed 16-bit version of the ARM instruction set allows higher code density at a slight perfor- mance cost. Because the Thumb 16-bit ISA is designed as a compiler target, it does not include the orthogonal register access of the ARM 32-bit ISA. Using the Thumb ISA can achieve a significant reduction in program size. In 2003, ARM announced its Thumb-2 tech- nology, which offers a further extension to code density. This technology increases the code density by mixing both 32- and 16-bit instructions in the same instruction stream. 
To achieve this, the developers incorporated unaligned address accesses into the processor design. Enhanced DSP instructions. Adding these instructions to the standard ISA supports flex- ible and fast 16 × 16 multiply and arithmetic saturation, which lets DSP-specific routines migrate to ARM. A single ARM processor could execute applications such as voice-over- IP without the requirement of having a sepa- rate DSP. The processor can use one example of these instructions, SMLAxy , to multiply the top or bottom 16 bits of a 32-bit register. The processor could multiply the top 16 bits of register r1 by the bottom 16 bits of register r2 and add the result to register r3. "

---

" Many favored using the Intel cmpxchg8b instruc- tion in these lock-free routines because it can exchange and compare 8 bytes of data atomically. Typically, this involved 4 bytes for payload and 4 bytes to distinguish between payload versions that could otherwise have the same value—the so-called A-B-A problem. The ARM exclusives provide atomicity using the data address rather than the data value ((i think they mean ll/sc, which they mentioned earlier in the article), so that the routines can atomically exchange data without experiencing the A-B-A problem. Exploiting this would, however, require rewriting much of the existing two-word exclusive code. Consequently, ARM added instructions for performing load-and store exclusives using various payload sizes— including 8 bytes—thus ensuring the direct porta- bility of existing multithreaded code "

---

" To support TLS in C and C++, the new keyword thread has been defined for use in defining and declaring a variable. Although not an official exten- sion of the language, using the keyword has gained support from many compiler writers. Variables defined and declared this way would automatically be allocated locally to each thread: "

---

" The GIC also uses various software-defined pat- terns to route interrupts to specific processors through the interrupt distributor. In addition to their dynamic load balancing of applications, SMP OSs often also dynamically balance the interrupt handler load. The OS can use the per-processor aliased control registers in the local private periph- eral bus to rapidly change the destination CPU for any particular interrupt. Another popular approach to interrupt distribu- tion sends an interrupt to a defined group of proces- sors. The MPCore views the first processor to accept the interrupt, typically the least loaded, as being best positioned to handle the interrupt. This flexible approach makes the GIC technology suit- able across the range of ARM processors. This stan- dardization, in turn, further simplifies how software interacts with an interrupt controller. "

---

this suggests 2 'profiles'; a 16-instruction profile, and a 64-instruction profile (or should we have a third, 32-instruction profile in between, as with the 28-instruction set detailed two paragraphs up?). However, to keep Boot simple, maybe there should just be 1 profile (16 instructions), and the 64-instruction profile should be Ovm SHORT. Otoh the 64-instruction ISA is a useful assembly ISA outside of Ovm, so maybe it should be Boot Extended.

---

interestingly, using this same scheme of TWO, ONE, ZERO pseudoinstructions, one could have an 8-bit bytecode with 13 instructions (4 three-operand instructions, 4 two-operand, 4 one-operand, and 4 zero-operand, minus 3 for TWO, ONE, ZERO): annotate, loadi, load, store, add, sub, leq, skipz, jrel, cpy, jd, halt, ? (we are giving up push, pop, read, write). But this would have only 2-bit operands (so 4 registers and you can only loadi constants 0-15). Not crazy, but pretty spartan, especially since we gave up push and pop.
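Checking the arithmetic of that layout (a sketch; the field widths are the ones proposed above):

```python
# 8-bit instruction word, 3-address format, 2-bit operand fields:
operand_bits = 2
opcode_bits = 8 - 3 * operand_bits        # 2 bits of opcode per level
opcodes_per_level = 2 ** opcode_bits      # 4 opcodes at each level

# Four operand-arity levels (3, 2, 1, 0 operands); reaching the lower
# three levels costs one opcode each for the TWO, ONE, ZERO prefixes:
usable_instructions = 4 * opcodes_per_level - 3

# 2-bit operand fields address 2**2 registers:
registers = 2 ** operand_bits
```

This confirms the 13-instruction, 4-register figures above.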

could it be worth offering this encoding? probably not; although it's almost 2x more efficient than a 16-bit encoding, having only 4 registers probably makes the size of most subroutines balloon. For example, consider a simple memcpy-like subroutine, that counts down in a loop from n to 0, moving backwards in memory and copying; the straightforward way to do this involves 5 registers (note: we are still using stuff like 'load r3 from_baseaddr_in_memory' as a placeholder/shorthand for a longer sequence of instructions):

(note: i havent run this, it probably has bugs)

load r1 n_in_memory ; i = n
load r2 from_baseaddr_in_memory
load r3 to_baseaddr_in_memory
:LOOPBEGIN
bz r1 :END ; if i == 0, then exit this loop
add r4 r2 r1 ; r4 = from_baseaddr + i
load r4 r4 ; r4 = $(from_baseaddr + i)
add r5 r3 r1 ; r5 = to_baseaddr + i
store r5 r4 ; $(to_baseaddr + i) = r4
sub r1 r1 1 ; i--
jrel :LOOPBEGIN
:END

total: 10 instructions

how can this be modified to use only 4 registers? here's one way :

(note: i havent run this, it probably has bugs)

load r1 scratch_memory_addr_in_memory ; scratch_memory_addr is a block of memory where we'll store stuff
                                      ; that we would have liked to put into registers

; load i into r3, initializing it to n
load r3 n_in_memory

; load from_baseaddr into scratch+0
load r4 from_baseaddr_in_memory
store r1 r4

; load to_baseaddr into scratch+1
load r4 to_baseaddr_in_memory
loadi r2 1
add r2 r1 r2
store r2 r4

:LOOPBEGIN
; if i == 0, exit the loop
bz r3 :END

; load from_baseaddr into r4
load r4 r1 ; r4 = from_baseaddr

; load from_baseaddr + i into r4
add r4 r4 r3

; load $(from_baseaddr + i) into r4
load r4 r4

; right now r1 = scratch, r2 is unused, r3 = i, r4 = the thing we want to copy

; load to_baseaddr into r2
loadi r2 1
add r2 r1 r2
load r2 r2 ; r2 = to_baseaddr

; load to_baseaddr + i into r2
add r2 r2 r3

; store the thing we want to copy into $(to_baseaddr + i)
store r2 r4

sub r3 r3 1 ; i--

jrel :LOOPBEGIN
:END

total: 19 instructions

how are we using our 4 registers? well:

In almost any case, when we want to work with more variables than we have registers for, we'll need something like r1 and r2. And we'll always need at least one register for data manipulation (r4). So we only have one register left (r3) to persistently hold a local variable so that we don't have to constantly be loading and saving that local variable from/to scratch; in this case, we use that extra spot to hold the loop counter. So it seems like we have 2 registers for the overhead of swapping data in and out of scratch, 1 register as the bare minimum for computation, leaving only one register for caching local variables. So you can see why 4 registers is barely enough, and why even 8 registers would be much better; with 8 regs, you'd still have the same 3-register overhead/bare minimum, but then the registers free for caching local vars goes up from 1 to 5, a 5x gain.

so, in this case the code size is multiplied by 1.9 (i'm sure someone else could come up with something more efficient, but this is just a ballpark estimate) in exchange for an encoding that is almost 2x denser; total savings < 1.9/2 (i say 'almost' 2 because, assuming we also want to offer the 16-bit encoding, we have to spend a bit every now and then to switch encodings or to say which encoding we are currently in). In actuality, there is probably no savings; that 'almost' probably adds at least 1 extra instruction, possibly as much as 3. Now, surely someone else could come up with a more efficient way to write this algorithm than my naive version, and surely there are other algorithms which benefit more; but overall, based on this example i expect only a small savings in encoding space, and i don't think this small savings is worth the additional complexity of having another encoding.

so, no, we won't actually offer this 8-bit encoding.

---

but wait; if the problem with the 8-bit encoding is just the number of registers, why not use a 2-operand format instead of 3-operand, and increase the bits per operand from 2 to 3? Now we can't use our TWO ONE ZERO scheme to encode 13 instructions; instead we'd have 10 instructions:

ONE
instr1 x y
instr2 x y
instr3 x y

ZERO
instr4 x
instr5 x
instr6 x

instr7
instr8
instr9
instr10

the 12 instructions we wanted were: annotate, loadi, load, store, add, sub, leq, bz, jrel, cpy, jd, halt

we can give up annotate, but it's difficult to give up any of the others. I suppose jrel has to go. But we don't have enough 2-operand spots to fit all of loadi, load, store, add, sub, leq, cpy, bz. But we could use an accumulator (a 'default register' that we use when we don't have space to specify one):

ONE
cpy
load
store

ZERO
loadi
add
jd

leq
sub
bz
halt

If bz has no arguments, then we can't say how far to branch, so it must really just be SKIPZ. But now branching is really annoying; we don't have jrel, but SKIPZ only gives us one instruction which is skipped over (one instruction which is conditionally executed in the case that the accumulator is zero). Without JREL, if we want to do a PC-relative jump upon zero, we have to load the PC, load a constant, add them together, then jd to that. Since we only have 1 instruction which is skipped, we need to calculate the jd target before the SKIPZ and keep it in a register (slightly inefficient because this code is executed unconditionally instead of just when it is needed). Alternately, we could just define SKIPZ to always skip over 4 instructions instead of 1 (slightly inefficient in the case when we only needed to do something shorter than that in the conditional branch). We don't have a LOADPC instruction, but that's okay, we have 8 registers, just make one of them the PC.

Since the 'sub' instruction takes zero arguments, we'll say that it always subtracts r2 from r1.

so let's try the memcpy-like algorithm again. Remember that now we have 8 registers. R1 is the accumulator:

(again, i haven't run this, it is almost certainly buggy)

load r3 n_in_memory ; r3 = i = n
load r4 from_baseaddr_in_memory
load r5 to_baseaddr_in_memory
:LOOPBEGIN

; r6 = :END
cpy r6 PC
loadi 25 ; acc = 25 ;; actually this would be a longer sequence of instructions, we can only directly loadi things between 0 and 8 (3 bits)
add r6 ; acc += r6 ;; oops i forgot a cpy r6 acc here

; r7 = :ifnzero2
cpy r7 PC
loadi 6 ; acc = 6
add r7 ; acc += r7 ;; oops i forgot a cpy r7 acc here

cpy r1 r3 ; acc = i
skipz ; if i == 0, then skip 1
:ifnzero1 jd r7
:ifzero jd r6
:ifnzero2 cpy r1 r4 ; acc = from_baseaddr
add r3 ; acc += i
load r7 r1 ; r7 = $(from_baseaddr + i)

cpy r1 r5 ; acc = to_baseaddr
add r3 ; acc += i
store r1 r7 ; $(to_baseaddr + i) = r6

; r3--
loadi 1
cpy r2 r1
cpy r1 r3
sub
cpy r3 r1

; jrel :LOOPBEGIN
loadi 24 ; acc = 24 ;; actually this would be a longer sequence of instructions, we can only directly loadi things between 0 and 8 (3 bits)
cpy r2 r1
cpy r1 PC
sub
jd r1

:END

total: 29 instructions. Even worse.

but wait, the above encoding is wrong; it's true that we only have 4 opcodes for 2-operand instructions, because we have 2 opcode bits, but then we have 8 opcodes for one-operand instructions, and 8 more for zero operand instructions, because these are using the 3-bit operand fields as an extended opcode. So we can add annotate and jrel back in and make leq and sub and bz one-operand instead of zero-operand:

(i'm not going to recompute the jump offsets, so i just put a '~' in them to indicate that they're wrong)

load r3 n_in_memory ; r3 = i = n
load r4 from_baseaddr_in_memory
load r5 to_baseaddr_in_memory
:LOOPBEGIN

; r6 = :END
loadi ~25 ; acc = ~25 ;; actually this would be a longer sequence of instructions, we can only directly loadi things between 0 and 8 (3 bits)
cpy r6 acc

; r7 = :ifnzero2
loadi ~6 ; acc = ~6
cpy r7 acc

bz r3 ; if i == 0, then skip 1
:ifnzero1 jrel r7
:ifzero jrel r6
:ifnzero2 cpy r1 r4 ; acc = from_baseaddr
add r3 ; acc += i
load r7 r1 ; r7 = $(from_baseaddr + i)

cpy r1 r5 ; acc = to_baseaddr
add r3 ; acc += i
store r1 r7 ; $(to_baseaddr + i) = r6

; r3--
loadi 1
cpy r2 r1
cpy r1 r3
sub r2
cpy r3 r1

; jrel :LOOPBEGIN
loadi ~-24 ; acc = ~-24 ;; actually this would be a longer sequence of instructions, we can only directly loadi things between 0 and 8 (3 bits) (note: we actually have uints, but jrel interprets them as signed)
jrel acc

:END

total: 23 instructions. Still much worse than 8.

Well, what if we made the opcode field 3 bits, and made the second operand field 2 bits? Now we can't have a 'cpy' instruction that can copy any register into any other register, but we can have 'cpy' and 'cpyreverse' that vary in whether the first or second (long or short) operand field is the source, allowing us to accomplish any->any copying in two instructions.

we have:

ONE
cpy
cpyrev
load
store
loadi
add
bz

ZERO
sub
leq
jd
jrel
halt
...

and we have:

load r3 n_in_memory ; r3 = i = n
load r5 from_baseaddr_in_memory
load r6 to_baseaddr_in_memory
:LOOPBEGIN

; r4 = :END
loadi ~25 r4

bz r3 r4 ; if i == 0, then goto :END

add r5 r3 ; acc = from_baseaddr + i
load r4 r1 ; r4 = $(from_baseaddr + i)

cpy r1 r5 ; acc = to_baseaddr
add r6 r3 ; acc = to_baseaddr + i
store r1 r4 ; $(to_baseaddr + i) = r4

; r3--
loadi 1 r4
cpy r1 r3
sub r4
cpy r3 r1

; jrel :LOOPBEGIN
loadi ~-24 r4
jrel r4
:END

so we're at 16 instructions. Which is only slightly better than 18.

So no, we won't offer an 8-bit encoding.

---

OLD: Note: although there is no instruction named EQ-UINT in Boot, you can do a BITXOR between two values, which will result in 0 if and only if the values are equal.

(now we have to do two leqs, since bitxor is moved to bootx)
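Both identities, sketched in Python on unsigned ints (illustrative only; Boot/BootX instruction names differ):

```python
def eq_via_bitxor(a, b):
    # old approach: BITXOR of two uints is 0 iff the operands are equal
    return (a ^ b) == 0

def eq_via_two_leqs(a, b):
    # with BITXOR moved to BootX: a == b iff a <= b and b <= a,
    # so equality costs two LEQ-UINTs plus a combine
    return (a <= b) and (b <= a)
```

The two-LEQ version is one instruction longer but keeps BITXOR out of the core set.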

---

"gcc is made to be portable as long as your architecture fits some predefined notions (for example, at least 32 bit integers and a flat address space)."

---

" Assembly language is fine, but you really want a high level language. Of course, your first thought will be to port gcc, which is a great C and C++ compiler (among other things). There’s good news, bad news, and worse news. The good news is that gcc is made to be portable as long as your architecture fits some predefined notions (for example, at least 32 bit integers and a flat address space). The bad news is that it is fairly difficult to do a port. The worst news is there is only a limited amount of documentation and a lot of it is very out of date.

Still, it is possible. There are only three things you have to create to produce a cross compiler:

    A machine description
    A machine header
    Some machine-specific functions

However, building these is fairly complex and uses a Lisp-like notation that isn’t always intuitive. If you want to tackle it, there are several documents of interest. There’s a very good slide show overview, very out of date official documentation, and some guy’s master’s thesis. However, be prepared to read a lot of source code and experiment, too. Then you’ll probably also want to port gdb, which is also non-trivial (see the video below).

https://www.youtube.com/watch?v=kgFr4Jnhff0

There are other C compilers. The llvm project has clang which you might find slightly easier to port, although it is still not what I would consider trivial. The lcc compiler started out as a book in 1995. It uses iburg to do code generation, and that tool might be useful with some other retargeting projects, as well. Although the vbcc compiler isn’t frequently updated, the documentation of its backend looks very good and it appears to be one of the easier compilers to port. There is a portable C compiler, PCC, that is quite venerable. I’ve seen people port some of the “small C” variants to a different CPU, although since they aren’t standard C, that is only of limited use.

Keep in mind, there’s more to doing a gcc port than just the C compiler. You’ll need to define your ABI (Application Binary Interface; basically how memory is organized and arguments passed). You’ll also need to provide at least some bootstrap C library, although you may be able to repurpose a lot of the standard library after you get the compiler working.

So maybe the C compiler is a bit much. There are other ways to get a high level language going. Producing a workable JVM (or other virtual machine) would allow you to cross compile Java and is probably less work overall. Still not easy, though, and the performance of your JVM will probably not be even close to a compiled program. I have found that versions of Forth are easy to get going. Jones on Forth is a good place to start if you can find a backup copy of it.

If you do bite the bullet and build a C compiler, the operating system is the next hurdle. Most Linux builds assume you have advanced features like memory management. There is a version, uClinux, that might be slightly easier to port. You might be better off looking at something like Contiki or FreeRTOS.

...

 Al Williams says:	July 31, 2015 at 10:36 am

Yeah as far as I know uClinux is the only real choice (other than totally roll your own) that can be happy without an MMU. As long as you have one processor, the MMU isn’t bad. Gets ugly as you add cores though (depending on your memory bus architecture).

nes says: July 31, 2015 at 4:46 pm

I find getting the MMU working right to consistently be the biggest pain on any SoC. The mainline Linux kernel can be configured to work without one though and there's support in devicetree for some ARM chips with no MMU. Seen several pop up on the mailing list just recently.

If not intending to run Linux then I would start by trimming down an existing simple ISA like MIPS 2k and port a simple compiler which already has support for it like lcc. If the target is bare metal then you could get by without binutils and libc which also saves a lot of effort. Should be doable in a few evenings.

"

[6]

note: i saved the download from http://www.drdobbs.com/developer-network-small-c-compiler-book/184415519 to archive9. It says " This ISO image of a CD-ROM contains the definitive collection of Small-C related information and source code. These pages include the full text to James Hendrix's book A Small-C Compiler: Language, Usage, Theory, and Design."

---

let's analyze the classes of core instructions, as well as the added instructions that took us from 16 up to 29:

clearly core instructions:

total: 8

probably core instructions:

total: 16 (9 clearly core + 8 probably core)

other things we really really want:

other things we added:

and maybe loadcodeptr, although that makes it 30, not 29.

---

one thing i notice is that other instruction sets seem to be able to do more with fewer instructions, compared to our 60-instruction BootX. For example,

LuaVM has only 37 instructions ([7] page 4), yet that's enough for a high-level VM with signed arithmetic (including div/mod/pow), tables, call/return, and for loops. Shen's KLambda [8] has "57 primitive functions, special forms and required symbols", which is a high-level core language including exceptions, > < <= >=, strings, vectors, cons lists, open/close/read/write/stdin/stdout, eval, time, and 6 sysinfo-type queries.

What are we spending opcodes on that these guys aren't?

todo

---

in LLVM, " the malloc instruction was removed because it no longer offered any advantages over recognizing the standard library call "malloc". So the frontend should just be modified to generate a call to the "malloc" function. " Bn4z-k

LLVM still has an instruction, alloca, for allocation on the stack

---

" Linux Only

A lot of functionality we rely on is simply not available on other operating systems. dbus-daemon(1) is still around (and will stay around), so there will always be a working D-Bus Message Bus for other operating systems.

Note that we rely on several peculiar features of the linux kernel to implement a secure message broker (including its accounting for inflight FDs, its output queueing on AF_UNIX including the IOCOUTQ ioctl, edge-triggered event notification, SO_PEERGROUPS ioctl, and more). We fixed several bugs upstream just few weeks ago, and we will continue to do so. But we are not in a position to review other kernels for the same guarantees.

" [9]

---

" Animats 6 days ago [-]

> We rather consider a bus a set of distinct peers with no global state.

If they've gone that far, they may as well implement QNX messaging, which is known to work well. QNX has an entire POSIX implementation based on QNX's messaging system, so it's known to work. Plus it does hard real time.

The basic primitives work like a subroutine call. There's MsgSend (send and wait for reply), MsgReceive (wait for a request), and MsgReply (reply to a request). There's also MsgSendPulse (send a message, no reply, no wait) but it's seldom used. Messages are just arrays of bytes; the messaging system has no interest in content. Receivers can tell the process ID of the sender, so they can do security checks. All I/O is done through this mechanism; when you call "write()", the library does a MsgSend.

Services can give their endpoint a pathname, so callers can find them.

The call/reply approach makes the hard cases work right. If the receiver isn't there or has exited, the sender gets an error return. There's a timeout mechanism for sending; in QNX, anything that blocks can have a timeout. If a sender exits while waiting for a reply, that doesn't hurt the receiver. So the "cancellation" problem is solved. If you want to do something else in a process while waiting for a reply, you can use more threads in the sender. On the receive side, you can have multiple threads taking requests via MsgReceive, handling the requests, and replying via MsgReply, so the system scales.

CPU scheduling is integrated with messaging. On a MsgSend, CPU control is usually transferred from sender to receiver immediately, without a pass through the scheduler. The sending thread blocks and the receiving thread unblocks.

With unidirectional messaging (Mach, etc.) and async systems, it's usually necessary to build some protocol on top of messaging to handle errors. It's easy to get stall situations. ("He didn't call back! He said he'd call back! He promised he'd call back!") There's also a scheduling problem - A sends to B but doesn't block, B unblocks, A waits on a pipe/queue for B and blocks, B sends to A and doesn't block, A unblocks. This usually results in several trips through the scheduler and bad scheduling behavior when there's heavy traffic.

There's years (decades, even) of success behind QNX messaging, yet people keep re-inventing the wheel and coming up with inferior designs.

"

that sounds a lot like Urbit's RPC thingee, but without the stack.
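The call/reply shape described above is easy to model. Here's a minimal Python sketch, with threads and queues standing in for kernel channels; the msg_send/msg_receive/msg_reply names mirror the QNX calls, but this is an illustration of the semantics, not the real API. The sender blocks until the receiver replies, and the receiver gets a handle it can use to reply later.

```python
# Sketch of QNX-style synchronous messaging (MsgSend / MsgReceive /
# MsgReply) using threads and queues. Names are illustrative only.
import queue
import threading

class Channel:
    def __init__(self):
        self._requests = queue.Queue()

    def msg_send(self, data, timeout=None):
        """Send a request and block until the receiver replies (cf. MsgSend)."""
        reply_box = queue.Queue(maxsize=1)
        self._requests.put((data, reply_box))
        return reply_box.get(timeout=timeout)  # raises queue.Empty on timeout

    def msg_receive(self):
        """Block until a request arrives (cf. MsgReceive); returns (rcvid, data)."""
        data, reply_box = self._requests.get()
        return reply_box, data

    def msg_reply(self, rcvid, data):
        """Unblock the sender with a reply (cf. MsgReply)."""
        rcvid.put(data)

def server(chan):
    # A receive/reply loop; multiple such threads would scale the receive side.
    while True:
        rcvid, data = chan.msg_receive()
        chan.msg_reply(rcvid, data.upper())

chan = Channel()
threading.Thread(target=server, args=(chan,), daemon=True).start()
print(chan.msg_send(b"hello"))  # prints b'HELLO'
```

Note how the "cancellation" cases fall out: a timeout on `reply_box.get` corresponds to QNX's send timeout, and a sender that gives up simply abandons its reply box without hurting the receiver.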

---

toread https://webkit.org/blog/7846/concurrent-javascript-it-can-work/

---

---

AssemblyScript

https://github.com/AssemblyScript/assemblyscript

a subset of TypeScript that compiles to WebAssembly

" AssemblyScript tries to support its features as closely as reasonable while not supporting certain dynamic constructs intentionally:

    All types must be annotated to avoid possibly unwanted implicit type conversions
    Optional function parameters require an initializer expression
    Union types (except classType | null representing a nullable), any and undefined are not supported by design
    The result of logical && / || expressions is always bool

...

Once configured, the following AssemblyScript-specific types become available:

Type  | Aliases         | Native type | sizeof | Description
i8    | int8, sbyte     | i32         | 1      | An 8-bit signed integer.
u8    | uint8, byte     | i32         | 1      | An 8-bit unsigned integer.
i16   | int16, short    | i32         | 2      | A 16-bit signed integer.
u16   | uint16, ushort  | i32         | 2      | A 16-bit unsigned integer.
i32   | int32, int      | i32         | 4      | A 32-bit signed integer.
u32   | uint32, uint    | i32         | 4      | A 32-bit unsigned integer.
i64   | int64, long     | i64         | 8      | A 64-bit signed integer.
u64   | uint64, ulong   | i64         | 8      | A 64-bit unsigned integer.
usize | uintptr         | i32 / i64   | 4 / 8  | A 32-bit unsigned integer when targeting 32-bit WebAssembly. A 64-bit unsigned integer when targeting 64-bit WebAssembly.
f32   | float32, float  | f32         | 4      | A 32-bit float.
f64   | float64, double | f64         | 8      | A 64-bit float.
bool  | -               | i32         | 1      | A 1-bit unsigned integer.
void  | -               | none        | -      | No return type

While generating a warning to avoid type confusion, the JavaScript types number and boolean resolve to f64 and bool respectively.

WebAssembly-specific operations are available as built-in functions that translate to the respective opcode directly:

    rotl(value: i32, shift: i32): i32
    Performs the sign-agnostic rotate left operation on a 32-bit integer.
    rotll(value: i64, shift: i64): i64
    Performs the sign-agnostic rotate left operation on a 64-bit integer.
    rotr(value: i32, shift: i32): i32
    Performs the sign-agnostic rotate right operation on a 32-bit integer.
    rotrl(value: i64, shift: i64): i64
    Performs the sign-agnostic rotate right operation on a 64-bit integer.
    clz(value: i32): i32
    Performs the sign-agnostic count leading zero bits operation on a 32-bit integer. All zero bits are considered leading if the value is zero.
    clzl(value: i64): i64
    Performs the sign-agnostic count leading zero bits operation on a 64-bit integer. All zero bits are considered leading if the value is zero.
    ctz(value: i32): i32
    Performs the sign-agnostic count trailing zero bits operation on a 32-bit integer. All zero bits are considered trailing if the value is zero.
    ctzl(value: i64): i64
    Performs the sign-agnostic count trailing zero bits operation on a 64-bit integer. All zero bits are considered trailing if the value is zero.
    popcnt(value: i32): i32
    Performs the sign-agnostic count number of one bits operation on a 32-bit integer.
    popcntl(value: i64): i64
    Performs the sign-agnostic count number of one bits operation on a 64-bit integer.
    abs(value: f64): f64
    Computes the absolute value of a 64-bit float.
    absf(value: f32): f32
    Computes the absolute value of a 32-bit float.
    ceil(value: f64): f64
    Performs the ceiling operation on a 64-bit float.
    ceilf(value: f32): f32
    Performs the ceiling operation on a 32-bit float.
    floor(value: f64): f64
    Performs the floor operation on a 64-bit float.
    floorf(value: f32): f32
    Performs the floor operation on a 32-bit float.
    sqrt(value: f64): f64
    Calculates the square root of a 64-bit float.
    sqrtf(value: f32): f32
    Calculates the square root of a 32-bit float.
    trunc(value: f64): f64
    Rounds to the nearest integer towards zero of a 64-bit float.
    truncf(value: f32): f32
    Rounds to the nearest integer towards zero of a 32-bit float.
    nearest(value: f64): f64
    Rounds to the nearest integer tied to even of a 64-bit float.
    nearestf(value: f32): f32
    Rounds to the nearest integer tied to even of a 32-bit float.
    min(left: f64, right: f64): f64
    Determines the minimum of two 64-bit floats. If either operand is NaN, returns NaN.
    minf(left: f32, right: f32): f32
    Determines the minimum of two 32-bit floats. If either operand is NaN, returns NaN.
    max(left: f64, right: f64): f64
    Determines the maximum of two 64-bit floats. If either operand is NaN, returns NaN.
    maxf(left: f32, right: f32): f32
    Determines the maximum of two 32-bit floats. If either operand is NaN, returns NaN.
    copysign(x: f64, y: f64): f64
    Composes a 64-bit float from the magnitude of x and the sign of y.
    copysignf(x: f32, y: f32): f32
    Composes a 32-bit float from the magnitude of x and the sign of y.
    reinterpreti(value: f32): i32
    Reinterprets the bits of a 32-bit float as a 32-bit integer.
    reinterpretl(value: f64): i64
    Reinterprets the bits of a 64-bit float as a 64-bit integer.
    reinterpretf(value: i32): f32
    Reinterprets the bits of a 32-bit integer as a 32-bit float.
    reinterpretd(value: i64): f64
    Reinterprets the bits of a 64-bit integer as a 64-bit double.
    current_memory(): i32
    Returns the current memory size in units of pages. One page is 64kb.
    grow_memory(value: i32): i32
    Grows linear memory by a given unsigned delta of pages. One page is 64kb. Returns the previous memory size in units of pages or -1 on failure.
    unreachable(): void
    Emits an unreachable operation that results in a runtime error when executed.
    load<T>(offset: usize): T
    Loads a value of the specified type from memory.
    store<T>(offset: usize, value: T): void
    Stores a value of the specified type to memory.

The following AssemblyScript-specific operations are implemented as built-ins as well:

    sizeof<T>(): usize
    Determines the byte size of the specified core or class type. Compiles to a constant.
    unsafe_cast<T1,T2>(value: T1): T2
    Casts a value of type T1 to a value of type T2. Useful for casting classes to pointers and vice-versa. Does not perform any checks.
    isNaN(value: f64): bool
    Tests if a 64-bit float is a NaN.
    isNaNf(value: f32): bool
    Tests if a 32-bit float is a NaN.
    isFinite(value: f64): bool
    Tests if a 64-bit float is finite.
    isFinitef(value: f32): bool
    Tests if a 32-bit float is finite.

These constants are present as immutable globals (note that optimizers might inline them):

    NaN: f64
    NaN (not a number) as a 64-bit float.
    NaNf: f32
    NaN (not a number) as a 32-bit float.
    Infinity: f64
    Positive infinity as a 64-bit float.
    Infinityf: f32
    Positive infinity as a 32-bit float.

By default, AssemblyScript's memory management runtime will be linked statically:

    memcpy(dest: usize, src: usize, size: usize): usize
    Copies data from one chunk of memory to another.
    memset(dest: usize, c: i32, size: usize): usize
    Sets a chunk of memory to the provided value c. Usually used to reset it to all 0s.
    memcmp(vl: usize, vr: usize, n: usize): i32
    Compares a chunk of memory to another. Returns 0 if both are equal, otherwise vl[i] - vr[i] at the first difference's byte offset i.
    malloc(size: usize): usize
    Allocates a chunk of memory of the specified size.
    realloc(ptr: usize, size: usize): usize
    Changes the size of an allocated memory block.
    free(ptr: usize): void
    Frees a previously allocated chunk of memory.

Linking in the runtime adds up to 14kb to a module, but the optimizer is able to eliminate unused runtime code. "
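As a sanity check on the integer builtins listed above, here's a small Python model of rotl/clz/ctz/popcnt at 32 bits. Python ints are unbounded, so results are masked to 32 bits explicitly; this is a behavioral sketch, not the AssemblyScript implementation.

```python
# 32-bit models of the WebAssembly integer builtins rotl, clz, ctz, popcnt.
MASK32 = 0xFFFFFFFF

def rotl(value, shift):
    """Sign-agnostic rotate left of a 32-bit value."""
    shift &= 31
    return ((value << shift) | (value >> (32 - shift))) & MASK32

def clz(value):
    """Count leading zeros; all 32 bits are leading when value is zero."""
    return 32 - value.bit_length() if value else 32

def ctz(value):
    """Count trailing zeros; isolate the lowest set bit with value & -value."""
    return (value & -value).bit_length() - 1 if value else 32

def popcnt(value):
    """Count set bits."""
    return bin(value & MASK32).count("1")

print(rotl(0x80000001, 1))            # 3 (top bit wraps around)
print(clz(1), ctz(8), popcnt(0xF0))   # 31 3 4
```

The zero cases match the quoted spec text: clz(0) and ctz(0) are both 32, since every bit counts as leading (or trailing) when the value is zero.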

---

spopejoy 84 days ago [-]

> It sure makes me wonder if Ethereum would do better with a less forgiving programming language.

This will be hard. While Solidity certainly has problems unto itself, some of its insecurity comes from the EVM's design, which is almost laughably low level and thus very hard to reason about. It certainly doesn't seem to be informed by modern VMs like LLVM, JVM or BEAM, which know a great deal more about the semantics of the program they're running and have things like dispatching features. My guess is the approach was "Bitcoin with a few more opcodes" and therefore more like an 80s-era CPU than a "VM".

As a result, the compiler is tasked with running the whole show. Add to this the coupling of RPC to Solidity's mangle-check-and-jump dispatch approach, and you start to see why there's been so little innovation in this area: Solidity has a tight grip on the Ethereum ecosystem. Also, writing a compiler to this substrate is not easy, and you're penalized for code size (there's a limit on how big a contract can be).

---

" A lot of the limitations of Solidity are due to constraints of the EVM, but the EVM is evolving. For instance, the next planned hard fork will enable the EVM to pass dynamically sized data (e.g. solidity arrays or strings) between call stacks. Previously arrays had to be fixed-size, so you'd have to define a fixed maximum size, then always return data of that size (usually a lot of empty elements). This feature, like many others, is somewhat challenging to design because every execution step of the EVM must be metered by a "gas fee"; the first version of the EVM kept it simple by only allowing return data to be a fixed size. See these issues for background "

---

" The XREAD command is designed in order to read, at the same time, from multiple streams just specifying the ID of the last entry in the stream we got. Moreover we can request to block if no data is available, to be unblocked when data arrives. Similarly to what happens with blocking list operations, but here data is not consumed from the stream, and multiple clients can access the same data at the same time.

This is a canonical example of XREAD call:

> XREAD BLOCK 5000 STREAMS mystream otherstream $ $

And it means: get data from “mystream” and “otherstream”. If no data is available, block the client, with a timeout of 5000 milliseconds. After the STREAMS option we specify the keys we want to listen for, and the last ID we have. However a special ID of “$” means: assume I’ve all the elements that there are in the stream right now, so give me just starting from the next element arriving.

If, from another client, I send the command:

> XADD otherstream * message “Hi There”

This is what happens on the XREAD side:

1) 1) "otherstream"
   2) 1) 1) 1506935385635.0
         2) 1) "message"
            2) "Hi There"

We get the key that received data, together with the data received. In the next call, we’ll likely use the ID of the last message received:

> XREAD BLOCK 5000 STREAMS mystream otherstream $ 1506935385635.0

And so forth. However note that with this usage pattern, it is possible that the client will connect again after a very big delay (because it took time to process messages, or for any other reason). In such a case, in the meantime, a lot of messages could pile up, so it is wise to always use the COUNT option with XREAD, in order to make sure the client will not be flooded with messages and the server will not have to lose too much time just serving tons of messages to a single client.

...

Consumer groups (work in progress)

This is the first of the features that is not already implemented in Redis, but is a work in progress. It is also the idea more clearly inspired by Kafka, even if implemented here in a pretty different way. The gist is that with XREAD, clients can also add a “GROUP <name>” option. Automatically all the clients in the same group will get *different* messages. Of course there could be multiple groups reading from the same stream, in such cases all groups will receive duplicates of the same messages arriving in the stream, but within each group, messages will not be repeated.

An extension to groups is that it will be possible to specify a “RETRY <milliseconds>” option when groups are specified: in this case, if messages are not acknowledged for processing with XACK, they will be delivered again after the specified amount of milliseconds. This provides some best effort reliability to the delivering of the messages, in case the client has no private means to mark messages as processed. This part is a work in progress as well. "
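The consumer-group behavior described above (every group sees every message, but within a group each message is delivered to only one reader) can be sketched in a few lines of plain Python. Names like xadd/xread_group only echo the Redis commands; this is a toy model of the delivery semantics, not a client for the real thing, and it omits blocking, acknowledgement, and RETRY.

```python
# Toy model of Redis stream consumer-group delivery: one cursor per group,
# so groups see all entries but entries within a group are handed out once.
import itertools
from collections import defaultdict

class Stream:
    def __init__(self):
        self.entries = []                       # list of (id, payload)
        self._ids = itertools.count(1)          # monotonically increasing IDs
        self.group_cursors = defaultdict(int)   # group name -> next index

    def xadd(self, payload):
        entry_id = next(self._ids)
        self.entries.append((entry_id, payload))
        return entry_id

    def xread_group(self, group, count=1):
        """Deliver the next `count` not-yet-delivered entries for this group."""
        cursor = self.group_cursors[group]
        batch = self.entries[cursor:cursor + count]
        self.group_cursors[group] += len(batch)
        return batch

s = Stream()
for msg in ("a", "b", "c"):
    s.xadd(msg)

# Two reads in group "g1" split the stream; group "g2" sees everything again.
print(s.xread_group("g1", count=2))   # [(1, 'a'), (2, 'b')]
print(s.xread_group("g1"))            # [(3, 'c')]
print(s.xread_group("g2", count=3))   # [(1, 'a'), (2, 'b'), (3, 'c')]
```

The `count` parameter here plays the same role as COUNT in XREAD: it bounds how much a slow client gets in one batch.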

---

https://en.wikipedia.org/wiki/Seccomp

seccomp (short for secure computing mode) is a computer security facility in the Linux kernel. It was merged into the Linux kernel mainline in kernel version 2.6.12, which was released on March 8, 2005.[1] seccomp allows a process to make a one-way transition into a "secure" state where it cannot make any system calls except exit(), sigreturn(), read() and write() to already-open file descriptors. Should it attempt any other system calls, the kernel will terminate the process with SIGKILL. In this sense, it does not virtualize the system's resources but isolates the process from them entirely.

seccomp mode is enabled via the prctl(2) system call using the PR_SET_SECCOMP argument, or (since Linux kernel 3.17[2]) via the seccomp(2) system call.[3] seccomp mode used to be enabled by writing to a file, /proc/self/seccomp, but this method was removed in favor of prctl().[4] In some kernel versions, seccomp disables the RDTSC x86 instruction, which returns the number of elapsed processor cycles since power-on, used for high-precision timing.[5]

seccomp-bpf is an extension to seccomp[6] that allows filtering of system calls using a configurable policy implemented using Berkeley Packet Filter rules. It is used by OpenSSH and vsftpd as well as the Google Chrome/Chromium web browsers on Chrome OS and Linux.[7] (In this regard seccomp-bpf achieves similar functionality to the older systrace—which seems to be no longer supported for Linux).
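The strict mode described above can be demonstrated from Python on Linux. This is a hedged sketch: the child enables SECCOMP_MODE_STRICT via prctl(2), after which write() still works but the first forbidden syscall (here openat, via os.open) gets it SIGKILLed. It falls back to a sentinel exit code where prctl fails, e.g. on non-Linux hosts or inside containers whose runtime already installed a seccomp filter.

```python
# Demonstrate seccomp strict mode: after prctl(PR_SET_SECCOMP,
# SECCOMP_MODE_STRICT), only read/write/exit/sigreturn are allowed;
# anything else terminates the process with SIGKILL.
import ctypes
import os
import signal

PR_SET_SECCOMP = 22
SECCOMP_MODE_STRICT = 1

def run_child():
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        rc = libc.prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0)
    except (OSError, AttributeError):
        rc = -1
    if rc != 0:
        os._exit(42)                 # seccomp unavailable here; bail out
    os.write(1, b"write() still allowed\n")  # permitted syscall
    os.open("/tmp", os.O_RDONLY)     # forbidden syscall -> SIGKILL
    os._exit(0)                      # not reached (exit_group is forbidden too)

pid = os.fork()
if pid == 0:
    run_child()
else:
    _, status = os.waitpid(pid, 0)
    killed = os.WIFSIGNALED(status) and os.WTERMSIG(status) == signal.SIGKILL
    unavailable = os.WIFEXITED(status) and os.WEXITSTATUS(status) == 42
    print("child killed by SIGKILL:", killed,
          "| seccomp unavailable:", unavailable)
```

Note that in strict mode even a clean exit is tricky: glibc's _exit uses the exit_group syscall, which is not on the allowed list, so the child dies by SIGKILL no matter which forbidden syscall it reaches first.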

---