proj-oot-ootAssemblyNotes18

---

should probably read at least

https://en.wikipedia.org/wiki/RISC-V

---

for OSs, hypervisors, etc, some thoughts:

" the ecall instruction, the common instruction used to call the next most privileged layer in the processor. For example, an ecall executed at the User layer will likely be handled at the Supervisor layer. An ecall executed at the Supervisor layer will be handled by the Machine layer. (This is a simplistic explanation that can get more complex with trap redirection, but we won't dive into those waters at this moment).

So, when the Supervisor (Linux kernel) executes ecall, the Machine layer's trap handler is executed in Machine mode. The code can be found in the riscv-pk at trap 9, the mcall_trap function, in machine/mtrap.c. ... The RISC-V privilege model was initially designed as an ecosystem that consists of four separate layers of privilege: User, Supervisor, Hypervisor, and Machine. The User privilege layer is, of course, the least privileged layer, where common applications are executed. Supervisor is the privilege layer where the operating system kernel (such as Linux, mach, or Amoeba) lives. The Hypervisor layer was intended to be the layer at which control subsystems for virtualization would live, but has been deprecated in more recent versions of the privilege specification. ... Each privilege layer is presumed to be isolated from all lower privileged layers during code execution, as one would expect. The CPU itself ensures that registers attributed to a specific privilege layer cannot be accessed from a less privileged layer. Thus, as a policy, Supervisor layer code can never access Machine layer registers. This segmentation helps guarantee that the state of each layer cannot be altered by lower privileged layers.

However, the original privilege specification defined memory protection in two separate places. First, the mstatus register's VM field defines what memory protection model shall be used during code execution. "

-- http://blog.securitymouse.com/2017/04/the-risc-v-files-supervisor-machine.html

to return after an ecall, "sret (supervisor exception return)"
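
a minimal sketch of what the User side of this convention looks like (my illustration, not from the blog post; this is the standard riscv64 Linux syscall ABI: syscall number in a7, args in a0..a5, ecall traps up to the Supervisor, which eventually returns control via sret with the result in a0):

  /* sketch: calling up one privilege level from User mode via ecall,
     using the riscv64 Linux syscall convention. Compile with a riscv64
     gcc; illustrative only, not a full libc. */
  static inline long syscall3(long n, long arg0, long arg1, long arg2) {
      register long a7 __asm__("a7") = n;     /* syscall number */
      register long a0 __asm__("a0") = arg0;  /* first arg; also the return value */
      register long a1 __asm__("a1") = arg1;
      register long a2 __asm__("a2") = arg2;
      __asm__ volatile("ecall"                /* trap to the next privilege layer */
                       : "+r"(a0)
                       : "r"(a7), "r"(a1), "r"(a2)
                       : "memory");
      return a0;  /* execution resumes here after the kernel's sret */
  }

eg syscall3(64, 1, (long)"hi\n", 3) would invoke write(2) (64 is SYS_write in the riscv64 Linux syscall table).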

---

where are my notes on that Princeton 'tricheck' paper by Trippel, Manerkar, Lustig, Pellauer, Martonosi, that claimed to find problems in RISC-V's memory consistency model/memory ordering/concurrency and that got some press around early 2017?

http://mrmgroup.cs.princeton.edu/papers/ctrippel_ASPLOS17.pdf

i recall skimming some portions of that paper, surely i took notes somewhere, particularly on their recommendations (search the PDF above for 'recommend'; also section 5.1.3 contains a recommendation without using that word) (the following are quotes):

anyways, in case i DIDN'T take notes, here are the notes.

what they are doing about it (forming a task group to revise the memory consistency model): https://riscv.org/2017/04/risc-v-memory-consistency-model/

may 10 talk: Status of the RISC-V Memory Consistency Model https://riscv.org/2017/05/6th-risc-v-workshop-proceedings/ https://riscv.org/wp-content/uploads/2017/05/Wed1000am-Memory-Model-Lustig.pdf https://youtu.be/E5s54AVGV2E

mailing list search: https://groups.google.com/a/groups.riscv.org/forum/#!searchin/isa-dev/memory$20consistency$20model

their task group formation announcement on the mailing list, with details about design choices to be considered:

https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/Oxm_IvfYItY/discussion

google search: https://www.google.com/search?q=risc-v+Memory+consistency+Model&safe=active&client=ubuntu&channel=fs&source=lnt&tbs=cdr%3A1%2Ccd_min%3Amay+2017%2Ccd_max%3A&tbm=

the task group (committee) issued a Memory Consistency Model Addendum 2.2 saying what to do in the meantime to be conservative:

https://docs.google.com/viewer?a=v&pid=forums&srcid=MDQwMTcyODgwMjc3MjQxMjA0NzcBMDUwNzQ0NzcxMjczNjI2NzQwNDEBczVCLTc5VWtCd0FKATAuMQFncm91cHMucmlzY3Yub3JnAXYy mailing list discussion: https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/-p9ch4V9bKM/discussion

the above searches were done on Oct 4 2017.

also, the May talk looks like a good intro to what sorts of things the issues are:

https://riscv.org/wp-content/uploads/2017/05/Wed1000am-Memory-Model-Lustig.pdf https://www.google.com/search?q=risc-v+Memory+consistency+Model&safe=active&client=ubuntu&channel=fs&source=lnt&tbs=cdr%3A1%2Ccd_min%3A5%2F28%2F2017%2Ccd_max%3A&tbm=


" There are ongoing efforts to specify memory models for multi- threaded programming in C, C++ [57] and other languages. These efforts are influenced by the type of memory models that can be supported efficiently on existing architectures like x86, POWER and ARM. While the memory model for x86 [46, 51, 54] is cap- tured succinctly by the Total Store Order (TSO) model, the models for POWER [52] and ARM [25] are considerably more complex. The formal specifications of the POWER and ARM models have required exposing microarchitectural details like speculative exe- cution, instruction reordering and the state of partially executed in- structions, which, in the past, have always been hidden from the user. ... SC [38] is the most intuitive memory model, but naive implemen- tations of SC suffer from poor performance. ... Instead the manufactures and researchers have chosen to present weaker memory model interfaces, e.g. TSO [58], PSO [61], RMO [61], x86 [46, 51, 54], Processor Consistency [30], Weak Consis- tency [24], RC [27], CRF [55], POWER [33] and ARM [9]. The tutorials by Adve et al. [1] and by Maranget et al. [44] provide re- lationships among some of these models. The lack of clarity in the definitions of POWER and ARM mem- ory models in their respective company documents has led some researchers to empirically determine allowed/disallowed behaviors [8, 25, 41, 52]. Based on such observations, in the last several years, both axiomatic models and operational models have been devel- oped which are compatible with each other [3–5, 7, 8, 25, 41, 52, 53]. However, these models are quite complicated; for example, the POWER axiomatic model has 10 relations, 4 types of events per in- struction, and 13 complex axioms [41], some of which have been added over time to explain specific behaviors [4–6, 41]. The ab- stract machines used to describe POWER and ARM operationally are also quite complicated, because they require the user to think in terms of partially executed instructions [52, 53]. ... Adve et al. defined Data-Race-Free-0 (DRF0), a class of pro- grams where shared variables are protected by locks, and proposed that DRF0 programs should behave as SC [2]. Marino et al. im- proves DRF0 to the DRFx model, which throws an exception when a data race is detected at runtime [45]. However, we believe that architectural memory models must define clear behaviors for all programs, and even throwing exceptions is not satisfactory enough.

A large amount of research has also been devoted to specifying the memory models of high-level languages, e.g. C/C++ [12–15, 17, 34, 35, 37, 49, 57] and Java [18, 20, 23, 42, 43]. There are also proposals not tied to any specific language [19, 22]. This remains an active area of research because a widely accepted memory model for high-level parallel programming is yet to emerge, while this paper focuses on the memory models of underlying hardware "

-- An Operational Framework for Specifying Memory Models using Instantaneous Instruction Execution

---

from [1] :

hierarchy of common memory consistency model strengths, from strongest to weakest:

a "woefully incomplete" characterization of these memory consistency models: consider reorderings of the following instruction pairs: load/load, load/store, store/load, store/store:

SEQUENTIAL CONSISTENCY:

" 1. All threads are interleaved into a single “thread” 2. The interleaved thread respects each thread’s original instruction ordering (“program order”) 3. Loads return the value of the most recent store to the same address, according to the interleaving

...

For performance, most processors weaken rule #2, and most weaken #1 as well.

...

Q: Can I think of an execution as an interleaving of the instructions in each thread (in some order)? ... That would make it illegal to forward values from a store buffer!.. Because with a store buffer, cores can read their own writes “early”.

Option 1: forbid store buffer forwarding, keep a simpler memory model, sacrifice performance

Option 2: change the memory model to allow store buffer forwarding, at the cost of a more complex model

Nearly all processors today choose #2

Q: Can I think of an execution as an interleaving of the instructions in each thread (in some order), with an exception for store buffer forwarding?

A:

example of the exception for store buffer forwarding:

2 CPUs. Each CPU has 2 threads: threads 1 and 2 on CPU 1, threads 3 and 4 on CPU 2. Thread 1 (on CPU 1) stores a value to memory location A, then Thread 2 reads from memory location A. Starting at about the same time, Thread 3 (on CPU 2) stores a different value to memory location A, then Thread 4 reads from memory location A. If the time it takes for these stores to propagate between CPUs is short, Thread 2 perceives an ordering on which Thread 1's store came before Thread 3's store, but Thread 4 perceives the opposite ordering. So, in this case, there is no interleaving perceived by all threads.
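
here's the closely related classic IRIW ('independent reads of independent writes') litmus test as a C11 sketch (mine, not from the source; it uses two different locations rather than one location plus shared store buffers, but it shows the same 'no single interleaving perceived by all threads' phenomenon; assumes C11 <threads.h>):

  /* IRIW litmus test sketch. With memory_order_relaxed, C11 permits
     r1==1 && r2==0 && r3==1 && r4==0: the two readers disagree about
     the order of the two independent stores, so no single interleaving
     explains the execution. Only seq_cst on all accesses forbids it. */
  #include <stdatomic.h>
  #include <threads.h>
  #include <stdio.h>

  atomic_int x, y;              /* both initially 0 */
  int r1, r2, r3, r4;

  int writer_x(void *arg) { atomic_store_explicit(&x, 1, memory_order_relaxed); return 0; }
  int writer_y(void *arg) { atomic_store_explicit(&y, 1, memory_order_relaxed); return 0; }

  int reader_xy(void *arg) {    /* reads x first, then y */
      r1 = atomic_load_explicit(&x, memory_order_relaxed);
      r2 = atomic_load_explicit(&y, memory_order_relaxed);
      return 0;
  }
  int reader_yx(void *arg) {    /* reads y first, then x */
      r3 = atomic_load_explicit(&y, memory_order_relaxed);
      r4 = atomic_load_explicit(&x, memory_order_relaxed);
      return 0;
  }

  int main(void) {
      thrd_t t[4];
      thrd_create(&t[0], writer_x, NULL);
      thrd_create(&t[1], writer_y, NULL);
      thrd_create(&t[2], reader_xy, NULL);
      thrd_create(&t[3], reader_yx, NULL);
      for (int i = 0; i < 4; i++) thrd_join(t[i], NULL);
      /* a single run will almost never show the weak outcome;
         litmus-testing tools run shapes like this millions of times */
      printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
      return 0;
  }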

---

" Memory Model Landscape

Sequential Consistency (SC)

Total Store Order (TSO)

Weaker memory models

...

Architects find SC & TSO constraining

(((but))) Programmers hate weak memory models (((because)))...

Difficult to understand, implementation-driven weak memory models ARM, POWER, RMO, Alpha, etc....

" -- [3]

---

it may be useful to look at what was debated in the RISC-V memory consistency model task group, with the heuristic that these items are the 'unsolved' questions in the field, eg things that RISC-V may get wrong, eg complexities to try to stay away from in OVM:

" Items on the agenda currently include, in rough priority order:

...

" PENDING/POSSIBLE CHANGES TO THE MODEL

Feature: Status

Multi-copy atomicity: Major debate!

Enforce same-address ordering (including load-load pairs): Required! (((?? but see subsequent May 13 mailing list post https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/-p9ch4V9bKM/Ah1Jb_9-BQAJ "More than likely we won't require a fence between two loads if the second is address-dependent on the first...(((but))) I don't think we're quite ready to commit to anything absolute yet.")))

Forbid load-store reordering (for accesses to different addresses), Enforce ordering of address/control/data-dependent instructions, Which FENCE types? (.pr, .pw, .sr, .sw? Other?): Still sorting out the details!

IT’S ALWAYS SAFE TO BE CONSERVATIVE

" More than likely we won't require a fence between two loads if the second is address-dependent on the first. Practically speaking, a lot of software (notably Linux) basically assumes that hardware always guarantees this to work, because no major architecture since Alpha has relaxed such orderings.

However, we don't yet have 100% consensus on how exactly to formalize this in the task group, and whether address, control, and data dependencies are all equivalent in strength, whether they apply equally well to read-read vs. read-write orderings, or even whether all of the above should even be enforced. So I don't think we're quite ready to commit to anything absolute yet. " -- [6]

"

"

– Architects should pay careful attention to aggressive memory access reordering, aggressive cache coherence protocols, and designs that share store buffers between threads.

– Hardware should respect all same-address orderings (including load-load pairs) and any orderings established by address, control, and data dependencies.

C/C++ Construct: Base ISA Mapping | ‘A’ Extension Mapping

Non-atomic Load...

atomic_load(memory_order_consume): ld; fence r,rw
atomic_load(memory_order_acquire): ld; fence r,rw
atomic_load(memory_order_seq_cst): fence rw,rw; ld; fence r,rw

Non-atomic Store...

atomic_store(memory_order_relaxed): sd
atomic_store(memory_order_release): fence rw,w; sd | amoswap.rl
atomic_store(memory_order_seq_cst): fence rw,rw; sd | fence rw,rw; amoswap

Fences

atomic_thread_fence(memory_order_acquire): fence r,rw
atomic_thread_fence(memory_order_release): fence rw,w
atomic_thread_fence(memory_order_acq_rel): fence rw,rw
atomic_thread_fence(memory_order_seq_cst): fence rw,rw

Furthermore, we recommend compiler writers avoid fences weaker than fence r,rw, fence rw,w, and fence rw,rw until the memory model clarifies their semantics. Additionally, while AMOs with both the aq and rl bits set do imply both aq and rl semantics, we recommend against their use until the memory model clarifies their combined semantics.

" -- RISC-V Memory Consistency Model Addendum 2.2

" Weak memory models: Technical issues

Atomic memory systems

(((is the following an example of an atomic memory system, or an example of a NON-atomic memory system?)))

Consensus: RISC-V memory model definition will rely only on atomic memory " -- [9]

from https://www.bsc.es/sites/default/files/public/u1810/arvind_0.pdf :

"Example: Ld-St Reordering Permitting a store to be issued to the memory before previous loads have completed, allows load values to be affected by future stores in the same thread" For example,

Process 1: r1 = Load(a) Store(b,1)

Process 2: r2 = Load(b) Store(a,r2)

Load-store reordering would allow the '1' stored by Process 1 into b to be loaded into r2 by process 2's load, and then stored into a by process 2's store, and then loaded into r1 from a by process 1! Implementation-wise, what could happen is: process 1's store to b does not depend on its load of a, so the store can be issued to memory while the load is still outstanding; process 2 then reads b=1 and stores 1 into a, and process 1's still-pending load finally reads a=1.

" Load-Store Reordering

Nvidia says it cannot do without Ld-St reordering

Although IBM POWER memory model allows this behavior, the server-end POWER processors do not perform this reordering for reliability, availability and serviceability (RAS) reasons

MIT opposes the idea because it complicates both the operational and axiomatic definitions, and MIT estimates no performance penalty in disallowing Ld-St reordering

Nevertheless MIT has worked diligently to come up with a model that allows Ld-St ordering (((perhaps e meant REordering? also, MIT's primary proposed model, WMM, prohibits load-store reordering, so note that they are talking about their 'Model X' here, which is detailed later in the slides))) " -- [10]

" C++ operations, WMM instructions Non-atomic Load / Load Relaxed: Ld Load Consumed / Load Acquire: Ld; Reconcile Load SC: Commit; Reconcile; Ld; Reconcile Non-atomic Store / Store Relaxed: St Store Released /Store SC: Commit; St

Compilation from C++11 to WMM C++11 introduces atomic variables in addition to the ordinary (non-atomic) ones

" RISC-V memory model debate is not settled; in spite of lot of research by the Memory Model Committee (Chair Dan Lustig), the community may vote for TSO "

---

figure 1 from [11], broken into parts, and with some details and notes omitted:

Operational model, Axiomatic model:

this suggests that we should restrict our attention to:

which are the ones which have both simple operational and simple axiomatic models

"reasoning (((about))) partially executed instructions...is unavoidable for ARM and POWER operational definitions." [12]

the rest of figure 1 from [13], for these rows only, with some details and notes omitted:

Store atomicity, Allow shared write-through cache/shared store buffer, Instruction reorderings, Ordering of data-dependent loads:

(note: this group's alternative proposal, WMM-S, for which the operational model complexity was 'medium' and no axiomatic model was provided, has non-atomic store atomicity; i think that one point that the authors may be trying to make is that you want at least multi-copy atomicity for clean semantics; in the paper's conclusion they say "Since there is no obvious evidence that restricting to multi-copy atomic stores affects performance or increases hardware complexity, RISC-V should adopt WMM in favor of simplicity.". Elsewhere in the table, in row 'ARM and POWER', store atomicity is classified as 'Non-atomic', although note that [14] says that ARMv8.2 is "(other-/weak-)multi-copy atomic" as opposed to POWER and GPU which are "not multi-copy atomic"). Indeed [15] says "The...manuals for ARMv7 and early ARMv8 described a relaxed memory model, with programmer-visible out-of-order and speculative execution, that was non-multicopy-atomic... The ARMv8 architecture has therefore been revised: it now has a multicopy-atomic model."

multi-copy atomicity is defined here:

" In this paper we propose two weak memory models for RISC-V: WMM and WMM-S, which balance definitional simplicity and implementation flexibility differently. The difference between the two models is regarding store atomicity, which is often classified into the following three types [19]:

The abstract of [17] defines "non-multicopy-atomic" as "writes could become visible to some other threads before becoming visible to all".

So this suggests that we should restrict our attention to:

which have the common characteristics of at least the following model strengths:

and of which at least one of which permits the following model weaknesses:

---

more background on WMM; the paper says

" The memory model for RISC-V, a newly developed open source ISA, has not been finalized yet and thus, offers an opportunity to evaluate existing memory models. We believe RISC-V should not adopt the memory models of POWER or ARM, because their axiomatic and operational definitions are too complicated. We propose two new weak memory models: WMM and WMM-S, which balance definitional simplicity and implementation flexibility differently. Both allow all instruction reorderings except overtaking of loads by a store. We show that this restriction has little impact on performance and it considerably simplifies operational definitions. It also rules out the out-of-thin-air problem that plagues many definitions. WMM is simple (it is similar to the Alpha memory model), but it disallows behaviors arising due to shared store buffers and shared write-through caches (which are seen in POWER processors). WMM-S, on the other hand, is more complex and allows these behaviors. " -- https://arxiv.org/pdf/1707.05923.pdf

---

Instantaneous Instruction Execution (I2E): a formalism invented by the MIT group working on the RISC-V memory consistency model, for formalizing memory consistency models

"

SC in I2E:

TSO in I2E:

Simple and vendor-independent

TSO allows loads to overtake stores " -- [18]

---

so summary of the last few sections:

---

to further summarize the previous summary:

memory ordering consistency models to think about appear to include:

All of SC, TSO, and WMM have the following in common:

Unless we go with SC only, most of the other CPU-style memory ordering consistency models seem to permit store-load reordering. Whether any dependencies are ordered varies between models, and so perhaps should be assumed to be weak.

Yes, a seq cst fence instruction is probably useful. Yes, a seq cst load and a seq cst store are probably useful.

---

later updates on the RISC-V memory ordering consistency model:

171128 status update by Dan Lustig and the Memory Model TG

https://content.riscv.org/wp-content/uploads/2017/12/Tue0954-RISC-V_Memory_Model-Lustig.pdf

the debate was between: Strong Models (e.g., x86-TSO) and Weak Models (e.g., ARM, IBM Power); initial proposals narrowed down to a strong one, RVTSO (RISC-V TSO, similar to SPARC, x86) and a weak one, RVWMO (RISC-V Weak Memory Ordering, similar to ARMv8); Both are multi-copy atomic, so both are simpler than IBM Power and ARMv7.

The decision was to adopt RVWMO (and to offer TSO as an option, 'Ztso'); toolchain like Linux, gcc, bintools will target RVWMO.

ld.rl and sd.aq are deprecated. ld.aqrl and sd.aqrl mean RCsc ((release consistency with sequential consistency)), not fully fenced.

in both RVWMO and RVTSO we have:

RVWMO RULES IN A NUTSHELL

Other than the above, A guaranteed to happen before B only if one of:

RVTSO RULES IN A NUTSHELL (these are strictly stronger than RVWMO)

Other than the above, A guaranteed to happen before B only if one of:

mailing list thread, Dec 1, with discussion of the above slide presentation, and a memory-model-spec.pdf (everything in that document appears to have been copied into the RISC-V spec Github, see below regarding links to https://github.com/riscv/riscv-isa-manual , at a later date, so i'd read that instead of this): https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM

The main topic in the list discussion is that some participants strongly support dropping any standardization of a TSO extension, and strongly prefer WMO to TSO. An argument for WMO over TSO is [20]. An argument for not having any sort of officially blessed TSO extension is [21] and [22]; in sum, they argue that, if some chips are TSO, some software writers will happen to have these chips and so will test their programs on these chips, and then they won't realize, or won't care, that their software is TSO-dependent; then later devs for the same software will be forced to only buy TSO-compliant chips; since WMO-compliant software can run on TSO-compliant chips but TSO-dependent software cannot run on WMO chips, the TSO-compliant chip market will be entrenched; there is a positive feedback loop where, if such chips become common, then even eg GCC maintainers or writers of important libraries might have them, and so will try to write WMO-compliant software but fail, and so in the long run the TSO-compliant chips may even dominate the WMO-compliant chip market; they argue that it's better to not bless TSO with an official extension because in many contexts writing software known to depend on nonstandard vendor-specific extensions is less acceptable than writing software meeting a standard.

mb see also https://github.com/riscv/riscv-isa-manual/blob/master/src/memory.tex , which was updated Dec 13, and section Memory Consistency Model in https://github.com/riscv/riscv-isa-manual/blob/master/src/rv32.tex (probably easier just to download and compile the spec and browse the resulting PDF; these parts are Appendix A, memory consistency model, and section "Memory Consistency Model" of Chapter RV32I Base Integer Instruction Set)

from that spec/manual:

" RISC-V Instruction Memory Accesses: l{b

s{blr load lr.aq load-acquire-RCpc lr.aqrl load-acquire-RCsc lr.rl (deprecated) sc store sc.rl store-release-RCpc sc.aqrl store-release-RCsc sc.aq (deprecated) amo<op> load; <op>; store amo<op>.aq load-acquire-RCpc; <op>; store amo<op>.rl load; <op>; store-release-RCpc amo<op>.aqrl load-SC; <op>; store-SC
hwd} load ∗ (∗ : possibly multiple if misaligned)
hwd} store ∗ (∗ : possibly multiple if misaligned)

...

Definition of the RVWMO Memory Model (a lengthy section with a lot of rules)...

Definition of the RVTSO Memory Model

RISC-V cores which implement Ztso impose RVTSO onto all memory accesses. RVTSO behaves just like RVWMO but with the following modifications:

l{b|h|w|d|r} instructions behave as if .aq is set
s{b|h|w|d|c} instructions behave as if .rl is set

These rules render PPO rules 1 and 8–16 redundant. They also make redundant any non-I/O fences that do not have both .pw and .sr set. Finally, they also imply that all AMO instructions are fully-fenced; nothing will be reordered past an AMO.

"

(note: the opcodes to actually do the above don't exist yet, see [23]; this, combined with a desire for speed (not delaying the TSO memory model until those opcodes are added) appears to be the critical reason why a standard RVTSO is even included, according to that mailing list message)

" A.6 Code Porting Guidelines Normal x86 loads and stores are all inherently acquire and release operations: TSO enforces all load-load, load-store, and store-store ordering by default. All TSO loads must be mapped onto l{b

or onto fence rw,w; s{bl{bibility in case such instructions are added to the ISA one day. However, in the meantime, the assembler will generate the same fence-based and/or amoswap-based versions for these pseudoin- structions. x86 atomics using the LOCK prefix are all sequentially consistent and when ported naively to RISC-V must be marked as .aqrl. A Power sync/hwsync fence, an ARM dmb fence, and an x86 mfence are all equivalent to a RISC-V fence rw,rw. Power isync and ARM isb map to RISC-V fence.i. A Power lwsync map onto fence.tso, or onto fence rw,rw when fence.tso is not available. ARM dmb ld and dmb st fences map to RISC-V fence r,rw and fence w,w, respectively.
hwd}; fence r,rw, and all TSO stores must either be mapped onto amoswap.rl x0
hwd}. Alternatively, TSO loads and stores can be mapped onto
hwd}.aq and s{bhwd}.rl assembler pseudoinstructions to facilitate forwards compat-

A direct mapping of ARMv8 atomics that maps unordered instructions to unordered instructions, RCpc instructions to RCpc instructions, and RCsc instructions to RCsc instructions is likely to work in the majority of cases. Mapping even unordered load-reserved instructions onto lr.aq (particularly for LR/SC pairs without internal data dependencies) is an even safer bet, as this ensures C/C++ release sequences will be respected. However, due to a subtle mismatch between the two models, strict theoretical compatibility with the ARMv8 memory model requires that a naive mapping translate all ARMv8 store conditional and load-acquire operations map onto RISC- V RCsc operations. Any atomics which are naively ported into RCsc operations may revert back to the straightforward mapping if the programmer can verify that the code is not relying on an ordering from the store-conditional to the load-acquire (as this is not common).

The Linux fences smp mb(), smp wmb(), smp rmb() map onto fence rw,rw, fence w,w, and fence r,r, respectively. The fence smp read barrier depends() map to a no-op due to preserved pro- gram order rules 8–10. The Linux fences dma rmb() and dma wmb() map onto fence r,r and fence w,w, respectively, since the RISC-V Unix Platform requires coherent DMA. The Linux fences rmb(), wmb(), and mb() map onto fence ri,ri, fence wo,wo, and fence rwio,rwio, respectively.

The C11/C++11 memory order * primitives should be mapped as shown in Table A.1. The memory order acquire orderings in particular must use fences rather than atomics to ensure that release sequences behave correctly even in the presence of amoswap. The memory order release mappings may use .rl as an alternative.

C/C++ Construct RVWMO Mapping Non-atomic load l{b

atomic load(memory order relaxed) l{batomic load(memory order acquire) l{batomic load(memory order seq cst) fence rw,rw; l{bNon-atomic store s{batomic store(memory order relaxed) s{batomic store(memory order release) fence rw,w; s{batomic store(memory order seq cst) fence rw,rw; s{batomic thread fence(memory order acquire) fence r,rw atomic thread fence(memory order release) fence rw,w atomic thread fence(memory order acq rel) fence.tso atomic thread fence(memory order seq cst) fence rw,rw
hwd}
hwd}
hwd}; fence r,rw
hwd}; fence r,rw
hwd}
hwd}
hwd}
hwd}

Table A.1: Mappings from C/C++ primitives to RISC-V primitives.

It is also safe to translate any .aq, .rl, or .aqrl annotation into the fence-based snippets of Table A.2. These can also be used as a legal implementation of l{b

doinstructions for as long as those instructions are not added to the ISA.
hwd} or s{bhwd} pseu-

Ordering Annotation Fence-based Equivalent l{b

l{bs{bs{bamo<op>.aq amo<op>; fence r,rw amo<op>.rl fence rw,w; amo<op> amo<op>.aqrl fence rw,rw; amo<op>; fence rw,rw
hwdr}.aq l{bhwdr}; fence r,rw
hwdr}.aqrl fence rw,rw; l{bhwdr}; fence r,rw
hwdc}.rl fence rw,w; s{bhwdc}
hwdc}.aqrl fence rw,w; s{bhwdc}

Table A.2: Mappings from .aq and/or .rl to fence-based equivalents. An alternative mapping places a fence rw,rw after the existing s{b

l{b"
hwdc} mapping rather than at the front of the
hwdr} mapping.

note: so, as expected, looks like surrounding stuff with "fence rw,rw" is good enough, except for I/O, which requires fence rwio,rwio.

so, looks like our just providing one 'fence' instruction (at least initially) is sufficient.

---

so i haven't yet read and digested everything in the previous section with an eye towards how it should affect Oot. But some initial notes:

---

issues with MIPS:

tropo 126 days ago [-]

MIPS has numerous defects. There is a legacy wart in the form of a delay slot that doesn't match modern pipelines; this causes all sorts of annoyances. The MMU doesn't use a hardware-walked tree, cutting into performance with cache misses and even code execution. Forming addresses requires a silly number of instructions, or alternately you give up and just load relative to a specific register. The architecture fails to specify a coherent fully physical cache, causing all sorts of performance-killing trouble in OS kernels. There are wasted bits, commonly in the "shamt" field. The "hi" and "lo" registers interfere with scheduling multiplication and division.

bobsam 126 days ago [-]

Are you familiar with the newer mips revisions?

They have been modernizing the architecture during the last 10 years.

---

"

The basic difference among RISC ISAs is the load instruction addressing because data has to be loaded first before being used. Therefore the key is access to memory, then data can be used in computation using the simple add, subtract, multiply, divide, and, or, xor instructions. "

---

"Bitcoin Script has some drawbacks. Many operations were disabled by Bit- coin’s creator, Satoshi Nakamoto [21]. This has left Bitcoin Script with a few arithmetic (multiplication was disabled), conditional, stack manipulation, hashing, and digital-signature verification operations.... All Bitcoin Script operations are pure functions of the machine state expect for the signature-verification operations. These signature-verification operations re- quire a hash of some of the transaction data. Together, this means the pro- gram’s success or failure is purely a function of the transaction data. Therefore, the person creating a transaction can know whether the transaction they have created is valid or not...Bitcoin Script is also amenable to static analysis, which is another desirable property. The digital-signature verification operations are the most expensive operations. Prior to execution, Bitcoin counts the number of occurrences of these operations to compute an upper bound on the number of expensive calls that may occur. Programs whose count exceeds a certain threshold are invalid"

---

should probably read this: https://blog.lizzie.io/linux-containers-in-500-loc.html

also interesting note in the discussion:

Bromskloss 1 hour ago [-]

She mentions five Linux kernel mechanisms – "namespaces", "capabilities", "cgroups", and "setrlimit". Is any of those what I should use if I want to run an application inside some kind of container that lets me intercept file system calls (for example for the purpose of creating a file on the fly as it is accessed)?


simcop2387 1 hour ago [-]

Seccomp with ptrace is the way I'd do this. You can setup the rules to signal the ptracing process to intercept the syscall. I've not done it before but it should be possible. Id also look at doing it in a mount namespace with overlayfs on top of everything the process can see, so that you can manipulate anything you want or need filewise without destroying the original system. Then you can copy out any changed files later if you want to preserve them.


---

http://en.cppreference.com/w/c/atomic

has

atomic_flag_test_and_set, atomic_flag_test_and_set_explicit (C11): sets an atomic_flag to true and returns the old value (function)
atomic_flag_clear, atomic_flag_clear_explicit (C11): sets an atomic_flag to false (function)
atomic_init (C11): initializes an existing atomic object (function)
atomic_is_lock_free (C11): indicates whether the atomic object is lock-free (function)
atomic_store, atomic_store_explicit (C11): stores a value in an atomic object (function)
atomic_load, atomic_load_explicit (C11): reads a value from an atomic object (function)
atomic_exchange, atomic_exchange_explicit (C11): swaps a value with the value of an atomic object (function)
atomic_compare_exchange_strong, atomic_compare_exchange_strong_explicit, atomic_compare_exchange_weak, atomic_compare_exchange_weak_explicit (C11): swaps a value with an atomic object if the old value is what is expected, otherwise reads the old value (function)
atomic_fetch_add, atomic_fetch_add_explicit (C11): atomic addition (function)
atomic_fetch_sub, atomic_fetch_sub_explicit (C11): atomic subtraction (function)
atomic_fetch_or, atomic_fetch_or_explicit (C11): atomic logical OR (function)
atomic_fetch_xor, atomic_fetch_xor_explicit (C11): atomic logical exclusive OR (function)
atomic_fetch_and, atomic_fetch_and_explicit (C11): atomic logical AND (function)
atomic_thread_fence (C11): generic memory order-dependent fence synchronization primitive (function)
atomic_signal_fence (C11): fence between a thread and a signal handler executed in the same thread (function)

atomic_bool: _Atomic _Bool
atomic_char: _Atomic char
atomic_schar: _Atomic signed char
atomic_uchar: _Atomic unsigned char
atomic_short: _Atomic short
atomic_ushort: _Atomic unsigned short
atomic_int: _Atomic int
atomic_uint: _Atomic unsigned int
atomic_long: _Atomic long
atomic_ulong: _Atomic unsigned long
atomic_llong: _Atomic long long
atomic_ullong: _Atomic unsigned long long
atomic_char16_t: _Atomic char16_t
atomic_char32_t: _Atomic char32_t
atomic_wchar_t: _Atomic wchar_t
atomic_int_least8_t: _Atomic int_least8_t
atomic_uint_least8_t: _Atomic uint_least8_t
atomic_int_least16_t: _Atomic int_least16_t
atomic_uint_least16_t: _Atomic uint_least16_t
atomic_int_least32_t: _Atomic int_least32_t
atomic_uint_least32_t: _Atomic uint_least32_t
atomic_int_least64_t: _Atomic int_least64_t
atomic_uint_least64_t: _Atomic uint_least64_t
atomic_int_fast8_t: _Atomic int_fast8_t
atomic_uint_fast8_t: _Atomic uint_fast8_t
atomic_int_fast16_t: _Atomic int_fast16_t
atomic_uint_fast16_t: _Atomic uint_fast16_t
atomic_int_fast32_t: _Atomic int_fast32_t
atomic_uint_fast32_t: _Atomic uint_fast32_t
atomic_int_fast64_t: _Atomic int_fast64_t
atomic_uint_fast64_t: _Atomic uint_fast64_t
atomic_intptr_t: _Atomic intptr_t
atomic_uintptr_t: _Atomic uintptr_t
atomic_size_t: _Atomic size_t
atomic_ptrdiff_t: _Atomic ptrdiff_t
atomic_intmax_t: _Atomic intmax_t
atomic_uintmax_t: _Atomic uintmax_t
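
to make the above concrete, the classic use of atomic_flag is a spinlock; a minimal sketch of standard C11 usage (mine, not from cppreference):

  #include <stdatomic.h>

  static atomic_flag lock = ATOMIC_FLAG_INIT;

  void spin_lock(void) {
      /* acquire ordering keeps the critical section from floating above the lock */
      while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
          ;  /* spin until the old value was false */
  }

  void spin_unlock(void) {
      /* release ordering publishes the critical section's writes */
      atomic_flag_clear_explicit(&lock, memory_order_release);
  }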

---

sequentially consistent accesses are broken in C++ concurrency!:

" Our model supports all features of C++ concurrency except con- sume reads and SC accesses. Consume reads are widely considered a premature aspect of the C++11 standard and are currently im- plemented the same as acquire reads in mainstream compilers. In contrast, SC accesses are a major feature of C++, and originally our model included an account of SC accesses as well. However, in the course of trying to mechanize correctness of compilation to Power (§5.3), we discovered that our semantics of SC accesses was flawed, and this led us to discover a flaw in the C++11 standard as well! (See [ 19 ] for further details.) Thus, a proper handling of SC accesses remains an open and important problem for future work.

19: Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer. Repairing sequential consistency in C/C++11. Technical Report MPI-SWS-2016-011, MPI-SWS, November 2016. "

---

https://people.mpi-sws.org/~dreyer/papers/promising/paper.pdf

" In this paper, we present what we believe is a very promising way forward: the first relaxed memory model to support a broad spectrum of features from the C++ concurrency model while also satisfying all three criteria listed in §1.1. We achieve these ends through a combination of mechanisms (some standard, some not), but the most important and novel idea for the reader to take away from this paper is the notion of promises. Under our model, which is defined by an operational semantics, a thread T may nondeterministically “promise” to write a value v to a memory location x at some point in the future. From the point of view of other threads, a promise is no different from an ordinary write: once T has promised to write v to x , other threads can read from that write. (In contrast, T cannot read from its own promised write until T has fulfilled the promise: this is crucial to preserve basic sanity of the semantics.) Intuitively, promises simulate the effect of read-write reorderings by allowing write events to be visible to other threads before the point at which they occur in the program order. We must, however, ensure that promises do not introduce bad OOTA (((out-of-thin-air))) behaviors. Toward this end, we only allow T to promise to write v to x if it is possible to thread-locally certify that the promise can be fulfilled in a finite number of steps. That is, we must show that T will be able to write v to x after some finite sequence of steps of T ’s execution (i.e., with no help from other threads). The certification requirement guarantees absence of bad OOTA executions by ensuring that T can only promise to write a value v to x if T could have written v to x anyway.

...

Our model supports all features of C++ concurrency except consume reads and SC accesses. Consume reads are widely considered a premature aspect of the C++11 standard and are currently implemented the same as acquire reads in mainstream compilers

(((and SC accesses, although important, are wrong in C++ too, see above section)))

...

all the existing implementations of C++, even for weaker architectures like Power and ARM, guarantee at a bare minimum a property we call per-location coherence (aka SC-per-location). Per-location coherence says that, even though threads may observe writes to different locations in different orders, they must observe writes to the same location in a single total order (called the “modification order” in C++ lingo). In addition to being supported by hardware, per-location coherence is preserved by common compiler optimizations as well. Hence, we want our semantics of relaxed accesses to guarantee it. (In §4.3 we will present an even weaker mode of accesses that does not provide full per-location coherence.)

...

Both Java and C++ fail to achieve some of these criteria. In the case of Java, the memory model fails to validate a number of common program transformations performed by real Java compilers, such as redundant read-after-read elimination and “roach motel” reordering [26]. Although this problem has been known for some time, a satisfactory solution has yet to be developed.

In the case of C++, the memory model relies crucially on undefined behaviors to give semantics to racy programs. Moreover, it permits certain “out-of-thin-air” executions which violate basic invariant-based reasoning (and DRF guarantees) [7]. "
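
for concreteness, here's the per-location-coherence shape mentioned in the quote above (the CoRR litmus test) as a C11 sketch (mine, not from the paper):

  /* CoRR ("coherence of read-read") sketch: per-location coherence means
     that even with memory_order_relaxed, if the reader sees r1==1 it cannot
     then see r2==0, because writes to the single location x have one total
     modification order. C11 guarantees this, as do mainstream CPUs. */
  #include <stdatomic.h>
  #include <threads.h>

  atomic_int x;           /* initially 0 */
  int r1, r2;

  int writer(void *arg) {
      atomic_store_explicit(&x, 1, memory_order_relaxed);
      return 0;
  }

  int reader(void *arg) {
      r1 = atomic_load_explicit(&x, memory_order_relaxed);
      r2 = atomic_load_explicit(&x, memory_order_relaxed);
      /* forbidden outcome: r1==1 && r2==0 (reading "new then old") */
      return 0;
  }

  int main(void) {
      thrd_t t1, t2;
      thrd_create(&t1, writer, NULL);
      thrd_create(&t2, reader, NULL);
      thrd_join(t1, NULL);
      thrd_join(t2, NULL);
      return 0;
  }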

---

How does the memory model in https://people.mpi-sws.org/~dreyer/papers/promising/paper.pdf compare to the WMM and WMM-S memory models given in https://arxiv.org/pdf/1707.05923.pdf ? Are there other research memory models out there that are competitive with these?

---

https://github.com/rmccullagh/como-lang-ng/blob/master/como_opcode.c

's opcodes:

const char * const str_opcodelist[] = { "INONE", "LOAD_CONST", "STORE_NAME", "LOAD_NAME", "IS_LESS_THAN", "JZ", "IPRINT", "IADD", "JMP", "IRETURN", "NOP", "LABEL", "HALT" };

sounds good...

smallvm's:

  #define OP_SET 0x01   /* Sets a register to a value or to the contents of another register */
  #define OP_ADD 0x02   /* Adds two values or register contents */
  #define OP_SUB 0x03   /* Subtracts two values or register contents */
  #define OP_MULT 0x04  /* Multiplies two values or register contents */
  #define OP_DIV 0x05   /* Divides two values or register contents */
  #define OP_MOD 0x06   /* Mods two values or register contents */
  #define OP_STORE 0x07 /* Stores a value into memory */
  #define OP_GET 0x08   /* Get a value from memory */
  #define OP_JMP 0x09   /* Jump to another location in memory */
  #define OP_IF 0x0A    /* Performs the next instruction if the values are equal */
  #define OP_IFN 0x0B   /* Performs the next instruction if the values are not equal */

sounds good...
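
for reference, a minimal sketch of the kind of dispatch loop an opcode list like that implies (my sketch; the operand-byte encoding and the OP_HALT value are made up, not smallvm's actual format):

  #include <stddef.h>
  #include <stdint.h>

  enum { OP_HALT = 0x00, OP_SET = 0x01, OP_ADD = 0x02, OP_JMP = 0x09 };  /* subset */

  void run(const uint8_t *code, uint32_t reg[8]) {
      size_t pc = 0;
      for (;;) {
          uint8_t op = code[pc++];
          switch (op) {
          case OP_SET: { uint8_t r = code[pc++]; reg[r] = code[pc++]; break; }
          case OP_ADD: { uint8_t d = code[pc++], a = code[pc++], b = code[pc++];
                         reg[d] = reg[a] + reg[b]; break; }
          case OP_JMP: pc = code[pc]; break;       /* absolute jump target */
          case OP_HALT: return;
          default: return;                         /* unknown opcode: stop */
          }
      }
  }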

---

High-Performance Extendable Instruction Set Computing http://researchbank.rmit.edu.au/eserv/rmit:2517/n2001000381.pdf

"

" load/store instructions that reference the Stack Pointer and the Index Register tend to exhibit very different operand lengths. For Stack Pointer use, the offset needs to be more than 5 bits, while the majority (77%) of the Index Register load/store operations can utilize a 3-bit operand "

note: i haven't even skimmed this article yet, i just quickly glanced at it to see what it was about and happened to see those two quotes

---


"

The 6502 is nearly a RISC machine in number of machine cycles per instruction (about 2 average) yet has powerful addressing modes for table look-up-driven real-time software. The indirect, indexed addressing mode has yet to be beat by any RISC machine, which takes too many instructions to do the same thing. " -- [25]

" Indirect-indexed addressing

In this commonly used Addressing mode, the Y Index Register is used as an offset from the given zero page vector. The effective address is calculated as the vector plus the value in Y.

Indirect-indexed addressing is written as follows:

     LDY #$04
     LDA ($02),Y

In the above case, Y is loaded with four (4), and the vector is given as ($02). If zero page memory $02-$03 contains 00 80, then the effective address from the vector ($02) plus the offset (Y) would be $8004.

This addressing mode is commonly used in array addressing, such that the array index is placed in Y and the array base address is stored in zero page as the vector. Typically, the value in Y is calculated as the array element size multiplied by the array index. For single byte-sized array elements (such as character strings), the value in Y is the array index without modification. " [26]
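
here's the effective-address calculation from that quote as a C sketch (mine, just to make the worked example explicit):

  #include <stdint.h>

  /* C sketch of the 6502 indirect-indexed ("($zp),Y") effective-address
     calculation: read a 16-bit little-endian vector from zero page,
     then add Y. */
  uint16_t ea_indirect_indexed(uint8_t zp_addr, uint8_t y, const uint8_t *mem) {
      uint16_t vector = mem[zp_addr]
                      | ((uint16_t)mem[(uint8_t)(zp_addr + 1)] << 8);  /* zero page wraps */
      return (uint16_t)(vector + y);  /* e.g. $8000 + $04 = $8004 in the example above */
  }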

---

kuba, August 2011: In the nutshell: Propeller does have WAITCNT and WAITVID, on XS1 you have one WAIT instruction that you can use to wait for any combination of events from various peripherals. And that's only the beginning.

XS1 has a fairly powerful software-controlled interrupt vectoring system. The interrupt vectors are not permanently assigned to peripherals, like in many MCUs. Instead, you can assign any vector to any event-generating peripheral (I/O port, etc). When the event happens, the vector points to the next instruction to be scheduled for given thread. A vector is specific to a thread, so you have full thread affinity for responding to external events.

The classical problem of what to do if different events all reuse same interrupt handler (vector) is handled very nicely, too. Normally you have to interrogate status bits to know what happened, if the handler could be triggered by different things. On XS1, to each event source you assign a so-called environment vector. It's simply a data word that's available in your interrupt/event handler, and lets you adjust your logic according to interrupt source. You can use it as a bitmask, as a jump offset or table offset, or whatever suits your application. I haven't seen anything like that on any of the mainstream MCUs -- feel free to correct me if I'm wrong. You normally have to emulate this by setting up code to write a value somewhere, then jump to a common handler code. This costs precious cycles. On XS1, an event/interrupt handler can be done in a couple thread cycles -- say in 80ns. That's less than one clock period on some MCUs.

The major difference between XS1 and Propeller (P1 and P2 both) is that Propeller has no interrupt support at all. On XS1, event/interrupt support enables essentially free event-driven switch statements. You can wait on many things to happen, and there's no time penalty for that. Waiting on one event is no different from waiting on 10 events, in terms of latency. Of course if two things happen at the same time, you can't process them concurrently in the same thread, but at least your code doesn't get any slower from trying to wait on many things in the first place. This is IMHO a very sane design decision.

The difference between events and interrupts is fairly simple on XS1. An Event handler does not preserve the PC. You have to be within a WAIT instruction for an event to fire. It is like a hardware-driven switch statement. You have sole control of the execution path after you're done handling an event. An Interrupt does the usual automagical PC/status storage in registers dedicated for that purpose, so there's no memory access overhead for that.

---

one thought i (bayle) had from reading a bunch of other stuff on the forums on Propeller vs. XMOS MCUs:

---

"Some cores have significantly higher performance -- for example, the ARM Cortex-M4 has DSP instructions and usually floating-point, and the Cortex-M7 has cache IIRC.

---

westfw

Quote (((from brucehoult)))

    I've come to the conclusion that 16 bit *instructions* are very much in a sweet spot, either fixed size, or with a way to escape to the occasional 32 bit instruction.

Somewhat agree... " [27]

---

suggests that an 8-level stack may be okay for really low-level stuff (we're apparently talking about an MCU with only 256 bytes of data RAM though!):

" The HT66 feels quite similar in design to a Microchip PIC16: a 4-cycle single-accumulator RISC architecture, with an 8-level stack for saving the PC address, plus a banking arrangement used to address more than 256 bytes of memory. Unlike the PIC16, there’s a single 128-byte SFR set placed at the bottom of RAM. The remaining 128 bytes of addressable space are split into two banks to cover the 256-byte capacity this part has. The 63-instruction ISA is similar to the PIC16, but also includes bit manipulation instructions. "

---

people in MCUs talk about having vectored interrupts be a big deal. They also want different interrupt priority levels, like around 2 ( https://jaycarlson.net/pf/atmel-microchip-tinyavr-1-series/ ), 3 (AVR XMEGA, see https://jaycarlson.net/pf/atmel-microchip-tinyavr-1-series/ ) , 4 (ARM Cortex-M0 "core has a nested vector interrupt controller, with up to 32 interrupt vectors and 4 interrupt priorities"), 7 ("The PIC24 has a vectored exception system similar to ARM microcontrollers; there’s also a seven-priority interrupt controller with up to 118 interrupt sources") . lots of things have 16 GPRs, eg the PIC24. The PIC24 also has multiply and divide instructions.

---

https://blogs.msdn.microsoft.com/oldnewthing/20060706-12/?p=30623

Is the maximum size of the environment 32K or 64K?

A: Both.

The limit is 32,767 Unicode characters, which equals 65,534 bytes. Call it 32K or 64K as you wish, but make sure you include the units in your statement if it isn't clear from context.

---

"

One other thing to note is that GCC uses the normal convention for function calls: any call-saved registers the function needs will be pushed to the stack by the function and restored before returning. But there’s also a bunch of call-used registers available for user functions to clobber, which makes it easier to write assembly routines, and gives the compiler plenty of room for handling function locals.

This is normal if you come from PC or ARM development, but many MCU architectures and compilers don’t PUSH or POP registers at all; instead, specific registers (or RAM addresses) are set aside for specific functions. The advantage of GCC’s standard calling approach is simplicity, flexibility and the ability to support large projects efficiently — you also get reentrancy for free, which compilers like Keil’s C-51 require you to explicitly request when declaring the function. " [28]

---

" >3. Small general-purpose register file. My vote goes for 8 GP >registers + SP, i.e. one more register than x86.

Madness! :-) Apparently 16 is minimal for some graph coloring heuristics. " -- [29]

but

" In ARM mode, the ARM/Thumb is a RISC-like 32-bit processor with 16 registers. In Thumb mode, a compressed instruction encoding is used, with 16-bit instructions. Most instructions in Thumb mode are two-address, and can only access the first 8 registers." -- [30]

but

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.88.7342&rep=rep1&type=pdf

showed that with less than 12 registers their test algorithms take much more resources

but

i skimmed the first results of https://www.google.com/search?q=%22graph+coloring%22+%2216+registers%22 and didn't see anything about a requirement for 16, in fact many papers talked about both 8 and 16 registers and compared them.

---

so maybe we should have less special-purpose registers and more GPRs. If we did this, we could put the special-purpose registers under an instruction similar to SYSINFO.

Also, we probably want to add a (data) stack frame register in addition to the data stack register. Of course, this could be convention, unless we give it special addressing mode support.

As for addressing modes in Oot (recalling that Oot Boot doesn't support addressing modes), may want to add indirect indexed, which can be useful for referencing arrays. Alternately, instead of assuming a 'zero page' as indirect indexed does, you can just have an addressing mode which adds two registers together and uses the result as the effective addr (one register could hold a pointer and the other could hold an offset). Note that this means you have to split the operand in half. In addition, it might be useful to have an addressing mode which adds a constant to the contents of a register and uses the result as the effective addr; the register could hold a pointer and the constant could be an offset; again we have a split operand. Alternately, the whole operand could be a constant, and the register could implicitly be the stack frame register.
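
to pin down the candidate modes from the previous paragraph, here's how the effective-address calculations would look in a hypothetical interpreter (my sketch; the register file layout and REG_FRAME are made-up names, not Boot's actual design):

  #include <stdint.h>

  uint32_t reg[16];              /* hypothetical register file */
  enum { REG_FRAME = 14 };       /* hypothetical (data) stack frame register */

  /* register + register: the operand is split into two register fields */
  uint32_t ea_reg_plus_reg(unsigned rbase, unsigned rindex) {
      return reg[rbase] + reg[rindex];      /* pointer + offset */
  }

  /* register + constant: the operand is split into a register field and an immediate */
  uint32_t ea_reg_plus_const(unsigned rbase, int16_t offset) {
      return reg[rbase] + (int32_t)offset;  /* pointer + constant offset */
  }

  /* whole-operand constant, implicitly relative to the stack frame register */
  uint32_t ea_frame_relative(int16_t offset) {
      return reg[REG_FRAME] + (int32_t)offset;
  }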

just something to think about. i don't think that commentator was correct that 16 regs are crucial for good graph coloring, and it would be annoying to have to save an opcode or two for a 'pseudoregister' instruction(s).

On the other hand, which is more annoying: having to waste opcodes on pseudoregister instructions, or having some of the registers be 'special', necessitating extra checks in every addressing mode effective address calculation? hmm yeah maybe i'll take the special instructions, please. Otoh this would mean that using the PC as an offset to pointers becomes awkward. And we do want the stack pointers to be normal registers, right? But what about those addressing modes in Boot that don't accept stack pointers? Arg...

(tentatively added GETSTATE and SETSTATE)

---

https://stackoverflow.com/questions/1518711/how-does-free-know-how-much-to-free?rq=1

it would be nice to have an instruction to check the size of an allocation

(tentatively added MSIZE)
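
the usual answer from that stackoverflow question is that the allocator stashes a small header just before the block it hands out; here's a sketch of the idea (mine; not any particular libc's actual layout; an MSIZE primitive could just read such a header):

  #include <stdlib.h>
  #include <stdint.h>

  typedef struct { size_t size; } alloc_header;

  void *my_malloc(size_t n) {
      alloc_header *h = malloc(sizeof(alloc_header) + n);
      if (!h) return NULL;
      h->size = n;           /* record the usable size just before the block */
      return h + 1;          /* hand out the memory after the header */
  }

  size_t my_msize(void *p) { /* what an MSIZE instruction would return */
      return ((alloc_header *)p - 1)->size;
  }

  void my_free(void *p) {
      if (p) free((alloc_header *)p - 1);
  }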

---

some evidence for having:

" It was also designed for C compilers, too — with 32 registers available at all times, compilers can efficiently juggle around many operands concurrently; the 8051, by comparison, has four banks of eight registers that are only easily switched between within interrupt contexts (which is actually quite useful).

And interrupts are one of the weak points of the AVR core: there’s only one interrupt priority, and depending on the ISR, many registers will have to be pushed to the stack and restored upon exit. In my testing, this often added 10 PUSH instructions or more — each taking 2 cycles. "

---

"In addition to the normal CPU registers, Arm cores have 13 general-purpose working registers, which is roughly the sweet spot."

---

" The core has a nested vector interrupt controller, with up to 32 interrupt vectors and 4 interrupt priorities — plenty when compared to the 8-bit competition ... also has full support for runtime exceptions, which isn’t a feature found on 8-bit architectures. "

---

" One of the biggest problems with ARM microcontrollers (((compared to eg 8-bit and 16-bit MCUs))) is their low code density for anything other than 16- and 32-bit math — even those that use the 16-bit Thumb instruction set. This means normal microcontroller type routines — shoving bytes out a communication port, wiggling bits around, performing software ADC conversions, and updating timers — can take a lot of code space on these parts. Exacerbating this problem is the peripherals, which tend to be more complex — I mean “flexible” — than 8-bit parts, often necessitating run-time peripheral libraries and tons of register manipulation.

Another problem with ARM processors is the severe 12-cycle interrupt latency. When coupled with the large number of registers that are saved and restored in the prologue and epilogue of the ISR handlers, these cycles start to add up. ISR latency is one area where a 16 MHz 8-bit part can easily beat a 72 MHz 32-bit Arm microcontroller. ... Because of its small core and fast interrupt architecture, the 8051 architecture is extremely popular for managing peripherals used in real-time high-bandwidth systems, such as USB web cameras and audio DSPs, and is commonly deployed as a house-keeping processor in FPGAs used in audio/video processing and DSP work. "

---

" The STM8 core has six CPU registers: a single accumulator, two index registers, a 24-bit program counter, a 16-bit stack pointer, and a condition register. "

---

STM8:

" The claim to fame of the core is its comprehensive list of 20 addressing modes, including indexed indirect addressing and stack-pointer-relative modes. There’s three “reaches” for addressing — short (one-byte), long (two-byte), and extended (three-byte) — trading off memory area with performance. "

---

some MCU conclusions from mcuComparisons, after reading [31]:

" if i had to prioritize some of these i'd say:

note: at one point the author says that some criteria for his choices were:

to summarize even shorter:

---

in reply to a Q about RISC-V simulators:

" The QEMU port is out of date, but there's an active effort going on to update it right now

  https://github.com/riscv/riscv-qemu/pull/70

There's a handful of other ISA simulators available for RISC-V, you can build and boot a kernel on Spike (our ISA golden model) by running "make sim" here:

  https://github.com/sifive/freedom-u-sdk

It's the same kernel image that runs on the FPGA and will run on the ASIC based boards. "

---

how few signals/interrupts can we get away with? (we want to minimize the memory devoted to interrupt handler entry point tables, although i suppose these could just be linked lists). Cortex-M0/M0+/M1 limits vendors to 32 interrupts at most. [32]. Here's a table of ~29-~32 POSIX-defined signals: https://en.wikipedia.org/wiki/Signal_(IPC)#POSIX_signals . https://en.wikipedia.org/wiki/Signal_(IPC)#Miscellaneous_signals lists 7 more nonstandard signals. Even for the standard signals, "For most signals the corresponding signal number is implementation-defined". The ones with standard numbers are listed in [33], which lists 7: SIGHUP, SIGINT, SIGQUIT, SIGABRT, SIGKILL, SIGALRM, SIGTERM.

https://www.elprocus.com/types-of-interrupts-in-8051-microcontroller-and-interrupt-programming/ has 5 interrupts [34]: Timer 0 overflow, Timer 1 overflow, External hardware interrupt INT0, External hardware interrupt INT1, Serial communication interrupt.

PIC16F877 has 15 interrupts [35]. PIC micro has 15 interrupts [36].

AT90USB1287 maybe has 4 interrupts? [37] [38]. ATMega8515 maybe has 18? [39]. ATmega328P has 26, ATtiny4313 has 21, ATtiny85 has 14 [40].

MSP430 has about 32 interrupt priorities and 32 defined interrupts? [41].

The demo figure in [42] has 8 interrupt priority levels.

"Newer x86 systems integrate an Advanced Programmable Interrupt Controller (APIC) that conforms to the Intel APIC Architecture. These APICs support a programming interface for up to 255 physical hardware IRQ lines per APIC, with a typical system implementing support for only around 24 total hardware lines."

"There are 256 interrupt vectors on x86 CPUs, numbered from 0 to 255 which act as entry points into the kernel. The number of interrupt vectors or entry points supported by a CPU differs based on the CPU architecture."

" Common practice is to leave the first 32 vectors for exceptions, as mandated by Intel. However you partition the rest of the vectors is up to you. "

" There are actually two PICs on most systems, and each has 8 different inputs, plus one output signal that's used to tell the CPU that an IRQ occurred. "

linux man 7 signal: " First the signals described in the original POSIX.1-1990 standard.

       Signal     Value     Action   Comment
       ──────────────────────────────────────────────────────────────────────
       SIGHUP        1       Term    Hangup detected on controlling terminal
                                     or death of controlling process
       SIGINT        2       Term    Interrupt from keyboard
       SIGQUIT       3       Core    Quit from keyboard
       SIGILL        4       Core    Illegal Instruction
       SIGABRT       6       Core    Abort signal from abort(3)
       SIGFPE        8       Core    Floating-point exception
       SIGKILL       9       Term    Kill signal
       SIGSEGV      11       Core    Invalid memory reference
       SIGPIPE      13       Term    Broken pipe: write to pipe with no
                                     readers; see pipe(7)
       SIGALRM      14       Term    Timer signal from alarm(2)
       SIGTERM      15       Term    Termination signal
       SIGUSR1   30,10,16    Term    User-defined signal 1
       SIGUSR2   31,12,17    Term    User-defined signal 2
       SIGCHLD   20,17,18    Ign     Child stopped or terminated
       SIGCONT   19,18,25    Cont    Continue if stopped
       SIGSTOP   17,19,23    Stop    Stop process
       SIGTSTP   18,20,24    Stop    Stop typed at terminal
       SIGTTIN   21,21,26    Stop    Terminal input for background process
       SIGTTOU   22,22,27    Stop    Terminal output for background process
       The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.
       Next the signals not in the POSIX.1-1990 standard but described in SUSv2 and POSIX.1-2001.
       Signal       Value     Action   Comment
       ────────────────────────────────────────────────────────────────────
       SIGBUS      10,7,10     Core    Bus error (bad memory access)
       SIGPOLL                 Term    Pollable event (Sys V).
                                       Synonym for SIGIO
       SIGPROF     27,27,29    Term    Profiling timer expired
       SIGSYS      12,31,12    Core    Bad system call (SVr4);
                                       see also seccomp(2)
       SIGTRAP        5        Core    Trace/breakpoint trap
       SIGURG      16,23,21    Ign     Urgent condition on socket (4.2BSD)
       SIGVTALRM   26,26,28    Term    Virtual alarm clock (4.2BSD)
       SIGXCPU     24,24,30    Core    CPU time limit exceeded (4.2BSD);
                                       see setrlimit(2)
       SIGXFSZ     25,25,31    Core    File size limit exceeded (4.2BSD);
                                       see setrlimit(2)...
       Next various other signals.
       Signal       Value     Action   Comment
       ────────────────────────────────────────────────────────────────────
       SIGIOT         6        Core    IOT trap. A synonym for SIGABRT
       SIGEMT       7,-,7      Term    Emulator trap
       SIGSTKFLT    -,16,-     Term    Stack fault on coprocessor (unused)
       SIGIO       23,29,22    Term    I/O now possible (4.2BSD)
       SIGCLD       -,-,18     Ign     A synonym for SIGCHLD
       SIGPWR      29,30,19    Term    Power failure (System V)
       SIGINFO      29,-,-             A synonym for SIGPWR
       SIGLOST      -,-,-      Term    File lock lost (unused)
       SIGWINCH    28,28,20    Ign     Window resize signal (4.3BSD, Sun)
       SIGUNUSED    -,31,-     Core    Synonymous with SIGSYS"

so looks like they have about 32 'value' numbers available. But also:

" Real-time signals Starting with version 2.2, Linux supports real-time signals as originally defined in the POSIX.1b real-time extensions (and now included in POSIX.1-2001). The range of supported real-time sig‐ nals is defined by the macros SIGRTMIN and SIGRTMAX. POSIX.1-2001 requires that an implementa‐ tion support at least _POSIX_RTSIG_MAX (8) real-time signals.

       The Linux kernel supports a range of 33 different real-time signals, numbered 32 to 64."

see also http://www.gnu.org/software/libc/manual/html_node/Standard-Signals.html
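
For comparison with the 'signals must have data attached' idea further down: POSIX real-time signals are queued and carry a payload, unlike the classic signals in the tables above. A minimal sketch using only standard POSIX calls (sigaction, sigqueue):

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t got;

    static void rt_handler(int sig, siginfo_t *info, void *uc) {
        (void)sig; (void)uc;
        got = info->si_value.sival_int;   /* real-time signals carry a payload */
    }

    int main(void) {
        struct sigaction sa = {0};
        sa.sa_flags = SA_SIGINFO;         /* ask for the siginfo_t argument */
        sa.sa_sigaction = rt_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGRTMIN, &sa, NULL);   /* SIGRTMIN: first real-time signal */

        union sigval v = { .sival_int = 42 };
        sigqueue(getpid(), SIGRTMIN, v);  /* queued, not collapsed like classic signals */
        printf("handler saw payload %d\n", (int)got);
        return 0;
    }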

Note that thread descriptors in Pthreads probably don't each need such a table of entry points: " It is not possible to install "per-thread" signal handlers.

From man 7 signal (emphasis by me):

    The signal disposition is a per-process attribute: in a multithreaded application, the disposition of a particular signal is the same for all threads."

This suggests that we might want to start with a limit of between 8 and 32 signals (inclusive).

I'm leaning towards 16 (b/c it's in the middle of that range) or 32 (because it accommodates all AVRs and Cortex-M0 ARMs).

16 interrupts requires 1 16-bit word to individually mask. If there are 16 priority levels (4 bits), that requires a further 4 words to set the priority levels (unless we just say that the interrupt number also determines its priority). Plus 16 words for the entry points. For a total of 21 words or so.

32 interrupts requires 2 16-bit words to individually mask. If there are 16 priority levels (4 bits), that requires a further 8 words to set the priority levels. Plus 32 words for the entry points. For a total of 42 words or so.
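
As a sanity check on that arithmetic, the 32-interrupt layout written out as a C struct of 16-bit words (field names are made up):

    #include <stdint.h>

    typedef struct {
        uint16_t mask[2];      /* 1 enable bit per interrupt: 32 bits  = 2 words */
        uint16_t priority[8];  /* 4 bits per interrupt:       128 bits = 8 words */
        uint16_t entry[32];    /* one entry-point word per interrupt  = 32 words */
    } intr_block32;            /* total: 2 + 8 + 32 = 42 words */

    _Static_assert(sizeof(intr_block32) == 42 * 2, "expected 42 16-bit words");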

A cache line in some desktop processors is 64 bytes.

What's the conservative thing to do here? More interrupts wastes (about 21 words of) memory in every process. But too few is probably a more serious problem, akin to running out of address space by using a low bitwidth. So maybe 32 is more conservative. Otoh the cost of having too few interrupts is probably just having to software-multiplex the additional hardware interrupts onto a catch-all VM interrupt, which increases latency for the multiplexed interrupts but doesn't affect the others -- and we're not exactly targeting bare-metal real-time MCU control applications here.
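
A sketch of what that software multiplexing could look like, assuming a catch-all signal number and a platform register reporting which external line fired (all names invented; __builtin_ctz is a GCC/Clang builtin):

    #include <stdint.h>

    typedef void (*handler_fn)(int source);

    extern uint32_t read_pending_external(void);  /* hypothetical platform register */
    extern handler_fn ext_handlers[32];           /* software table for extra sources */

    /* handler for the one catch-all VM interrupt: find which external
       source(s) actually fired and dispatch in software; only these
       sources pay the extra latency */
    void catchall_handler(void) {
        uint32_t pending = read_pending_external();
        while (pending) {
            int src = __builtin_ctz(pending);     /* lowest set bit */
            pending &= pending - 1;               /* clear it */
            if (ext_handlers[src])
                ext_handlers[src](src);
        }
    }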

I guess the practical thing to do is 32. But i really want to do 16, because i hate wasting 32 words for the entry points.

But let's be practical. 32.

One thing to consider is that, instead of fixing a memory layout, if we have separate 'commands' or even opcodes to tell the VM to alter signal #x, then we don't have to prespecify the number of signals. But commands would have to be provided to set the masking and the priority as well as the entry point. Also remember that masking all (except non-maskable) signals should be quick.
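
Concretely, the command/opcode alternative might amount to an interface like this (a sketch; all names invented):

    #include <stdint.h>

    /* with these, the signal count need not be baked into a memory map */
    void vm_signal_set_entry(int sig, uint16_t entry_point);
    void vm_signal_set_priority(int sig, int priority);
    void vm_signal_set_mask(int sig, int masked);
    void vm_mask_all(void);   /* must stay cheap: one op, no per-signal loop */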

This can all go in BootX.

..i dunno man, are we really trying to make a structure for signals that allows us to efficiently map all external signals directly into Oot? Or are we just trying to have our own little mini intra-Oot signals? In the latter case, we could do without separate opcodes, have just 16 signals, and have a fixed memory map, which may be slightly simpler. Mapping external signals into Oot could be the work of further libraries.

If we just want an intra-oot signal-like-thing, we could have 16 signal entry points, 2 priorities, so one word for masking, 16 words for entry points, one word for priorities, and one word for which signals are currently pending, for a total of 19 words.
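
That 19-word map, written out as a struct of 16-bit words to check the count (names invented):

    #include <stdint.h>

    typedef struct {
        uint16_t mask;       /* 1 word: one mask bit per signal        */
        uint16_t entry[16];  /* 16 words: entry points                 */
        uint16_t priority;   /* 1 word: one priority bit per signal    */
        uint16_t pending;    /* 1 word: which signals are waiting      */
    } oot_sigblock;          /* total: 1 + 16 + 1 + 1 = 19 words       */

    _Static_assert(sizeof(oot_sigblock) == 19 * 2, "expected 19 16-bit words");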

otoh we like generality, so maybe the opcode thing is worth it.

i guess the questions are:

if we want generality, we probably want opcodes for:

Alternately, we want to define a 'signal descriptor' struct and then have an opcode to fetch the address of a signal descriptor.
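
One guess at what such a 'signal descriptor' could hold (fields are assumptions, not a spec); the opcode would just return the address of the descriptor for a given signal number:

    #include <stdint.h>

    typedef struct {
        uint16_t entry;     /* handler entry point                  */
        uint8_t  priority;  /* priority (and maybe subpriority)     */
        uint8_t  flags;     /* masked / pending / nestable / ...    */
    } sig_descriptor;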

Maybe we should lump this all into lumpy SETMODE and GETMODE opcodes, which could take many arguments on the stack. Hmm, that seems pretty attractive.

However, another argument for a hard limit on # of signals is to promote interoperability; otherwise some Oot programs may assume that, say, 64 signals are available, while some Oot implementations will only offer 16.

Another issue with all this is that most platforms probably won't support nested signals with priorities and subpriorities, so guaranteeing this forces a lot more work on BootX implementors. Maybe simpler to forget about nesting and priorities. Some platforms may not even support individually masking certain signals.

Otoh we could always just provide all this and then allow some implementations to not support it (eg not support signals at all), or not support parts of it (eg support signals and global signal masking, but not local signal masking, priorities, or nested signals).

Also, we could separate the 'internal' signal numbers from 'external' ones, and then provide some way to wire them up, eg to say "when external interrupt 0x80 occurs, that corresponds to signal 33 in BootX".
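
A sketch of such a wiring table, with invented names and an assumed bound on internal signal numbers:

    #define N_INTERNAL_SIGS 64              /* assumption for illustration */

    static int external_binding[256];       /* indexed by external interrupt number */

    int bind_external(int ext, int internal_sig) {
        if (internal_sig < 0 || internal_sig >= N_INTERNAL_SIGS)
            return -1;
        external_binding[ext & 0xFF] = internal_sig;
        return 0;
    }

    /* eg bind_external(0x80, 33); */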

Currently i'm leaning towards saying that signal support is optional, but if supported, at least 16 signals must be supported, and using GETMODE and SETMODE.

Of course, signals must have data attached.

And/or, we could unify signals and channels. So upon receiving data on some channels, the data just sits there until you check it (unless you are waiting for it via 'poll'), but upon receiving data on other channels, it asynchronously interrupts you and transfers control to the signal handler. Hmm, that sounds intriguing... so are GETMODE and SETMODE now operating on individual channels/fids/IO-addresses? Are they like fstat? What about global 'modes', like masking all maskable signals? Is this a GET/SETMODE on '0'? Can we/should we unify GETMODE and SYSINFO -- mb SYSINFO is GETMODE on channel 0, and a compiler that wants to statically detect a SYSINFO must use constant propagation to statically detect that the channel is 0 -- wait no we have a zero register, no constant propagation needed.

I like this. Yes, unify signals and channels. Yes, use GET/SETMODE. Yes, SYSINFO = GETMODE 0. Yes, signals are optional. Yes, allow the implementation to request binding 'external' signals to 'internal' ones -- or just to prebind well-known signal numbers in a certain way (reserve some signal numbers for this). Yes, let the implementation determine the number of signals. If GETMODE is two-operand (register direct, immediate), then SYSINFO k == GETMODE R0 k --- otoh we may want to allow GETMODE to take a third operand from SMALLSTACK -- but if we did that it gets harder to statically analyze, so mb not -- 16 is already a lot.
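
A sketch of the implementation-side dispatch this implies, with channel 0 playing the role of 'the system' (all names invented):

    #include <stdint.h>

    extern uint16_t sysinfo_query(uint16_t key);
    extern uint16_t channel_mode_query(uint16_t channel, uint16_t key);

    uint16_t getmode(uint16_t channel, uint16_t key) {
        if (channel == 0)
            return sysinfo_query(key);   /* SYSINFO = GETMODE on channel 0 */
        return channel_mode_query(channel, key);
    }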

---