proj-oot-ootAssemblyNotes18

---

should probably read at least

https://en.wikipedia.org/wiki/RISC-V

---

for OSs, hypervisors, etc, some thoughts:

" the ecall instruction, the common instruction used to call the next most privileged layer in the processor. For example, an ecall executed at the User layer will likely be handled at the Supervisor layer. An ecall executed at the Supervisor layer will be handled by the Machine layer. (This is a simplistic explanation that can get more complex with trap redirection, but we won't dive into those waters at this moment).

So, when the Supervisor (Linux kernel) executes ecall, the Machine layer's trap handler is executed in Machine mode. The code can be found in the riscv-pk at trap 9, the mcall_trap function, in machine/mtrap.c. ... The RISC-V privilege model was initially designed as an ecosystem that consists of four separate layers of privilege: User, Supervisor, Hypervisor, and Machine. The User privilege layer is, of course, the least privileged layer, where common applications are executed. Supervisor is the privilege layer where the operating system kernel (such as Linux, mach, or Amoeba) lives. The Hypervisor layer was intended to be the layer at which control subsystems for virtualization would live, but has been deprecated in more recent versions of the privilege specification. ... Each privilege layer is presumed to be isolated from all lower privileged layers during code execution, as one would expect. The CPU itself ensures that registers attributed to a specific privilege layer cannot be accessed from a less privileged layer. Thus, as a policy, Supervisor layer code can never access Machine layer registers. This segmentation helps guarantee that the state of each layer cannot be altered by lower privileged layers.

However, the original privilege specification defined memory protection in two separate places. First, the mstatus register's VM field defines what memory protection model shall be used during code execution. "

-- http://blog.securitymouse.com/2017/04/the-risc-v-files-supervisor-machine.html

to return after an ecall, "sret (supervisor exception return)"
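
a minimal sketch of what the User side of this convention looks like (my illustration, not from the blog post; this is the standard riscv64 Linux syscall ABI: syscall number in a7, args in a0..a5, ecall traps up to the Supervisor, which eventually returns control via sret with the result in a0):

  /* sketch: calling up one privilege level from User mode via ecall,
     using the riscv64 Linux syscall convention. Compile with a riscv64
     gcc; illustrative only, not a full libc. */
  static inline long syscall3(long n, long arg0, long arg1, long arg2) {
      register long a7 __asm__("a7") = n;     /* syscall number */
      register long a0 __asm__("a0") = arg0;  /* first arg; also the return value */
      register long a1 __asm__("a1") = arg1;
      register long a2 __asm__("a2") = arg2;
      __asm__ volatile("ecall"                /* trap to the next privilege layer */
                       : "+r"(a0)
                       : "r"(a7), "r"(a1), "r"(a2)
                       : "memory");
      return a0;  /* execution resumes here after the kernel's sret */
  }

eg syscall3(64, 1, (long)"hi\n", 3) would invoke write(2) (64 is SYS_write in the riscv64 Linux syscall table).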

---

where are my notes on that Princeton 'tricheck' paper by Trippel, Manerkar, Lustig, Pellauer, Martonosi, that claimed to find problems in RISC-V's memory consistency model/memory ordering/concurrency and that got some press around early 2017?

http://mrmgroup.cs.princeton.edu/papers/ctrippel_ASPLOS17.pdf

i recall skimming some portions of that paper, surely i took notes somewhere, particularly on their recommendations (search the PDF above for 'recommend'; also section 5.1.3 contains a recommendation without using that word) (the following are quotes):

anyways, in case i DIDN'T take notes, here are the notes.

what they are doing about it (forming a task group to revise the memory consistency model): https://riscv.org/2017/04/risc-v-memory-consistency-model/

may 10 talk: Status of the RISC-V Memory Consistency Model https://riscv.org/2017/05/6th-risc-v-workshop-proceedings/ https://riscv.org/wp-content/uploads/2017/05/Wed1000am-Memory-Model-Lustig.pdf https://youtu.be/E5s54AVGV2E

mailing list search: https://groups.google.com/a/groups.riscv.org/forum/#!searchin/isa-dev/memory$20consistency$20model

their task group formation announcement on the mailing list, with details about design choices to be considered:

https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/Oxm_IvfYItY/discussion

google search: https://www.google.com/search?q=risc-v+Memory+consistency+Model&safe=active&client=ubuntu&channel=fs&source=lnt&tbs=cdr%3A1%2Ccd_min%3Amay+2017%2Ccd_max%3A&tbm=

the task group (committee) issued a Memory Consistency Model Addendum 2.2 saying what to do in the meantime to be conservative:

https://docs.google.com/viewer?a=v&pid=forums&srcid=MDQwMTcyODgwMjc3MjQxMjA0NzcBMDUwNzQ0NzcxMjczNjI2NzQwNDEBczVCLTc5VWtCd0FKATAuMQFncm91cHMucmlzY3Yub3JnAXYy mailing list discussion: https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/-p9ch4V9bKM/discussion

the above searches were done on Oct 4 2017.

also, the May talk looks like a good intro to what sorts of things the issues are:

https://riscv.org/wp-content/uploads/2017/05/Wed1000am-Memory-Model-Lustig.pdf https://www.google.com/search?q=risc-v+Memory+consistency+Model&safe=active&client=ubuntu&channel=fs&source=lnt&tbs=cdr%3A1%2Ccd_min%3A5%2F28%2F2017%2Ccd_max%3A&tbm=


" There are ongoing efforts to specify memory models for multi- threaded programming in C, C++ [57] and other languages. These efforts are influenced by the type of memory models that can be supported efficiently on existing architectures like x86, POWER and ARM. While the memory model for x86 [46, 51, 54] is cap- tured succinctly by the Total Store Order (TSO) model, the models for POWER [52] and ARM [25] are considerably more complex. The formal specifications of the POWER and ARM models have required exposing microarchitectural details like speculative exe- cution, instruction reordering and the state of partially executed in- structions, which, in the past, have always been hidden from the user. ... SC [38] is the most intuitive memory model, but naive implemen- tations of SC suffer from poor performance. ... Instead the manufactures and researchers have chosen to present weaker memory model interfaces, e.g. TSO [58], PSO [61], RMO [61], x86 [46, 51, 54], Processor Consistency [30], Weak Consis- tency [24], RC [27], CRF [55], POWER [33] and ARM [9]. The tutorials by Adve et al. [1] and by Maranget et al. [44] provide re- lationships among some of these models. The lack of clarity in the definitions of POWER and ARM mem- ory models in their respective company documents has led some researchers to empirically determine allowed/disallowed behaviors [8, 25, 41, 52]. Based on such observations, in the last several years, both axiomatic models and operational models have been devel- oped which are compatible with each other [3–5, 7, 8, 25, 41, 52, 53]. However, these models are quite complicated; for example, the POWER axiomatic model has 10 relations, 4 types of events per in- struction, and 13 complex axioms [41], some of which have been added over time to explain specific behaviors [4–6, 41]. The ab- stract machines used to describe POWER and ARM operationally are also quite complicated, because they require the user to think in terms of partially executed instructions [52, 53]. ... Adve et al. defined Data-Race-Free-0 (DRF0), a class of pro- grams where shared variables are protected by locks, and proposed that DRF0 programs should behave as SC [2]. Marino et al. im- proves DRF0 to the DRFx model, which throws an exception when a data race is detected at runtime [45]. However, we believe that architectural memory models must define clear behaviors for all programs, and even throwing exceptions is not satisfactory enough.

A large amount of research has also been devoted to specifying the memory models of high-level languages, e.g. C/C++ [12–15, 17, 34, 35, 37, 49, 57] and Java [18, 20, 23, 42, 43]. There are also proposals not tied to any specific language [19, 22]. This remains an active area of research because a widely accepted memory model for high-level parallel programming is yet to emerge, while this paper focuses on the memory models of underlying hardware "

-- An Operational Framework for Specifying Memory Models using Instantaneous Instruction Execution

---

from [1] :

hierarchy of common memory consistency model strengths, from strongest to weakest:

a "woefully incomplete" characterization of these memory consistency models: consider reorderings of the following instruction pairs: load/load, load/store, store/load, store/store:

SEQUENTIAL CONSISTENCY:

" 1. All threads are interleaved into a single “thread” 2. The interleaved thread respects each thread’s original instruction ordering (“program order”) 3. Loads return the value of the most recent store to the same address, according to the interleaving

...

For performance, most processors weaken rule #2, and most weaken #1 as well.

...

Q: Can I think of an execution as an interleaving of the instructions in each thread (in some order)? ... That would make it illegal to forward values from a store buffer!.. Because with a store buffer, cores can read their own writes “early”.

Option 1: forbid store buffer forwarding, keep a simpler memory model, sacrifice performance

Option 2: change the memory model to allow store buffer forwarding, at the cost of a more complex model

Nearly all processors today choose #2

Q: Can I think of an execution as an interleaving of the instructions in each thread (in some order), with an exception for store buffer forwarding?

A:

example of the exception for store buffer forwarding:

2 CPUs. Each CPU has 2 threads: threads 1 and 2 on CPU 1, threads 3 and 4 on CPU 2. Thread 1 (on CPU 1) stores a value to memory location A, then Thread 2 reads from memory location A. Starting at about the same time, Thread 3 (on CPU 2) stores a different value to memory location A, then Thread 4 reads from memory location A. If the time it takes for these stores to propagate between CPUs is short, Thread 2 perceives an ordering on which Thread 1's store came before Thread 3's store, but Thread 4 perceives the opposite ordering. So, in this case, there is no interleaving perceived by all threads.
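
here's the closely related classic IRIW ('independent reads of independent writes') litmus test as a C11 sketch (mine, not from the source; it uses two different locations rather than one location plus shared store buffers, but it shows the same 'no single interleaving perceived by all threads' phenomenon; assumes C11 <threads.h>):

  /* IRIW litmus test sketch. With memory_order_relaxed, C11 permits
     r1==1 && r2==0 && r3==1 && r4==0: the two readers disagree about
     the order of the two independent stores, so no single interleaving
     explains the execution. Only seq_cst on all accesses forbids it. */
  #include <stdatomic.h>
  #include <threads.h>
  #include <stdio.h>

  atomic_int x, y;              /* both initially 0 */
  int r1, r2, r3, r4;

  int writer_x(void *arg) { atomic_store_explicit(&x, 1, memory_order_relaxed); return 0; }
  int writer_y(void *arg) { atomic_store_explicit(&y, 1, memory_order_relaxed); return 0; }

  int reader_xy(void *arg) {    /* reads x first, then y */
      r1 = atomic_load_explicit(&x, memory_order_relaxed);
      r2 = atomic_load_explicit(&y, memory_order_relaxed);
      return 0;
  }
  int reader_yx(void *arg) {    /* reads y first, then x */
      r3 = atomic_load_explicit(&y, memory_order_relaxed);
      r4 = atomic_load_explicit(&x, memory_order_relaxed);
      return 0;
  }

  int main(void) {
      thrd_t t[4];
      thrd_create(&t[0], writer_x, NULL);
      thrd_create(&t[1], writer_y, NULL);
      thrd_create(&t[2], reader_xy, NULL);
      thrd_create(&t[3], reader_yx, NULL);
      for (int i = 0; i < 4; i++) thrd_join(t[i], NULL);
      /* a single run will almost never show the weak outcome;
         litmus-testing tools run shapes like this millions of times */
      printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
      return 0;
  }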

---

" Memory Model Landscape

Sequential Consistency (SC)

Total Store Order (TSO)

Weaker memory models

...

Architects find SC & TSO constraining

(((but))) Programmers hate weak memory models (((because)))...

Difficult to understand, implementation-driven weak memory models ARM, POWER, RMO, Alpha, etc....

" -- [3]

---

it may be useful to look at what was debated in the RISC-V memory consistency model task group, with the heuristic that these items are the 'unsolved' questions in the field, eg things that RISC-V may get wrong, eg complexities to try to stay away from in OVM:

" Items on the agenda currently include, in rough priority order:

...

" PENDING/POSSIBLE CHANGES TO THE MODEL

Feature: Status

Multi-copy atomicity: Major debate!

Enforce same-address ordering (including load-load pairs): Required! (((?? but see subsequent May 13 mailing list post https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/-p9ch4V9bKM/Ah1Jb_9-BQAJ "More than likely we won't require a fence between two loads if the second is address-dependent on the first...(((but))) I don't think we're quite ready to commit to anything absolute yet.")))

Forbid load-store reordering (for accesses to different addresses), Enforce ordering of address/control/data-dependent instructions, Which FENCE types? (.pr, .pw, .sr, .sw? Other?): Still sorting out the details!

IT’S ALWAYS SAFE TO BE CONSERVATIVE

" More than likely we won't require a fence between two loads if the second is address-dependent on the first. Practically speaking, a lot of software (notably Linux) basically assumes that hardware always guarantees this to work, because no major architecture since Alpha has relaxed such orderings.

However, we don't yet have 100% consensus on how exactly to formalize this in the task group, and whether address, control, and data dependencies are all equivalent in strength, whether they apply equally well to read-read vs. read-write orderings, or even whether all of the above should even be enforced. So I don't think we're quite ready to commit to anything absolute yet. " -- [6]

"

"

– Architects should pay careful attention to aggressive memory access reordering, aggressive cache coherence protocols, and designs that share store buffers between threads.

– Hardware should respect all same-address orderings (including load-load pairs) and any orderings established by address, control, and data dependencies.

C/C++ Construct: Base ISA Mapping | ‘A’ Extension Mapping

Non-atomic Load...

atomic_load(memory_order_consume): ld; fence r,rw
atomic_load(memory_order_acquire): ld; fence r,rw
atomic_load(memory_order_seq_cst): fence rw,rw; ld; fence r,rw

Non-atomic Store...

atomic_store(memory_order_relaxed): sd
atomic_store(memory_order_release): fence rw,w; sd | amoswap.rl
atomic_store(memory_order_seq_cst): fence rw,rw; sd | fence rw,rw; amoswap

Fences

atomic_thread_fence(memory_order_acquire): fence r,rw
atomic_thread_fence(memory_order_release): fence rw,w
atomic_thread_fence(memory_order_acq_rel): fence rw,rw
atomic_thread_fence(memory_order_seq_cst): fence rw,rw

Furthermore, we recommend compiler writers avoid fences weaker than fence r,rw, fence rw,w, and fence rw,rw until the memory model clarifies their semantics. Additionally, while AMOs with both the aq and rl bits set do imply both aq and rl semantics, we recommend against their use until the memory model clarifies their combined semantics.

" -- RISC-V Memory Consistency Model Addendum 2.2

" Weak memory models: Technical issues

Atomic memory systems

(((is the following an example of an atomic memory system, or an example of a NON-atomic memory system?)))

Consensus: RISC-V memory model definition will rely only on atomic memory " -- [9]

from https://www.bsc.es/sites/default/files/public/u1810/arvind_0.pdf :

"Example: Ld-St Reordering Permitting a store to be issued to the memory before previous loads have completed, allows load values to be affected by future stores in the same thread" For example,

Process 1: r1 = Load(a) Store(b,1)

Process 2: r2 = Load(b) Store(a,r2)

Load-store reordering would allow the '1' stored by Process 1 into b to be loaded into r2 by process 2's load, and then stored into a by process 2's store, and then loaded into r1 from a by process 1! Implementation-wise, what could happen is: process 1's store to b does not depend on its load of a, so the store can be issued to memory while the load is still outstanding; process 2 then reads b=1 and stores 1 into a, and process 1's still-pending load finally reads a=1.

" Load-Store Reordering

Nvidia says it cannot do without Ld-St reordering

Although IBM POWER memory model allows this behavior, the server-end POWER processors do not perform this reordering for reliability, availability and serviceability (RAS) reasons

MIT opposes the idea because it complicates both the operational and axiomatic definitions, and MIT estimates no performance penalty in disallowing Ld-St reordering

Nevertheless MIT has worked diligently to come up with a model that allows Ld-St ordering (((perhaps e meant REordering? also, MIT's primary proposed model, WMM, prohibits load-store reordering, so note that they are talking about their 'Model X' here, which is detailed later in the slides))) " -- [10]

" C++ operations, WMM instructions Non-atomic Load / Load Relaxed: Ld Load Consumed / Load Acquire: Ld; Reconcile Load SC: Commit; Reconcile; Ld; Reconcile Non-atomic Store / Store Relaxed: St Store Released /Store SC: Commit; St

Compilation from C++11 to WMM C++11 introduces atomic variables in addition to the ordinary (non-atomic) ones

" RISC-V memory model debate is not settled; in spite of lot of research by the Memory Model Committee (Chair Dan Lustig), the community may vote for TSO "

---

figure 1 from [11], broken into parts, and with some details and notes omitted:

Operational model, Axiomatic model:

this suggests that we should restrict our attention to:

which are the ones which have both simple operational and simple axiomatic models

"reasoning (((about))) partially executed instructions...is unavoidable for ARM and POWER operational definitions." [12]

the rest of figure 1 from [13], for these rows only, with some details and notes omitted:

Store atomicity, Allow shared write-through cache/shared store buffer, Instruction reorderings, Ordering of data-dependent loads:

(note: this group's alternative proposal, WMM-S, for which the operational model complexity was 'medium' and no axiomatic model was provided, has non-atomic store atomicity; i think that one point that the authors may be trying to make is that you want at least multi-copy atomicity for clean semantics; in the paper's conclusion they say "Since there is no obvious evidence that restricting to multi-copy atomic stores affects performance or increases hardware complexity, RISC-V should adopt WMM in favor of simplicity.". Elsewhere in the table, in row 'ARM and POWER', store atomicity is classified as 'Non-atomic', although note that [14] says that ARMv8.2 is "(other-/weak-)multi-copy atomic" as opposed to POWER and GPU which are "not multi-copy atomic"). Indeed [15] says "The...manuals for ARMv7 and early ARMv8 described a relaxed memory model, with programmer-visible out-of-order and speculative execution, that was non-multicopy-atomic... The ARMv8 architecture has therefore been revised: it now has a multicopy-atomic model."

multi-copy atomicity is defined here:

" In this paper we propose two weak memory models for RISC-V: WMM and WMM-S, which balance definitional simplicity and implementation flexibility differently. The difference between the two models is regarding store atomicity, which is often classified into the following three types [19]:

The abstract of [17] defines "non-multicopy-atomic" as "writes could become visible to some other threads before becoming visible to all".

So this suggests that we should restrict our attention to:

which have the common characteristics of at least the following model strengths:

and of which at least one of which permits the following model weaknesses:

---

more background on WMM; the paper says

" The memory model for RISC-V, a newly developed open source ISA, has not been finalized yet and thus, offers an opportunity to evaluate existing memory models. We believe RISC-V should not adopt the memory models of POWER or ARM, because their axiomatic and operational definitions are too complicated. We propose two new weak memory models: WMM and WMM-S, which balance definitional simplicity and implementation flexibility differently. Both allow all instruction reorderings except overtaking of loads by a store. We show that this restriction has little impact on performance and it considerably simplifies operational definitions. It also rules out the out-of-thin-air problem that plagues many definitions. WMM is simple (it is similar to the Alpha memory model), but it disallows behaviors arising due to shared store buffers and shared write-through caches (which are seen in POWER processors). WMM-S, on the other hand, is more complex and allows these behaviors. " -- https://arxiv.org/pdf/1707.05923.pdf

---

Instantaneous Instruction Execution (I2E): a formalism invented by the MIT group working on the RISC-V memory consistency model, for formalizing memory consistency models

"

SC in I2E:

TSO in I2E:

Simple and vendor-independent

TSO allows loads to overtake stores " -- [18]

---

so summary of the last few sections:

---

to further summarize the previous summary:

memory ordering consistency models to think about appear to include:

All of SC, TSO, and WMM have the following in common:

Unless we go with SC only, most of the other CPU-style memory ordering consistency models seem to permit store-load reordering. Whether any dependencies are ordered varies between models, and so perhaps should be assumed to be weak.

Yes, a seq cst fence instruction is probably useful. Yes, a seq cst load and a seq cst store are probably useful.

---

later updates on the RISC-V memory ordering consistency model:

171128 status update by Dan Lustig and the Memory Model TG

https://content.riscv.org/wp-content/uploads/2017/12/Tue0954-RISC-V_Memory_Model-Lustig.pdf

the debate was between: Strong Models (e.g., x86-TSO) and Weak Models (e.g., ARM, IBM Power); initial proposals narrowed down to a strong one, RVTSO (RISC-V TSO, similar to SPARC, x86) and a weak one, RVWMO (RISC-V Weak Memory Ordering, similar to ARMv8); Both are multi-copy atomic, so both are simpler than IBM Power and ARMv7.

The decision was to adopt RVWMO (and to offer TSO as an option, 'Ztso'); toolchain like Linux, gcc, bintools will target RVWMO.

ld.rl and sd.aq are deprecated. ld.aqrl and sd.aqrl mean RCsc ((release consistency with sequential consistency)), not fully fenced.

in both RVWMO and RVTSO we have:

RVWMO RULES IN A NUTSHELL

Other than the above, A guaranteed to happen before B only if one of:

RVTSO RULES IN A NUTSHELL (these are strictly stronger than RVWMO)

Other than the above, A guaranteed to happen before B only if one of:

mailing list thread, Dec 1, with discussion of the above slide presentation, and a memory-model-spec.pdf (everything in that document appears to have been copied into the RISC-V spec Github, see below regarding links to https://github.com/riscv/riscv-isa-manual , at a later date, so i'd read that instead of this): https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/hKywNHBkAXM

The main topic in the list discussion is that some participants strongly support dropping any standardization of a TSO extension, and strongly prefer WMO to TSO. An argument for WMO over TSO is [20]. An argument for not having any sort of officially blessed TSO extension is [21] and [22]; in sum, they argue that, if some chips are TSO, some software writers will happen to have these chips and so will test their programs on these chips, and then they won't realize, or won't care, that their software is TSO-dependent; then later devs for the same software will be forced to only buy TSO-compliant chips; since WMO-compliant software can run on TSO-compliant chips but TSO-dependent software cannot run on WMO chips, the TSO-compliant chip market will be entrenched; there is a positive feedback loop where, if such chips become common, then even eg GCC maintainers or writers of important libraries might have them, and so will try to write WMO-compliant software but fail, and so in the long run the TSO-compliant chips may even dominate the WMO-compliant chip market; they argue that it's better to not bless TSO with an official extension because in many contexts writing software known to depend on nonstandard vendor-specific extensions is less acceptable than writing software meeting a standard.

mb see also https://github.com/riscv/riscv-isa-manual/blob/master/src/memory.tex , which was updated Dec 13, and section Memory Consistency Model in https://github.com/riscv/riscv-isa-manual/blob/master/src/rv32.tex (probably easier just to download and compile the spec and browse the resulting PDF; these parts are Appendix A, memory consistency model, and section "Memory Consistency Model" of Chapter RV32I Base Integer Instruction Set)

from that spec/manual:

" RISC-V Instruction Memory Accesses: l{b

s{blr load lr.aq load-acquire-RCpc lr.aqrl load-acquire-RCsc lr.rl (deprecated) sc store sc.rl store-release-RCpc sc.aqrl store-release-RCsc sc.aq (deprecated) amo<op> load; <op>; store amo<op>.aq load-acquire-RCpc; <op>; store amo<op>.rl load; <op>; store-release-RCpc amo<op>.aqrl load-SC; <op>; store-SC
hwd} load ∗ (∗ : possibly multiple if misaligned)
hwd} store ∗ (∗ : possibly multiple if misaligned)

...

Definition of the RVWMO Memory Model (a lengthy section with a lot of rules)...

Definition of the RVTSO Memory Model

RISC-V cores which implement Ztso impose RVTSO onto all memory accesses. RVTSO behaves just like RVWMO but with the following modifications:

l{b|h|w|d|r} instructions behave as if .aq is set
s{b|h|w|d|c} instructions behave as if .rl is set

These rules render PPO rules 1 and 8–16 redundant. They also make redundant any non-I/O fences that do not have both .pw and .sr set. Finally, they also imply that all AMO instructions are fully-fenced; nothing will be reordered past an AMO.

"

(note: the opcodes to actually do the above don't exist yet, see [23]; this, combined with a desire for speed (not delaying the TSO memory model until those opcodes are added) appears to be the critical reason why a standard RVTSO is even included, according to that mailing list message)

" A.6 Code Porting Guidelines Normal x86 loads and stores are all inherently acquire and release operations: TSO enforces all load-load, load-store, and store-store ordering by default. All TSO loads must be mapped onto l{b

or onto fence rw,w; s{bl{bibility in case such instructions are added to the ISA one day. However, in the meantime, the assembler will generate the same fence-based and/or amoswap-based versions for these pseudoin- structions. x86 atomics using the LOCK prefix are all sequentially consistent and when ported naively to RISC-V must be marked as .aqrl. A Power sync/hwsync fence, an ARM dmb fence, and an x86 mfence are all equivalent to a RISC-V fence rw,rw. Power isync and ARM isb map to RISC-V fence.i. A Power lwsync map onto fence.tso, or onto fence rw,rw when fence.tso is not available. ARM dmb ld and dmb st fences map to RISC-V fence r,rw and fence w,w, respectively.
hwd}; fence r,rw, and all TSO stores must either be mapped onto amoswap.rl x0
hwd}. Alternatively, TSO loads and stores can be mapped onto
hwd}.aq and s{bhwd}.rl assembler pseudoinstructions to facilitate forwards compat-

A direct mapping of ARMv8 atomics that maps unordered instructions to unordered instructions, RCpc instructions to RCpc instructions, and RCsc instructions to RCsc instructions is likely to work in the majority of cases. Mapping even unordered load-reserved instructions onto lr.aq (particularly for LR/SC pairs without internal data dependencies) is an even safer bet, as this ensures C/C++ release sequences will be respected. However, due to a subtle mismatch between the two models, strict theoretical compatibility with the ARMv8 memory model requires that a naive mapping translate all ARMv8 store conditional and load-acquire operations map onto RISC- V RCsc operations. Any atomics which are naively ported into RCsc operations may revert back to the straightforward mapping if the programmer can verify that the code is not relying on an ordering from the store-conditional to the load-acquire (as this is not common).

The Linux fences smp mb(), smp wmb(), smp rmb() map onto fence rw,rw, fence w,w, and fence r,r, respectively. The fence smp read barrier depends() map to a no-op due to preserved pro- gram order rules 8–10. The Linux fences dma rmb() and dma wmb() map onto fence r,r and fence w,w, respectively, since the RISC-V Unix Platform requires coherent DMA. The Linux fences rmb(), wmb(), and mb() map onto fence ri,ri, fence wo,wo, and fence rwio,rwio, respectively.

The C11/C++11 memory order * primitives should be mapped as shown in Table A.1. The memory order acquire orderings in particular must use fences rather than atomics to ensure that release sequences behave correctly even in the presence of amoswap. The memory order release mappings may use .rl as an alternative.

C/C++ Construct RVWMO Mapping Non-atomic load l{b

atomic load(memory order relaxed) l{batomic load(memory order acquire) l{batomic load(memory order seq cst) fence rw,rw; l{bNon-atomic store s{batomic store(memory order relaxed) s{batomic store(memory order release) fence rw,w; s{batomic store(memory order seq cst) fence rw,rw; s{batomic thread fence(memory order acquire) fence r,rw atomic thread fence(memory order release) fence rw,w atomic thread fence(memory order acq rel) fence.tso atomic thread fence(memory order seq cst) fence rw,rw
hwd}
hwd}
hwd}; fence r,rw
hwd}; fence r,rw
hwd}
hwd}
hwd}
hwd}

Table A.1: Mappings from C/C++ primitives to RISC-V primitives.

It is also safe to translate any .aq, .rl, or .aqrl annotation into the fence-based snippets of Table A.2. These can also be used as a legal implementation of l{b

doinstructions for as long as those instructions are not added to the ISA.
hwd} or s{bhwd} pseu-

Ordering Annotation Fence-based Equivalent l{b

l{bs{bs{bamo<op>.aq amo<op>; fence r,rw amo<op>.rl fence rw,w; amo<op> amo<op>.aqrl fence rw,rw; amo<op>; fence rw,rw
hwdr}.aq l{bhwdr}; fence r,rw
hwdr}.aqrl fence rw,rw; l{bhwdr}; fence r,rw
hwdc}.rl fence rw,w; s{bhwdc}
hwdc}.aqrl fence rw,w; s{bhwdc}

Table A.2: Mappings from .aq and/or .rl to fence-based equivalents. An alternative mapping places a fence rw,rw after the existing s{b

l{b"
hwdc} mapping rather than at the front of the
hwdr} mapping.

note: so, as expected, looks like surrounding stuff with "fence rw,rw" is good enough, except for I/O, which requires fence rwio,rwio.

so, looks like our just providing one 'fence' instruction (at least initially) is sufficient.

---

so i haven't yet read and digested everything in the previous section with an eye towards how it should affect Oot. But some initial notes:

---

issues with MIPS:

tropo 126 days ago [-]

MIPS has numerous defects. There is a legacy wart in the form of a delay slot that doesn't match modern pipelines; this causes all sorts of annoyances. The MMU doesn't use a hardware-walked tree, cutting into performance with cache misses and even code execution. Forming addresses requires a silly number of instructions, or alternately you give up and just load relative to a specific register. The architecture fails to specify a coherent fully physical cache, causing all sorts of performance-killing trouble in OS kernels. There are wasted bits, commonly in the "shamt" field. The "hi" and "lo" registers interfere with scheduling multiplication and division.

bobsam 126 days ago [-]

Are you familiar with the newer mips revisions?

They have been modernizing the architecture during the last 10 years.

---

"

The basic difference among RISC ISAs is the load instruction addressing because data has to be loaded first before being used. Therefore the key is access to memory, then data can be used in computation using the simple add, subtract, multiply, divide, and, or, xor instructions. "

---

"Bitcoin Script has some drawbacks. Many operations were disabled by Bit- coin’s creator, Satoshi Nakamoto [21]. This has left Bitcoin Script with a few arithmetic (multiplication was disabled), conditional, stack manipulation, hashing, and digital-signature verification operations.... All Bitcoin Script operations are pure functions of the machine state expect for the signature-verification operations. These signature-verification operations re- quire a hash of some of the transaction data. Together, this means the pro- gram’s success or failure is purely a function of the transaction data. Therefore, the person creating a transaction can know whether the transaction they have created is valid or not...Bitcoin Script is also amenable to static analysis, which is another desirable property. The digital-signature verification operations are the most expensive operations. Prior to execution, Bitcoin counts the number of occurrences of these operations to compute an upper bound on the number of expensive calls that may occur. Programs whose count exceeds a certain threshold are invalid"

---

should probably read this: https://blog.lizzie.io/linux-containers-in-500-loc.html

also interesting note in the discussion:

Bromskloss 1 hour ago [-]

She mentions five Linux kernel mechanisms – "namespaces", "capabilities", "cgroups", and "setrlimit". Is any of those what I should use if I want to run an application inside some kind of container that lets me intercept file system calls (for example for the purpose of creating a file on the fly as it is accessed)?


simcop2387 1 hour ago [-]

Seccomp with ptrace is the way I'd do this. You can setup the rules to signal the ptracing process to intercept the syscall. I've not done it before but it should be possible. Id also look at doing it in a mount namespace with overlayfs on top of everything the process can see, so that you can manipulate anything you want or need filewise without destroying the original system. Then you can copy out any changed files later if you want to preserve them.


---

http://en.cppreference.com/w/c/atomic

has

atomic_flag_test_and_set, atomic_flag_test_and_set_explicit (C11): sets an atomic_flag to true and returns the old value (function)
atomic_flag_clear, atomic_flag_clear_explicit (C11): sets an atomic_flag to false (function)
atomic_init (C11): initializes an existing atomic object (function)
atomic_is_lock_free (C11): indicates whether the atomic object is lock-free (function)
atomic_store, atomic_store_explicit (C11): stores a value in an atomic object (function)
atomic_load, atomic_load_explicit (C11): reads a value from an atomic object (function)
atomic_exchange, atomic_exchange_explicit (C11): swaps a value with the value of an atomic object (function)
atomic_compare_exchange_strong, atomic_compare_exchange_strong_explicit, atomic_compare_exchange_weak, atomic_compare_exchange_weak_explicit (C11): swaps a value with an atomic object if the old value is what is expected, otherwise reads the old value (function)
atomic_fetch_add, atomic_fetch_add_explicit (C11): atomic addition (function)
atomic_fetch_sub, atomic_fetch_sub_explicit (C11): atomic subtraction (function)
atomic_fetch_or, atomic_fetch_or_explicit (C11): atomic logical OR (function)
atomic_fetch_xor, atomic_fetch_xor_explicit (C11): atomic logical exclusive OR (function)
atomic_fetch_and, atomic_fetch_and_explicit (C11): atomic logical AND (function)
atomic_thread_fence (C11): generic memory order-dependent fence synchronization primitive (function)
atomic_signal_fence (C11): fence between a thread and a signal handler executed in the same thread (function)

atomic_bool: _Atomic _Bool
atomic_char: _Atomic char
atomic_schar: _Atomic signed char
atomic_uchar: _Atomic unsigned char
atomic_short: _Atomic short
atomic_ushort: _Atomic unsigned short
atomic_int: _Atomic int
atomic_uint: _Atomic unsigned int
atomic_long: _Atomic long
atomic_ulong: _Atomic unsigned long
atomic_llong: _Atomic long long
atomic_ullong: _Atomic unsigned long long
atomic_char16_t: _Atomic char16_t
atomic_char32_t: _Atomic char32_t
atomic_wchar_t: _Atomic wchar_t
atomic_int_least8_t: _Atomic int_least8_t
atomic_uint_least8_t: _Atomic uint_least8_t
atomic_int_least16_t: _Atomic int_least16_t
atomic_uint_least16_t: _Atomic uint_least16_t
atomic_int_least32_t: _Atomic int_least32_t
atomic_uint_least32_t: _Atomic uint_least32_t
atomic_int_least64_t: _Atomic int_least64_t
atomic_uint_least64_t: _Atomic uint_least64_t
atomic_int_fast8_t: _Atomic int_fast8_t
atomic_uint_fast8_t: _Atomic uint_fast8_t
atomic_int_fast16_t: _Atomic int_fast16_t
atomic_uint_fast16_t: _Atomic uint_fast16_t
atomic_int_fast32_t: _Atomic int_fast32_t
atomic_uint_fast32_t: _Atomic uint_fast32_t
atomic_int_fast64_t: _Atomic int_fast64_t
atomic_uint_fast64_t: _Atomic uint_fast64_t
atomic_intptr_t: _Atomic intptr_t
atomic_uintptr_t: _Atomic uintptr_t
atomic_size_t: _Atomic size_t
atomic_ptrdiff_t: _Atomic ptrdiff_t
atomic_intmax_t: _Atomic intmax_t
atomic_uintmax_t: _Atomic uintmax_t
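
to make the above concrete, the classic use of atomic_flag is a spinlock; a minimal sketch of standard C11 usage (mine, not from cppreference):

  #include <stdatomic.h>

  static atomic_flag lock = ATOMIC_FLAG_INIT;

  void spin_lock(void) {
      /* acquire ordering keeps the critical section from floating above the lock */
      while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
          ;  /* spin until the old value was false */
  }

  void spin_unlock(void) {
      /* release ordering publishes the critical section's writes */
      atomic_flag_clear_explicit(&lock, memory_order_release);
  }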

---

sequentially consistent accesses are broken in C++ concurrency!:

" Our model supports all features of C++ concurrency except con- sume reads and SC accesses. Consume reads are widely considered a premature aspect of the C++11 standard and are currently im- plemented the same as acquire reads in mainstream compilers. In contrast, SC accesses are a major feature of C++, and originally our model included an account of SC accesses as well. However, in the course of trying to mechanize correctness of compilation to Power (§5.3), we discovered that our semantics of SC accesses was flawed, and this led us to discover a flaw in the C++11 standard as well! (See [ 19 ] for further details.) Thus, a proper handling of SC accesses remains an open and important problem for future work.

19: Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer. Repairing sequential consistency in C/C++11. Technical Report MPI-SWS-2016-011, MPI-SWS, November 2016. "

---

https://people.mpi-sws.org/~dreyer/papers/promising/paper.pdf

" In this paper, we present what we believe is a very promising way forward: the first relaxed memory model to support a broad spectrum of features from the C++ concurrency model while also satisfying all three criteria listed in §1.1. We achieve these ends through a combination of mechanisms (some standard, some not), but the most important and novel idea for the reader to take away from this paper is the notion of promises. Under our model, which is defined by an operational semantics, a thread T may nondeterministically “promise” to write a value v to a memory location x at some point in the future. From the point of view of other threads, a promise is no different from an ordinary write: once T has promised to write v to x , other threads can read from that write. (In contrast, T cannot read from its own promised write until T has fulfilled the promise: this is crucial to preserve basic sanity of the semantics.) Intuitively, promises simulate the effect of read-write reorderings by allowing write events to be visible to other threads before the point at which they occur in the program order. We must, however, ensure that promises do not introduce bad OOTA (((out-of-thin-air))) behaviors. Toward this end, we only allow T to promise to write v to x if it is possible to thread-locally certify that the promise can be fulfilled in a finite number of steps. That is, we must show that T will be able to write v to x after some finite sequence of steps of T ’s execution (i.e., with no help from other threads). The certification requirement guarantees absence of bad OOTA executions by ensuring that T can only promise to write a value v to x if T could have written v to x anyway.

...

Our model supports all features of C++ concurrency except consume reads and SC accesses. Consume reads are widely considered a premature aspect of the C++11 standard and are currently implemented the same as acquire reads in mainstream compilers

(((and SC accesses, although important, are wrong in C++ too, see above section)))

...

all the existing implementations of C++, even for weaker architectures like Power and ARM, guarantee at a bare minimum a property we call per-location coherence (aka SC-per-location). Per-location coherence says that, even though threads may observe writes to different locations in different orders, they must observe writes to the same location in a single total order (called the “modification order” in C++ lingo). In addition to being supported by hardware, per-location coherence is preserved by common compiler optimizations as well. Hence, we want our semantics of relaxed accesses to guarantee it. (In §4.3 we will present an even weaker mode of accesses that does not provide full per-location coherence.)

...

Both Java and C++ fail to achieve some of these criteria. In the case of Java, the memory model fails to validate a number of common program transformations performed by real Java compilers, such as redundant read-after-read elimination and “roach motel” reordering [26]. Although this problem has been known for some time, a satisfactory solution has yet to be developed.

In the case of C++, the memory model relies crucially on undefined behaviors to give semantics to racy programs. Moreover, it permits certain “out-of-thin-air” executions which violate basic invariant-based reasoning (and DRF guarantees) [7]. "
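
for concreteness, here's the per-location-coherence shape mentioned in the quote above (the CoRR litmus test) as a C11 sketch (mine, not from the paper):

  /* CoRR ("coherence of read-read") sketch: per-location coherence means
     that even with memory_order_relaxed, if the reader sees r1==1 it cannot
     then see r2==0, because writes to the single location x have one total
     modification order. C11 guarantees this, as do mainstream CPUs. */
  #include <stdatomic.h>
  #include <threads.h>

  atomic_int x;           /* initially 0 */
  int r1, r2;

  int writer(void *arg) {
      atomic_store_explicit(&x, 1, memory_order_relaxed);
      return 0;
  }

  int reader(void *arg) {
      r1 = atomic_load_explicit(&x, memory_order_relaxed);
      r2 = atomic_load_explicit(&x, memory_order_relaxed);
      /* forbidden outcome: r1==1 && r2==0 (reading "new then old") */
      return 0;
  }

  int main(void) {
      thrd_t t1, t2;
      thrd_create(&t1, writer, NULL);
      thrd_create(&t2, reader, NULL);
      thrd_join(t1, NULL);
      thrd_join(t2, NULL);
      return 0;
  }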

---

How does the memory model in https://people.mpi-sws.org/~dreyer/papers/promising/paper.pdf compare to the WMM and WMM-S memory models given in https://arxiv.org/pdf/1707.05923.pdf ? Are there other research memory models out there that are competitive with these?

---

https://github.com/rmccullagh/como-lang-ng/blob/master/como_opcode.c

's opcodes:

const char * const str_opcodelist[] = { "INONE", "LOAD_CONST", "STORE_NAME", "LOAD_NAME", "IS_LESS_THAN", "JZ", "IPRINT", "IADD", "JMP", "IRETURN", "NOP", "LABEL", "HALT" };

sounds good...

smallvm's:

  #define OP_SET 0x01   /* Sets a register to a value or to the contents of another register */
  #define OP_ADD 0x02   /* Adds two values or register contents */
  #define OP_SUB 0x03   /* Subtracts two values or register contents */
  #define OP_MULT 0x04  /* Multiplies two values or register contents */
  #define OP_DIV 0x05   /* Divides two values or register contents */
  #define OP_MOD 0x06   /* Mods two values or register contents */
  #define OP_STORE 0x07 /* Stores a value into memory */
  #define OP_GET 0x08   /* Get a value from memory */
  #define OP_JMP 0x09   /* Jump to another location in memory */
  #define OP_IF 0x0A    /* Performs the next instruction if the values are equal */
  #define OP_IFN 0x0B   /* Performs the next instruction if the values are not equal */

sounds good...
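
for reference, a minimal sketch of the kind of dispatch loop an opcode list like that implies (my sketch; the operand-byte encoding and the OP_HALT value are made up, not smallvm's actual format):

  #include <stddef.h>
  #include <stdint.h>

  enum { OP_HALT = 0x00, OP_SET = 0x01, OP_ADD = 0x02, OP_JMP = 0x09 };  /* subset */

  void run(const uint8_t *code, uint32_t reg[8]) {
      size_t pc = 0;
      for (;;) {
          uint8_t op = code[pc++];
          switch (op) {
          case OP_SET: { uint8_t r = code[pc++]; reg[r] = code[pc++]; break; }
          case OP_ADD: { uint8_t d = code[pc++], a = code[pc++], b = code[pc++];
                         reg[d] = reg[a] + reg[b]; break; }
          case OP_JMP: pc = code[pc]; break;       /* absolute jump target */
          case OP_HALT: return;
          default: return;                         /* unknown opcode: stop */
          }
      }
  }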

---

High-Performance Extendable Instruction Set Computing http://researchbank.rmit.edu.au/eserv/rmit:2517/n2001000381.pdf

"

" load/store instructions that reference the Stack Pointer and the Index Register tend to exhibit very different operand lengths. For Stack Pointer use, the offset needs to be more than 5 bits, while the majority (77%) of the Index Register load/store operations can utilize a 3-bit operand "

note: i haven't even skimmed this article yet, i just quickly glanced at it to see what it was about and happened to see those two quotes

---


"

The 6502 is nearly a RISC machine in number of machine cycles per instruction (about 2 average) yet has powerful addressing modes for table look-up-driven real-time software. The indirect, indexed addressing mode has yet to be beat by any RISC machine, which takes too many instructions to do the same thing. " -- [25]

" Indirect-indexed addressing

In this commonly used Addressing mode, the Y Index Register is used as an offset from the given zero page vector. The effective address is calculated as the vector plus the value in Y.

Indirect-indexed addressing is written as follows:

     LDY #$04
     LDA ($02),Y

In the above case, Y is loaded with four (4), and the vector is given as ($02). If zero page memory $02-$03 contains 00 80, then the effective address from the vector ($02) plus the offset (Y) would be $8004.

This addressing mode is commonly used in array addressing, such that the array index is placed in Y and the array base address is stored in zero page as the vector. Typically, the value in Y is calculated as the array element size multiplied by the array index. For single byte-sized array elements (such as character strings), the value in Y is the array index without modification. " [26]
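
here's the effective-address calculation from that quote as a C sketch (mine, just to make the worked example explicit):

  #include <stdint.h>

  /* C sketch of the 6502 indirect-indexed ("($zp),Y") effective-address
     calculation: read a 16-bit little-endian vector from zero page,
     then add Y. */
  uint16_t ea_indirect_indexed(uint8_t zp_addr, uint8_t y, const uint8_t *mem) {
      uint16_t vector = mem[zp_addr]
                      | ((uint16_t)mem[(uint8_t)(zp_addr + 1)] << 8);  /* zero page wraps */
      return (uint16_t)(vector + y);  /* e.g. $8000 + $04 = $8004 in the example above */
  }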

---

kuba, August 2011: In the nutshell: Propeller does have WAITCNT and WAITVID, on XS1 you have one WAIT instruction that you can use to wait for any combination of events from various peripherals. And that's only the beginning.

XS1 has a fairly powerful software-controlled interrupt vectoring system. The interrupt vectors are not permanently assigned to peripherals, like in many MCUs. Instead, you can assign any vector to any event-generating peripheral (I/O port, etc). When the event happens, the vector points to the next instruction to be scheduled for given thread. A vector is specific to a thread, so you have full thread affinity for responding to external events.

The classical problem of what to do if different events all reuse same interrupt handler (vector) is handled very nicely, too. Normally you have to interrogate status bits to know what happened, if the handler could be triggered by different things. On XS1, to each event source you assign a so-called environment vector. It's simply a data word that's available in your interrupt/event handler, and lets you adjust your logic according to interrupt source. You can use it as a bitmask, as a jump offset or table offset, or whatever suits your application. I haven't seen anything like that on any of the mainstream MCUs -- feel free to correct me if I'm wrong. You normally have to emulate this by setting up code to write a value somewhere, then jump to a common handler code. This costs precious cycles. On XS1, an event/interrupt handler can be done in a couple thread cycles -- say in 80ns. That's less than one clock period on some MCUs.

The major difference between XS1 and Propeller (P1 and P2 both) is that Propeller has no interrupt support at all. On XS1, event/interrupt support enables essentially free event-driven switch statements. You can wait on many things to happen, and there's no time penalty for that. Waiting on one event is no different from waiting on 10 events, in terms of latency. Of course if two things happen at the same time, you can't process them concurrently in the same thread, but at least your code doesn't get any slower from trying to wait on many things in the first place. This is IMHO a very sane design decision.

The difference between events and interrupts is fairly simple on XS1. An Event handler does not preserve the PC. You have to be within a WAIT instruction for an event to fire. It is like a hardware-driven switch statement. You have sole control of the execution path after you're done handling an event. An Interrupt does the usual automagical PC/status storage in registers dedicated for that purpose, so there's no memory access overhead for that.

---

one thought i (bayle) had from reading a bunch of other stuff on the forums on Propeller vs. XMOS MCUs:

---

"Some cores have significantly higher performance -- for example, the ARM Cortex-M4 has DSP instructions and usually floating-point, and the Cortex-M7 has cache IIRC.

---

westfw

Quote (((from brucehoult)))

    I've come to the conclusion that 16 bit *instructions* are very much in a sweet spot, either fixed size, or with a way to escape to the occasional 32 bit instruction.

Somewhat agree... " [27]

---

suggests that an 8-level stack may be okay for really low-level stuff (we're apparently talking about an MCU with only 256 bytes of data RAM though!):

" The HT66 feels quite similar in design to a Microchip PIC16: a 4-cycle single-accumulator RISC architecture, with an 8-level stack for saving the PC address, plus a banking arrangement used to address more than 256 bytes of memory. Unlike the PIC16, there’s a single 128-byte SFR set placed at the bottom of RAM. The remaining 128 bytes of addressable space are split into two banks to cover the 256-byte capacity this part has. The 63-instruction ISA is similar to the PIC16, but also includes bit manipulation instructions. "

---

people in MCUs talk about having vectored interrupts be a big deal. They also want different interrupt priority levels, like around 2 ( https://jaycarlson.net/pf/atmel-microchip-tinyavr-1-series/ ), 3 (AVR XMEGA, see https://jaycarlson.net/pf/atmel-microchip-tinyavr-1-series/ ) , 4 (ARM Cortex-M0 "core has a nested vector interrupt controller, with up to 32 interrupt vectors and 4 interrupt priorities"), 7 ("The PIC24 has a vectored exception system similar to ARM microcontrollers; there’s also a seven-priority interrupt controller with up to 118 interrupt sources") . lots of things have 16 GPRs, eg the PIC24. The PIC24 also has multiply and divide instructions.

---

https://blogs.msdn.microsoft.com/oldnewthing/20060706-12/?p=30623

Is the maximum size of the environment 32K or 64K?

A: Both.

The limit is 32,767 Unicode characters, which equals 65,534 bytes. Call it 32K or 64K as you wish, but make sure you include the units in your statement if it isn't clear from context.

---

"

One other thing to note is that GCC uses the normal convention for function calls: any call-saved registers the function needs will be pushed to the stack by the function and restored before returning. But there’s also a bunch of call-used registers available for user functions to clobber, which makes it easier to write assembly routines, and gives the compiler plenty of room for handling function locals.

This is normal if you come from PC or ARM development, but many MCU architectures and compilers don’t PUSH or POP registers at all; instead, specific registers (or RAM addresses) are set aside for specific functions. The advantage of GCC’s standard calling approach is simplicity, flexibility and the ability to support large projects efficiently — you also get reentrancy for free, which compilers like Keil’s C-51 require you to explicitly request when declaring the function. " [28]

---

" >3. Small general-purpose register file. My vote goes for 8 GP >registers + SP, i.e. one more register than x86.

Madness! :-) Apparently 16 is minimal for some graph coloring heuristics. " -- [29]

but

" In ARM mode, the ARM/Thumb is a RISC-like 32-bit processor with 16 registers. In Thumb mode, a compressed instruction encoding is used, with 16-bit instructions. Most instructions in Thumb mode are two-address, and can only access the first 8 registers." -- [30]

but

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.88.7342&rep=rep1&type=pdf

showed that with less than 12 registers their test algorithms take much more resources

but

i skimmed the first results of https://www.google.com/search?q=%22graph+coloring%22+%2216+registers%22 and didn't see anything about a requirement for 16, in fact many papers talked about both 8 and 16 registers and compared them.

---

so maybe we should have less special-purpose registers and more GPRs. If we did this, we could put the special-purpose registers under an instruction similar to SYSINFO.

Also, we probably want to add a (data) stack frame register in addition to the data stack register. Of course, this could be convention, unless we give it special addressing mode support.

As for addressing modes in Oot (recalling that Oot Boot doesn't support addressing modes), may want to add indirect indexed, which can be useful for referencing arrays. Alternately, instead of assuming a 'zero page' as indirect indexed does, you can just have an addressing mode which adds two registers together and uses the result as the effective addr (one register could hold a pointer and the other could hold an offset). Note that this means you have to split the operand in half. In addition, it might be useful to have an addressing mode which adds a constant to the contents of a register and uses the result as the effective addr; the register could hold a pointer and the constant could be an offset; again we have a split operand. Alternately, the whole operand could be a constant, and the register could implicitly be the stack frame register.
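
to pin down the candidate modes from the previous paragraph, here's how the effective-address calculations would look in a hypothetical interpreter (my sketch; the register file layout and REG_FRAME are made-up names, not Boot's actual design):

  #include <stdint.h>

  uint32_t reg[16];              /* hypothetical register file */
  enum { REG_FRAME = 14 };       /* hypothetical (data) stack frame register */

  /* register + register: the operand is split into two register fields */
  uint32_t ea_reg_plus_reg(unsigned rbase, unsigned rindex) {
      return reg[rbase] + reg[rindex];      /* pointer + offset */
  }

  /* register + constant: the operand is split into a register field and an immediate */
  uint32_t ea_reg_plus_const(unsigned rbase, int16_t offset) {
      return reg[rbase] + (int32_t)offset;  /* pointer + constant offset */
  }

  /* whole-operand constant, implicitly relative to the stack frame register */
  uint32_t ea_frame_relative(int16_t offset) {
      return reg[REG_FRAME] + (int32_t)offset;
  }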

just something to think about. i don't think that commentator was correct that 16 regs are crucial for good graph coloring, and it would be annoying to have to save an opcode or two for a 'pseudoregister' instruction(s).

On the other hand, which is more annoying: having to waste opcodes on pseudoregister instructions, or having some of the registers be 'special', necessitating extra checks in every addressing mode effective address calculation? hmm yeah maybe i'll take the special instructions, please. Otoh this would mean that using the PC as an offset to pointers becomes awkward. And we do want the stack pointers to be normal registers, right? But what about those addressing modes in Boot that don't accept stack pointers? Arg...

(tentatively added GETSTATE and SETSTATE)

---

https://stackoverflow.com/questions/1518711/how-does-free-know-how-much-to-free?rq=1

it would be nice to have an instruction to check the size of an allocation

(tentatively added MSIZE)
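
the usual answer from that stackoverflow question is that the allocator stashes a small header just before the block it hands out; here's a sketch of the idea (mine; not any particular libc's actual layout; an MSIZE primitive could just read such a header):

  #include <stdlib.h>
  #include <stdint.h>

  typedef struct { size_t size; } alloc_header;

  void *my_malloc(size_t n) {
      alloc_header *h = malloc(sizeof(alloc_header) + n);
      if (!h) return NULL;
      h->size = n;           /* record the usable size just before the block */
      return h + 1;          /* hand out the memory after the header */
  }

  size_t my_msize(void *p) { /* what an MSIZE instruction would return */
      return ((alloc_header *)p - 1)->size;
  }

  void my_free(void *p) {
      if (p) free((alloc_header *)p - 1);
  }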

---

some evidence for having:

" It was also designed for C compilers, too — with 32 registers available at all times, compilers can efficiently juggle around many operands concurrently; the 8051, by comparison, has four banks of eight registers that are only easily switched between within interrupt contexts (which is actually quite useful).

And interrupts are one of the weak points of the AVR core: there’s only one interrupt priority, and depending on the ISR, many registers will have to be pushed to the stack and restored upon exit. In my testing, this often added 10 PUSH instructions or more — each taking 2 cycles. "

---

"In addition to the normal CPU registers, Arm cores have 13 general-purpose working registers, which is roughly the sweet spot."

---

" The core has a nested vector interrupt controller, with up to 32 interrupt vectors and 4 interrupt priorities — plenty when compared to the 8-bit competition ... also has full support for runtime exceptions, which isn’t a feature found on 8-bit architectures. "

---

" One of the biggest problems with ARM microcontrollers (((compared to eg 8-bit and 16-bit MCUs))) is their low code density for anything other than 16- and 32-bit math — even those that use the 16-bit Thumb instruction set. This means normal microcontroller type routines — shoving bytes out a communication port, wiggling bits around, performing software ADC conversions, and updating timers — can take a lot of code space on these parts. Exacerbating this problem is the peripherals, which tend to be more complex — I mean “flexible” — than 8-bit parts, often necessitating run-time peripheral libraries and tons of register manipulation.

Another problem with ARM processors is the severe 12-cycle interrupt latency. When coupled with the large number of registers that are saved and restored in the prologue and epilogue of the ISR handlers, these cycles start to add up. ISR latency is one area where a 16 MHz 8-bit part can easily beat a 72 MHz 32-bit Arm microcontroller. ... Because of its small core and fast interrupt architecture, the 8051 architecture is extremely popular for managing peripherals used in real-time high-bandwidth systems, such as USB web cameras and audio DSPs, and is commonly deployed as a house-keeping processor in FPGAs used in audio/video processing and DSP work. "

---

" The STM8 core has six CPU registers: a single accumulator, two index registers, a 24-bit program counter, a 16-bit stack pointer, and a condition register. "

---

STM8:

" The claim to fame of the core is its comprehensive list of 20 addressing modes, including indexed indirect addressing and stack-pointer-relative modes. There’s three “reaches” for addressing — short (one-byte), long (two-byte), and extended (three-byte) — trading off memory area with performance. "

---

some MCU conclusions from mcuComparisons, after reading [31]:

" if i had to prioritize some of these i'd say:

note: at one point the author says that some criteria for his choices were:

to summarize even shorter:

---

in reply to a Q about RISC-V simulators:

" The QEMU port is out of date, but there's an active effort going on to update it right now

  https://github.com/riscv/riscv-qemu/pull/70

There's a handful of other ISA simulators available for RISC-V, you can build and boot a kernel on Spike (our ISA golden model) by running "make sim" here:

  https://github.com/sifive/freedom-u-sdk

It's the same kernel image that runs on the FPGA and will run on the ASIC based boards. "

---

how few signals/interrupts can we get away with? (we want to minimize the memory devoted to interrupt handler entry point tables, although i suppose these could just be linked lists). Cortex-M0/M0+/M1 limits vendors to 32 interrupts at most. [32]. Here's a table of ~29-~32 POSIX-defined signals: https://en.wikipedia.org/wiki/Signal_(IPC)#POSIX_signals . https://en.wikipedia.org/wiki/Signal_(IPC)#Miscellaneous_signals lists 7 more nonstandard signals. Even for the standard signals, "For most signals the corresponding signal number is implementation-defined". The ones with standard numbers are listed in [33], which lists 7: SIGHUP, SIGINT, SIGQUIT, SIGABRT, SIGKILL, SIGALRM, SIGTERM.

https://www.elprocus.com/types-of-interrupts-in-8051-microcontroller-and-interrupt-programming/ has 5 interrupts [34]: Timer 0 overflow, Timer 1 overflow, External hardware interrupt INT0, External hardware interrupt INT1, Serial communication interrupt.

PIC16F877 has 15 interrupts [35]. PIC micro has 15 interrupts [36].

AT90USB1287 maybe has 4 interrupts? [37] [38]. ATMega8515 maybe has 18? [39]. ATmega328P has 26, ATtiny4313 has 21, ATtiny85 has 14 [40].

MSP430 has about 32 interrupt priorities and 32 defined interrupts? [41].

The demo figure in [42] has 8 interrupt priority levels.

"Newer x86 systems integrate an Advanced Programmable Interrupt Controller (APIC) that conforms to the Intel APIC Architecture. These APICs support a programming interface for up to 255 physical hardware IRQ lines per APIC, with a typical system implementing support for only around 24 total hardware lines."

"There are 256 interrupt vectors on x86 CPUs, numbered from 0 to 255 which act as entry points into the kernel. The number of interrupt vectors or entry points supported by a CPU differs based on the CPU architecture."

" Common practice is to leave the first 32 vectors for exceptions, as mandated by Intel. However you partition the rest of the vectors is up to you. "

" There are actually two PICs on most systems, and each has 8 different inputs, plus one output signal that's used to tell the CPU that an IRQ occurred. "

linux man 7 signal: " First the signals described in the original POSIX.1-1990 standard.

       Signal     Value     Action   Comment
       ──────────────────────────────────────────────────────────────────────
       SIGHUP        1       Term    Hangup detected on controlling terminal
                                     or death of controlling process
       SIGINT        2       Term    Interrupt from keyboard
       SIGQUIT       3       Core    Quit from keyboard
       SIGILL        4       Core    Illegal Instruction
       SIGABRT       6       Core    Abort signal from abort(3)
       SIGFPE        8       Core    Floating-point exception
       SIGKILL       9       Term    Kill signal
       SIGSEGV      11       Core    Invalid memory reference
       SIGPIPE      13       Term    Broken pipe: write to pipe with no
                                     readers; see pipe(7)
       SIGALRM      14       Term    Timer signal from alarm(2)
       SIGTERM      15       Term    Termination signal
       SIGUSR1   30,10,16    Term    User-defined signal 1
       SIGUSR2   31,12,17    Term    User-defined signal 2
       SIGCHLD   20,17,18    Ign     Child stopped or terminated
       SIGCONT   19,18,25    Cont    Continue if stopped
       SIGSTOP   17,19,23    Stop    Stop process
       SIGTSTP   18,20,24    Stop    Stop typed at terminal
       SIGTTIN   21,21,26    Stop    Terminal input for background process
       SIGTTOU   22,22,27    Stop    Terminal output for background process
       The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.
       Next the signals not in the POSIX.1-1990 standard but described in SUSv2 and POSIX.1-2001.
       Signal       Value     Action   Comment
       ────────────────────────────────────────────────────────────────────
       SIGBUS      10,7,10     Core    Bus error (bad memory access)
       SIGPOLL                 Term    Pollable event (Sys V).
                                       Synonym for SIGIO
       SIGPROF     27,27,29    Term    Profiling timer expired
       SIGSYS      12,31,12    Core    Bad system call (SVr4);
                                       see also seccomp(2)
       SIGTRAP        5        Core    Trace/breakpoint trap
       SIGURG      16,23,21    Ign     Urgent condition on socket (4.2BSD)
       SIGVTALRM   26,26,28    Term    Virtual alarm clock (4.2BSD)
       SIGXCPU     24,24,30    Core    CPU time limit exceeded (4.2BSD);
                                       see setrlimit(2)
       SIGXFSZ     25,25,31    Core    File size limit exceeded (4.2BSD);
                                       see setrlimit(2)...
       Next various other signals.
       Signal       Value     Action   Comment
       ────────────────────────────────────────────────────────────────────
       SIGIOT         6        Core    IOT trap. A synonym for SIGABRT
       SIGEMT       7,-,7      Term    Emulator trap
       SIGSTKFLT    -,16,-     Term    Stack fault on coprocessor (unused)
       SIGIO       23,29,22    Term    I/O now possible (4.2BSD)
       SIGCLD       -,-,18     Ign     A synonym for SIGCHLD
       SIGPWR      29,30,19    Term    Power failure (System V)
       SIGINFO      29,-,-             A synonym for SIGPWR
       SIGLOST      -,-,-      Term    File lock lost (unused)
       SIGWINCH    28,28,20    Ign     Window resize signal (4.3BSD, Sun)
       SIGUNUSED    -,31,-     Core    Synonymous with SIGSYS"

so looks like they have about 32 'value' numbers available. But also:

" Real-time signals Starting with version 2.2, Linux supports real-time signals as originally defined in the POSIX.1b real-time extensions (and now included in POSIX.1-2001). The range of supported real-time sig‐ nals is defined by the macros SIGRTMIN and SIGRTMAX. POSIX.1-2001 requires that an implementa‐ tion support at least _POSIX_RTSIG_MAX (8) real-time signals.

       The Linux kernel supports a range of 33 different real-time signals, numbered 32 to 64."

see also http://www.gnu.org/software/libc/manual/html_node/Standard-Signals.html
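
For comparison with the 'signals must have data attached' idea further down: POSIX real-time signals are queued and carry a payload, unlike the classic signals in the tables above. A minimal sketch using only standard POSIX calls (sigaction, sigqueue):

    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static volatile sig_atomic_t got;

    static void rt_handler(int sig, siginfo_t *info, void *uc) {
        (void)sig; (void)uc;
        got = info->si_value.sival_int;   /* real-time signals carry a payload */
    }

    int main(void) {
        struct sigaction sa = {0};
        sa.sa_flags = SA_SIGINFO;         /* ask for the siginfo_t argument */
        sa.sa_sigaction = rt_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGRTMIN, &sa, NULL);   /* SIGRTMIN: first real-time signal */

        union sigval v = { .sival_int = 42 };
        sigqueue(getpid(), SIGRTMIN, v);  /* queued, not collapsed like classic signals */
        printf("handler saw payload %d\n", (int)got);
        return 0;
    }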

Note that thread descriptors in Pthreads probably don't each need such a table of entry points: " It is not possible to install "per-thread" signal handlers.

From man 7 signal (emphasis by me):

    The signal disposition is a per-process attribute: in a multithreaded application, the disposition of a particular signal is the same for all threads."

This suggests that we might want to start with a limit of between 8 and 32 signals (inclusive).

I'm leaning towards 16 (b/c it's in the middle of that range) or 32 (because it accommodates all AVRs and Cortex-M0 ARMs).

16 interrupts requires 1 16-bit word to individually mask. If there are 16 priority levels (4 bits), that requires a further 4 words to set the priority levels (unless we just say that the interrupt number also determines its priority). Plus 16 words for the entry points. For a total of 21 words or so.

32 interrupts requires 2 16-bit words to individually mask. If there are 16 priority levels (4 bits), that requires a further 8 words to set the priority levels. Plus 32 words for the entry points. For a total of 42 words or so.
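
As a sanity check on that arithmetic, the 32-interrupt layout written out as a C struct of 16-bit words (field names are made up):

    #include <stdint.h>

    typedef struct {
        uint16_t mask[2];      /* 1 enable bit per interrupt: 32 bits  = 2 words */
        uint16_t priority[8];  /* 4 bits per interrupt:       128 bits = 8 words */
        uint16_t entry[32];    /* one entry-point word per interrupt  = 32 words */
    } intr_block32;            /* total: 2 + 8 + 32 = 42 words */

    _Static_assert(sizeof(intr_block32) == 42 * 2, "expected 42 16-bit words");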

A cache line in some desktop processors is 64 bytes.

What's the conservative thing to do here? More interrupts wastes (about 21 words of) memory in every process. But too few is probably a more serious problem, akin to running out of address space by using a low bitwidth. So maybe 32 is more conservative. Otoh the cost of having too few interrupts is probably just having to software-multiplex the additional hardware interrupts onto a catch-all VM interrupt, which increases latency for the multiplexed interrupts but doesn't affect the others -- and we're not exactly targeting bare-metal real-time MCU control applications here.
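
A sketch of what that software multiplexing could look like, assuming a catch-all signal number and a platform register reporting which external line fired (all names invented; __builtin_ctz is a GCC/Clang builtin):

    #include <stdint.h>

    typedef void (*handler_fn)(int source);

    extern uint32_t read_pending_external(void);  /* hypothetical platform register */
    extern handler_fn ext_handlers[32];           /* software table for extra sources */

    /* handler for the one catch-all VM interrupt: find which external
       source(s) actually fired and dispatch in software; only these
       sources pay the extra latency */
    void catchall_handler(void) {
        uint32_t pending = read_pending_external();
        while (pending) {
            int src = __builtin_ctz(pending);     /* lowest set bit */
            pending &= pending - 1;               /* clear it */
            if (ext_handlers[src])
                ext_handlers[src](src);
        }
    }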

I guess the practical thing to do is 32. But i really want to do 16, because i hate wasting 32 words for the entry points.

But let's be practical. 32.

One thing to consider is that, instead of fixing a memory layout, if we have separate 'commands' or even opcodes to tell the VM to alter signal #x, then we don't have to prespecify the number of signals. But commands would have to be provided to set the masking and the priority as well as the entry point. Also remember that masking all (except non-maskable) signals should be quick.
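
Concretely, the command/opcode alternative might amount to an interface like this (a sketch; all names invented):

    #include <stdint.h>

    /* with these, the signal count need not be baked into a memory map */
    void vm_signal_set_entry(int sig, uint16_t entry_point);
    void vm_signal_set_priority(int sig, int priority);
    void vm_signal_set_mask(int sig, int masked);
    void vm_mask_all(void);   /* must stay cheap: one op, no per-signal loop */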

This can all go in BootX.

..i dunno man, are we really trying to make a structure for signals that allows us to efficiently map all external signals directly into Oot? Or are we just trying to have our own little mini intra-Oot signals? In the latter case, we could do without separate opcodes, have just 16 signals, and have a fixed memory map, which may be slightly simpler. Mapping external signals into Oot could be the work of further libraries.

If we just want an intra-oot signal-like-thing, we could have 16 signal entry points, 2 priorities, so one word for masking, 16 words for entry points, one word for priorities, and one word for which signals are currently pending, for a total of 19 words.
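
That 19-word map, written out as a struct of 16-bit words to check the count (names invented):

    #include <stdint.h>

    typedef struct {
        uint16_t mask;       /* 1 word: one mask bit per signal        */
        uint16_t entry[16];  /* 16 words: entry points                 */
        uint16_t priority;   /* 1 word: one priority bit per signal    */
        uint16_t pending;    /* 1 word: which signals are waiting      */
    } oot_sigblock;          /* total: 1 + 16 + 1 + 1 = 19 words       */

    _Static_assert(sizeof(oot_sigblock) == 19 * 2, "expected 19 16-bit words");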

otoh we like generality, so maybe the opcode thing is worth it.

i guess the questions are:

if we want generality, we probably want opcodes for:

Alternately, we want to define a 'signal descriptor' struct and then have an opcode to fetch the address of a signal descriptor.
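
One guess at what such a 'signal descriptor' could hold (fields are assumptions, not a spec); the opcode would just return the address of the descriptor for a given signal number:

    #include <stdint.h>

    typedef struct {
        uint16_t entry;     /* handler entry point                  */
        uint8_t  priority;  /* priority (and maybe subpriority)     */
        uint8_t  flags;     /* masked / pending / nestable / ...    */
    } sig_descriptor;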

Maybe we should lump this all into lumpy SETMODE and GETMODE opcodes, which could take many arguments on the stack. Hmm, that seems pretty attractive.

However, another argument for a hard limit on # of signals is to promote interoperability; otherwise some Oot programs may assume that, say, 64 signals are available, while some Oot implementations will only offer 16.

Another issue with all this is that most platforms probably won't support nested signals with priorities and subpriorities, so guaranteeing this forces a lot more work on BootX implementors. Maybe simpler to forget about nesting and priorities. Some platforms may not even support individually masking certain signals.

Otoh we could always just provide all this and then allow some implementations to not support it (eg not support signals at all), or not support parts of it (eg support signals and global signal masking, but not local signal masking, priorities, or nested signals).

Also, we could separate the 'internal' signal numbers from 'external' ones, and then provide some way to wire them up, eg to say "when external interrupt 0x80 occurs, that corresponds to signal 33 in BootX".
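
A sketch of such a wiring table, with invented names and an assumed bound on internal signal numbers:

    #define N_INTERNAL_SIGS 64              /* assumption for illustration */

    static int external_binding[256];       /* indexed by external interrupt number */

    int bind_external(int ext, int internal_sig) {
        if (internal_sig < 0 || internal_sig >= N_INTERNAL_SIGS)
            return -1;
        external_binding[ext & 0xFF] = internal_sig;
        return 0;
    }

    /* eg bind_external(0x80, 33); */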

Currently i'm leaning towards saying that signal support is optional, but if supported, at least 16 signals must be supported, and using GETMODE and SETMODE.

Of course, signals must have data attached.

And/or, we could unify signals and channels. So upon receiving data on some channels, the data just sits there until you check it (unless you are waiting for it via 'poll'), but upon receiving data on other channels, it asynchronously interrupts you and transfers control to the signal handler. Hmm, that sounds intriguing... so are GETMODE and SETMODE now operating on individual channels/fids/IO-addresses? Are they like fstat? What about global 'modes', like masking all maskable signals? Is this a GET/SETMODE on '0'? Can we/should we unify GETMODE and SYSINFO -- mb SYSINFO is GETMODE on channel 0, and a compiler that wants to statically detect a SYSINFO must use constant propagation to statically detect that the channel is 0 -- wait no we have a zero register, no constant propagation needed.

I like this. Yes, unify signals and channels. Yes, use GET/SETMODE. Yes, SYSINFO = GETMODE 0. Yes, signals are optional. Yes, allow the implementation to request binding 'external' signals to 'internal' ones -- or just to prebind well-known signal numbers in a certain way (reserve some signal numbers for this). Yes, let the implementation determine the number of signals. If GETMODE is two-operand (register direct, immediate), then SYSINFO k == GETMODE R0 k --- otoh we may want to allow GETMODE to take a third operand from SMALLSTACK -- but if we did that it gets harder to statically analyze, so mb not -- 16 is already a lot.
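
A sketch of the implementation-side dispatch this implies, with channel 0 playing the role of 'the system' (all names invented):

    #include <stdint.h>

    extern uint16_t sysinfo_query(uint16_t key);
    extern uint16_t channel_mode_query(uint16_t channel, uint16_t key);

    uint16_t getmode(uint16_t channel, uint16_t key) {
        if (channel == 0)
            return sysinfo_query(key);   /* SYSINFO = GETMODE on channel 0 */
        return channel_mode_query(channel, key);
    }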

---