proj-oot-ootAssemblyNotes18

---

should probably read at least

https://en.wikipedia.org/wiki/RISC-V

---

for OSs, hypervisors, etc, some thoughts:

" the ecall instruction, the common instruction used to call the next most privileged layer in the processor. For example, an ecall executed at the User layer will likely be handled at the Supervisor layer. An ecall executed at the Supervisor layer will be handled by the Machine layer. (This is a simplistic explanation that can get more complex with trap redirection, but we won't dive into those waters at this moment).

So, when the Supervisor (Linux kernel) executes ecall, the Machine layer's trap handler is executed in Machine mode. The code can be found in the riscv-pk at trap 9, the mcall_trap function, in machine/mtrap.c. ... The RISC-V privilege model was initially designed as an ecosystem that consists of four separate layers of privilege: User, Supervisor, Hypervisor, and Machine. The User privilege layer is, of course, the least privileged layer, where common applications are executed. Supervisor is the privilege layer where the operating system kernel (such as Linux, mach, or Amoeba) lives. The Hypervisor layer was intended to be the layer at which control subsystems for virtualization would live, but has been deprecated in more recent versions of the privilege specification. ... Each privilege layer is presumed to be isolated from all lower privileged layers during code execution, as one would expect. The CPU itself ensures that registers attributed to a specific privilege layer cannot be accessed from a less privileged layer. Thus, as a policy, Supervisor layer code can never access Machine layer registers. This segmentation helps guarantee that the state of each layer cannot be altered by lower privileged layers.

However, the original privilege specification defined memory protection in two separate places. First, the mstatus register's VM field defines what memory protection model shall be used during code execution. "

-- http://blog.securitymouse.com/2017/04/the-risc-v-files-supervisor-machine.html

to return after an ecall, "sret (supervisor exception return)"

---

where are my notes on that Princeton 'TriCheck' paper by Trippel, Manerkar, Lustig, Pellauer, and Martonosi, which claimed to find problems in RISC-V's memory consistency model/memory ordering/concurrency and got some press around early 2017?

http://mrmgroup.cs.princeton.edu/papers/ctrippel_ASPLOS17.pdf

i recall skimming some portions of that paper, surely i took notes somewhere, particularly on their recommendations (search the PDF above for 'recommend'; also section 5.1.3 contains a recommendation without using that word) (the following are quotes):

anyways, in case i DIDN'T take notes, here are the notes.

what they are doing about it (forming a task group to revise the memory consistency model): https://riscv.org/2017/04/risc-v-memory-consistency-model/

may 10 talk: Status of the RISC-V Memory Consistency Model https://riscv.org/2017/05/6th-risc-v-workshop-proceedings/ https://riscv.org/wp-content/uploads/2017/05/Wed1000am-Memory-Model-Lustig.pdf https://youtu.be/E5s54AVGV2E

mailing list search: https://groups.google.com/a/groups.riscv.org/forum/#!searchin/isa-dev/memory$20consistency$20model

their task group formation announcement on the mailing list, with details about design choices to be considered:

https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/Oxm_IvfYItY/discussion

google search: https://www.google.com/search?q=risc-v+Memory+consistency+Model&safe=active&client=ubuntu&channel=fs&source=lnt&tbs=cdr%3A1%2Ccd_min%3Amay+2017%2Ccd_max%3A&tbm=

the task group (committee) issued a Memory Consistency Model Addendum 2.2 saying what to do in the meantime to be conservative:

https://docs.google.com/viewer?a=v&pid=forums&srcid=MDQwMTcyODgwMjc3MjQxMjA0NzcBMDUwNzQ0NzcxMjczNjI2NzQwNDEBczVCLTc5VWtCd0FKATAuMQFncm91cHMucmlzY3Yub3JnAXYy mailing list discussion: https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/-p9ch4V9bKM/discussion

the above searches were done on Oct 4 2017.

also, the May talk looks like a good intro to what sorts of things the issues are:

https://riscv.org/wp-content/uploads/2017/05/Wed1000am-Memory-Model-Lustig.pdf https://www.google.com/search?q=risc-v+Memory+consistency+Model&safe=active&client=ubuntu&channel=fs&source=lnt&tbs=cdr%3A1%2Ccd_min%3A5%2F28%2F2017%2Ccd_max%3A&tbm=


" There are ongoing efforts to specify memory models for multi- threaded programming in C, C++ [57] and other languages. These efforts are influenced by the type of memory models that can be supported efficiently on existing architectures like x86, POWER and ARM. While the memory model for x86 [46, 51, 54] is cap- tured succinctly by the Total Store Order (TSO) model, the models for POWER [52] and ARM [25] are considerably more complex. The formal specifications of the POWER and ARM models have required exposing microarchitectural details like speculative exe- cution, instruction reordering and the state of partially executed in- structions, which, in the past, have always been hidden from the user. ... SC [38] is the most intuitive memory model, but naive implemen- tations of SC suffer from poor performance. ... Instead the manufactures and researchers have chosen to present weaker memory model interfaces, e.g. TSO [58], PSO [61], RMO [61], x86 [46, 51, 54], Processor Consistency [30], Weak Consis- tency [24], RC [27], CRF [55], POWER [33] and ARM [9]. The tutorials by Adve et al. [1] and by Maranget et al. [44] provide re- lationships among some of these models. The lack of clarity in the definitions of POWER and ARM mem- ory models in their respective company documents has led some researchers to empirically determine allowed/disallowed behaviors [8, 25, 41, 52]. Based on such observations, in the last several years, both axiomatic models and operational models have been devel- oped which are compatible with each other [3–5, 7, 8, 25, 41, 52, 53]. However, these models are quite complicated; for example, the POWER axiomatic model has 10 relations, 4 types of events per in- struction, and 13 complex axioms [41], some of which have been added over time to explain specific behaviors [4–6, 41]. 
The ab- stract machines used to describe POWER and ARM operationally are also quite complicated, because they require the user to think in terms of partially executed instructions [52, 53]. ... Adve et al. defined Data-Race-Free-0 (DRF0), a class of pro- grams where shared variables are protected by locks, and proposed that DRF0 programs should behave as SC [2]. Marino et al. im- proves DRF0 to the DRFx model, which throws an exception when a data race is detected at runtime [45]. However, we believe that architectural memory models must define clear behaviors for all programs, and even throwing exceptions is not satisfactory enough.

A large amount of research has also been devoted to specifying the memory models of high-level languages, e.g. C/C++ [12–15, 17, 34, 35, 37, 49, 57] and Java [18, 20, 23, 42, 43]. There are also proposals not tied to any specific language [19, 22]. This remains an active area of research because a widely accepted memory model for high-level parallel programming is yet to emerge, while this paper focuses on the memory models of underlying hardware "

-- An Operational Framework for Specifying Memory Models using Instantaneous Instruction Execution

---

from [1] :

hierarchy of common memory consistency model strengths, from strongest to weakest:

a "woefully incomplete" characterization of these memory consistency models: consider reorderings of the following instruction pairs: load/load, load/store, store/load, store/store:

SEQUENTIAL CONSISTENCY:

" 1. All threads are interleaved into a single “thread” 2. The interleaved thread respects each thread’s original instruction ordering (“program order”) 3. Loads return the value of the most recent store to the same address, according to the interleaving

...

For performance, most processors weaken rule #2, and most weaken #1 as well.

...

Q: Can I think of an execution as an interleaving of the instructions in each thread (in some order)? ... That would make it illegal to forward values from a store buffer!.. Because with a store buffer, cores can read their own writes “early”.

Option 1: forbid store buffer forwarding, keep a simpler memory model, sacrifice performance

Option 2: change the memory model to allow store buffer forwarding, at the cost of a more complex model

Nearly all processors today choose #2

Q: Can I think of an execution as an interleaving of the instructions in each thread (in some order), with an exception for store buffer forwarding?

A:

example of the exception for store buffer forwarding:

2 CPUs. Each CPU has 2 threads: threads 1 and 2 on CPU 1, threads 3 and 4 on CPU 2. Thread 1 (on CPU 1) stores a value to memory location A, then Thread 2 reads from memory location A. Starting at about the same time, Thread 3 (on CPU 2) stores a different value to memory location A, then Thread 4 reads from memory location A. If the time it takes for these stores to propagate between CPUs is short, Thread 2 perceives an ordering on which Thread 1's store came before Thread 3's store, but Thread 4 perceives the opposite ordering. So, in this case, there is no interleaving perceived by all threads.
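the scenario above can be sketched with a toy store-buffer model (a hypothetical model assumed for illustration; `read` and the buffer layout are my inventions, not any real CPU's design):

```python
def read(addr, store_buffer, memory):
    """Load: forward from the CPU-local store buffer if present, else read memory."""
    for a, v in reversed(store_buffer):   # youngest matching buffered store wins
        if a == addr:
            return v
    return memory.get(addr, 0)

memory = {"A": 0}             # main memory, shared by both CPUs
buf_cpu1, buf_cpu2 = [], []   # one store buffer per CPU, shared by its two threads

buf_cpu1.append(("A", 1))  # thread 1 (CPU 1): Store(A, 1) -- buffered, not yet globally visible
buf_cpu2.append(("A", 2))  # thread 3 (CPU 2): Store(A, 2) -- buffered

r_thread2 = read("A", buf_cpu1, memory)  # thread 2 (CPU 1) forwards from CPU 1's buffer
r_thread4 = read("A", buf_cpu2, memory)  # thread 4 (CPU 2) forwards from CPU 2's buffer

# thread 2 sees 1 (as if thread 1's store came "first"), thread 4 sees 2
# (as if thread 3's store came "first"), while memory still holds 0:
print(r_thread2, r_thread4, memory["A"])   # 1 2 0
```

no single interleaving of the four threads' operations produces both observations at once, which is why processors document store-buffer forwarding as an explicit exception to the interleaving story rather than forbid it.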

---

" Memory Model Landscape

Sequential Consistency (SC)

Total Store Order (TSO)

Weaker memory models

...

Architects find SC & TSO constraining

(((but))) Programmers hate weak memory models (((because)))...

Difficult to understand, implementation-driven weak memory models ARM, POWER, RMO, Alpha, etc....

" -- [3]

---

it may be useful to look at what was debated in the RISC-V memory consistency model task group, with the heuristic that these items are the 'unsolved' questions in the field, eg things that RISC-V may get wrong, eg complexities to try to stay away from in OVM:

" Items on the agenda currently include, in rough priority order:

...

" PENDING/POSSIBLE CHANGES TO THE MODEL

Feature, Status:

Multi-copy atomicity: Major debate!

Enforce same-address ordering (including load-load pairs): Required! (((?? but see subsequent May 13 mailing list post https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/-p9ch4V9bKM/Ah1Jb_9-BQAJ "More than likely we won't require a fence between two loads if the second is address-dependent on the first...(((but))) I don't think we're quite ready to commit to anything absolute yet.")))

Forbid load-store reordering (for accesses to different addresses), Enforce ordering of address/control/data-dependent instructions, Which FENCE types? (.pr, .pw, .sr, .sw? Other?): Still sorting out the details!

IT’S? ALWAYS SAFE TO BE CONSERVATIVE

" More than likely we won't require a fence between two loads if the second is address-dependent on the first. Practically speaking, a lot of software (notably Linux) basically assumes that hardware always guarantees this to work, because no major architecture since Alpha has relaxed such orderings.

However, we don't yet have 100% consensus on how exactly to formalize this in the task group, and whether address, control, and data dependencies are all equivalent in strength, whether they apply equally well to read-read vs. read-write orderings, or even whether all of the above should even be enforced. So I don't think we're quite ready to commit to anything absolute yet. " -- [6]

"

"

– Architects should pay careful attention to aggressive memory access reordering, aggressive cache coherence protocols, and designs that share store buffers between threads.

– Hardware should respect all same-address orderings (including load-load pairs) and any orderings established by address, control, and data dependencies.

C/C++ Construct, Base ISA Mapping, ‘A’ Extension Mapping:

Non-atomic Load...

atomic load(memory order consume): ld; fence r,rw
atomic load(memory order acquire): ld; fence r,rw
atomic load(memory order seq cst): fence rw,rw; ld; fence r,rw

Non-atomic Store...

atomic store(memory order relaxed): sd
atomic store(memory order release): fence rw,w; sd (or, with the 'A' extension, amoswap.rl)
atomic store(memory order seq cst): fence rw,rw; sd (or, with the 'A' extension, fence rw,rw; amoswap)

Fences:

atomic thread fence(memory order acquire): fence r,rw
atomic thread fence(memory order release): fence rw,w
atomic thread fence(memory order acq rel): fence rw,rw
atomic thread fence(memory order seq cst): fence rw,rw

Furthermore, we recommend compiler writers avoid fences weaker than fence r,rw, fence rw,w, and fence rw,rw until the memory model clarifies their semantics. Additionally, while AMOs with both the aq and rl bits set do imply both aq and rl semantics, we recommend against their use until the memory model clarifies their combined semantics.

" -- RISC-V Memory Consistency Model Addendum 2.2

" Weak memory models: Technical issues

Atomic memory systems

(((is the following an example of an atomic memory system, or an example of a NON-atomic memory system?)))

Consensus: RISC-V memory model definition will rely only on atomic memory " -- [9]

from https://www.bsc.es/sites/default/files/public/u1810/arvind_0.pdf :

"Example: Ld-St Reordering Permitting a store to be issued to the memory before previous loads have completed, allows load values to be affected by future stores in the same thread" For example,

Process 1: r1 = Load(a) Store(b,1)

Process 2: r2 = Load(b) Store(a,r2)

Load-store reordering would allow the '1' stored by Process 1 into b to be loaded into r2 by process 2's load, and then stored into a by process 2's store, and then loaded into r1 from a by process 1! Implementation-wise, here is what could happen:

"

" Load-Store Reordering

Nvidia says it cannot do without Ld-St reordering

Although IBM POWER memory model allows this behavior, the server-end POWER processors do not perform this reordering for reliability, availability and serviceability (RAS) reasons

MIT opposes the idea because it complicates both the operational and axiomatic definitions, and MIT estimates no performance penalty in disallowing Ld-St reordering

Nevertheless MIT has worked diligently to come up with a model that allows Ld-St ordering (((perhaps e meant REordering? also, MIT's primary proposed model, WMM, prohibits load-store reordering, so note that they are talking about their 'Model X' here, which is detailed later in the slides))) " -- [10]

" C++ operations, WMM instructions Non-atomic Load / Load Relaxed: Ld Load Consumed / Load Acquire: Ld; Reconcile Load SC: Commit; Reconcile; Ld; Reconcile Non-atomic Store / Store Relaxed: St Store Released /Store SC: Commit; St

Compilation from C++11 to WMM C++11 introduces atomic variables in addition to the ordinary (non-atomic) ones

" RISC-V memory model debate is not settled; in spite of lot of research by the Memory Model Committee (Chair Dan Lustig), the community may vote for TSO "

---

figure 1 from [11], broken into parts, and with some details and notes omitted:

Operational model, Axiomatic model:

this suggests that we should restrict our attention to:

which are the ones which have both simple operational and simple axiomatic models

"reasoning (((about))) partially executed instructions...is unavoidable for ARM and POWER operational definitions." [12]

the rest of figure 1 from [13], for these rows only, with some details and notes omitted:

Store atomicity, Allow shared write-through cache/shared store buffer, Instruction reorderings, Ordering of data-dependent loads:

(note: this group's alternative proposal, WMM-S, for which the operational model complexity was 'medium' and no axiomatic model was provided, has non-atomic store atomicity; i think that one point that the authors may be trying to make is that you want at least multi-copy atomicity for clean semantics; in the paper's conclusion they say "Since there is no obvious evidence that restricting to multi-copy atomic stores affects performance or increases hardware complexity, RISC-V should adopt WMM in favor of simplicity.". Elsewhere in the table, in row 'ARM and POWER', store atomicity is classified as 'Non-atomic', although note that [14] says that ARMv8.2 is "(other-/weak-)multi-copy atomic" as opposed to POWER and GPU which are "not multi-copy atomic"). Indeed [15] says "The...manuals for ARMv7 and early ARMv8 described a relaxed memory model, with programmer-visible out-of-order and speculative execution, that was non-multicopy-atomic... The ARMv8 architecture has therefore been revised: it now has a multicopy-atomic model."

multi-copy atomicity is defined here:

" In this paper we propose two weak memory models for RISC-V: WMM and WMM-S, which balance definitional simplicity and implementation flexibility differently. The difference between the two models is regarding store atomicity, which is often classified into the following three types [19]:

The abstract of [17] defines "non-multicopy-atomic" as "writes could become visible to some other threads before becoming visible to all".
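the standard litmus test for multi-copy atomicity is IRIW ("independent reads of independent writes"); here is a sketch (test harness assumed by me, not from the sources) showing that under SC, i.e. with multi-copy-atomic stores, the two reader threads can never disagree about the order of the two writes, whereas a non-multi-copy-atomic machine (e.g. POWER) may allow exactly that:

```python
from itertools import permutations

# T1 writes x, T2 writes y, T3 reads x then y, T4 reads y then x
ops = {
    "w1": ("st", "x"),                     # T1: x = 1
    "w2": ("st", "y"),                     # T2: y = 1
    "r3a": ("ld", "x"), "r3b": ("ld", "y"),   # T3: read x, then y
    "r4a": ("ld", "y"), "r4b": ("ld", "x"),   # T4: read y, then x
}

def run(schedule):
    """Execute one total order of the six operations against shared memory."""
    mem = {"x": 0, "y": 0}
    out = {}
    for name in schedule:
        kind, addr = ops[name]
        if kind == "st":
            mem[addr] = 1
        else:
            out[name] = mem[addr]
    return out["r3a"], out["r3b"], out["r4a"], out["r4b"]

# all interleavings that respect each reader's program order
outcomes = set()
for order in permutations(ops):
    if order.index("r3a") < order.index("r3b") and \
       order.index("r4a") < order.index("r4b"):
        outcomes.add(run(order))

# "T3 saw x's write but not y's, AND T4 saw y's write but not x's" would mean
# the two writes became visible in different orders to different threads:
assert (1, 0, 1, 0) not in outcomes
```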

So this suggests that we should restrict our attention to:

which have the common characteristics of at least the following model strengths:

and of which at least one of which permits the following model weaknesses:

---

more background on WMM; the paper says

" The memory model for RISC-V, a newly developed open source ISA, has not been finalized yet and thus, offers an opportunity to evaluate existing memory models. We believe RISC-V should not adopt the memory models of POWER or ARM, because their axiomatic and operational definitions are too complicated. We propose two new weak memory models: WMM and WMM-S, which balance definitional simplicity and implementation flexibility differently. Both allow all instruction reorderings except overtaking of loads by a store. We show that this restriction has little impact on performance and it considerably simplifies operational definitions. It also rules out the out-of-thin-air problem that plagues many definitions. WMM is simple (it is similar to the Alpha memory model), but it disallows behaviors arising due to shared store buffers and shared write-through caches (which are seen in POWER processors). WMM-S, on the other hand, is more complex and allows these behaviors. " -- https://arxiv.org/pdf/1707.05923.pdf

---

Instantaneous Instruction Execution (I2E): a formalism invented by the MIT group working on the RISC-V memory consistency model, for specifying memory consistency models operationally

"

SC in I2E:

TSO in I2E:

Simple and vendor-independent

TSO allows loads to overtake stores " -- [18]
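the SC-vs-TSO contrast in those slides can be sketched operationally (my own simplified toy model in the spirit of I2E, not the paper's formal definitions) on the classic store-buffering test, T1: x = 1; r1 = y and T2: y = 1; r2 = x. In the SC machine every instruction executes instantaneously against monolithic memory, so (r1, r2) = (0, 0) is impossible; the TSO machine adds a per-thread store buffer that drains nondeterministically, which is exactly what lets a load overtake an older store and makes (0, 0) reachable:

```python
from itertools import permutations

PROGS = {"T1": [("st", "x"), ("ld", "y")],
         "T2": [("st", "y"), ("ld", "x")]}

def sc_outcomes():
    """SC: each instruction executes instantaneously against monolithic memory."""
    outs = set()
    labels = [(t, i) for t in PROGS for i in range(2)]
    for order in permutations(labels):
        if any(order.index((t, 0)) > order.index((t, 1)) for t in PROGS):
            continue                          # keep each thread's program order
        mem, regs = {"x": 0, "y": 0}, {}
        for t, i in order:
            kind, addr = PROGS[t][i]
            if kind == "st":
                mem[addr] = 1
            else:
                regs[t] = mem[addr]
        outs.add((regs["T1"], regs["T2"]))
    return outs

def tso_outcomes():
    """TSO: like SC, but stores sit in a per-thread buffer that drains later."""
    outs = set()

    def step(pcs, bufs, mem, regs):
        moved = False
        for ti, t in enumerate(("T1", "T2")):
            if bufs[ti]:                      # choice: drain oldest buffered store
                addr, val = bufs[ti][0]
                mem2 = dict(mem); mem2[addr] = val
                bufs2 = tuple(b[1:] if i == ti else b for i, b in enumerate(bufs))
                step(pcs, bufs2, mem2, regs)
                moved = True
            if pcs[ti] < 2:                   # choice: execute next instruction
                kind, addr = PROGS[t][pcs[ti]]
                pcs2 = tuple(p + 1 if i == ti else p for i, p in enumerate(pcs))
                if kind == "st":              # store goes into the buffer, not memory
                    bufs2 = tuple(b + ((addr, 1),) if i == ti else b
                                  for i, b in enumerate(bufs))
                    step(pcs2, bufs2, mem, regs)
                else:                         # load: forward from own buffer, else memory
                    fwd = [v for a, v in bufs[ti] if a == addr]
                    regs2 = dict(regs); regs2[t] = fwd[-1] if fwd else mem[addr]
                    step(pcs2, bufs, mem, regs2)
                moved = True
        if not moved:                         # all instructions done, buffers drained
            outs.add((regs["T1"], regs["T2"]))

    step((0, 0), ((), ()), {"x": 0, "y": 0}, {})
    return outs

assert (0, 0) not in sc_outcomes()    # forbidden under SC
assert (0, 0) in tso_outcomes()       # allowed once loads can overtake stores
```

note that every SC outcome is also a TSO outcome (drain each store immediately), which matches the usual statement that TSO strictly weakens SC.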