proj-oot-ootAssemblyNotes25

(June 30 2019)

So i'm mostly happy with Boot right now (even though it's not fully implemented or specified and what is implemented and specified is not quite in agreement with my latest plans esp. regarding redoing the jump label stuff to make stuff compilable; i do plan to fix that sometime soon), and i've stopped focusing on it and moved up to thinking about OVM.

I still have two issues/decisions to make about potentially changing Boot though. I'm trying to think if we could make things even simpler:

Both of these potential changes come out of the same line of thinking:

Now that i'm thinking about OVM, Boot's function as an easy-to-implement portability layer is coming into focus. If people want something really easy to implement, they will implement Boot, not BootX?. And if they want to do more work to get better interop or performance, they will implement OVMlow, not BootX?. So really, the purpose of BootX? is some sort of project-internal efficiency or readability thing, not something that porters will really be concerned with. Except, right now if they want to add floating point functionality, or 64-bit int functionality, or many of the syscalls, they have to switch to BootX?. Which is why i'm thinking of (a) making Boot an all-inclusive, one-stop-shop for porters; so that if they implement Boot, they're done; and (b) making Boot even simpler by switching to the 32-bit representation.

In essence, 'Boot' would be its own layer, and BootX? (although still separately specified and probably separately useful for other projects) would, from the point of view of Oot porters, be thought of as 'part of that OVM layer stuff that we don't have to think about'.

Otoh, there's also something to be said for the current setup:

Either way though, i'm pretty happy with Boot. Which is why i'm thinking more about OVM.

---

For source code without any atomics/synchronization/fences (for example "a = 3; b = a + 2; print(b)"), i'd like to compile this code in the most obvious way on each platform (for example, on many platforms the previous example could compile to placing a constant value into a location, adding another constant value to that location, and then making a function call with the location as argument, with no fencing/memory barriers/synchronization at all). But if my language's memory ordering consistency model is stronger than that of some target platform, then in order to deliver the guarantees of my language's memory ordering consistency model, in some cases i may have to insert some sort of synchronization into the compiled code on that target platform.
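
For example, a minimal sketch (mine; the names are made up) of what i mean by the 'obvious' compilation, rendered here as C with ordinary loads/stores and no fences or synchronization of any kind:

    #include <stdio.h>
    #include <stdint.h>

    /* "a = 3; b = a + 2; print(b)" compiled 'the obvious way': plain
       (non-atomic) stores and loads, then a call, with no memory barriers. */
    void example(int32_t *a, int32_t *b) {
        *a = 3;
        *b = *a + 2;
        printf("%d\n", *b);
    }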

To avoid having to think about that when porting to a new platform, if possible, i'd like to choose a sufficiently weak memory ordering consistency model for my language, so that on all of the likely target platforms the model's guarantees are delivered without me having to further understand the target platform's memory consistency model, at least for source code without synchronization. (in fact, the previous sentence is actually a little optimistic; the pragmatic reality is that i won't understand the target platform's memory consistency model very well and so i will just compile all unsynchronized code 'in the obvious way' without any thought at all, regardless of whether or not that fits the guarantees of the language's model; i'm just trying to choose a memory consistency model for the language that justifies this behavior, so that users of the language aren't surprised. Of course i could just say "the ordering of events (as observed by any thread except the thread that generated the events) is completely undefined in the absence of explicit synchronization" but that seems a bit much.).

Some major target platforms are RISC-V, C, ARM, x86-64.

So i figure that what i should do is make my language's memory consistency model something close to the strongest thing that is still weak enough such that meeting the guarantees of any of those major target platforms' memory consistency models implies that the guarantees of my language's memory consistency model are met.

So which one is weakest? Rather, are any of those platforms' memory ordering consistency models weaker/stronger than any of the others for code without atomics, fences, memory barriers, or synchronization? And more precisely: given an arbitrary piece of unsynchronized code (perhaps barring pathologically platform-specific code), if the same code were translated onto each of these platforms, then certain guarantees on ordering would apply on each platform depending on which memory ordering consistency model that platform offers -- are there any pairs of platforms in this list such that one member of the pair will always satisfy not just its own guarantees, but also all of the guarantees promised by the other member of the pair?

eh, i think that's not QUITE my question, although it mostly is.

First, i'm willing to accept a not-completely-relaxed memory model provided that it seems like it's something that "everyone's doing" (that is, it's common enough that if i end up porting to a platform with a weaker model then i'll probably know, because the fact that this platform has such a weak model will be so unusual that people comment on it).

[1] suggests that ARM has Release Consistency (RC), and that link and also [2] suggest that x86 has something 'close to TSO'. And i sorta recall from the RISC-V discussion that TSO is stronger than RC. And https://github.com/riscv/riscv-isa-manual/blob/master/src/memory.tex says that RISC-V's memory model is called RVWMO and is a variant of release consistency.

So that suggests that, yeah, RISC-V's memory model is the weakish one that i seek.

The current RISC-V spec draft says: https://github.com/riscv/riscv-isa-manual/releases/download/draft-20190626-9f0e234/riscv-spec.pdf

"although the ISA does not currently contain native load-acquire or store-release instructions, nor RCpc variants thereof, the RVWMO model itself is designed to be forwards-compatible with the potential addition of any or all of the above into the ISA in a future extension. "

so is it really even a RC model?

The RISC-V FENCE instruction talks about whether reads or writes are being ordered, but not about acquire vs release.

Which C11/C++11 memory order corresponds to RC? You'd think it would be using memory_order_acquire and memory_order_release, or mb memory_order_acq_rel.

Is a RISC-V FENCE a C11 memory_order_acq_rel?

The RISC-V memory model also lets you choose between the PC (processor consistency) and SC (sequential consistency) variants of RC (that is, are the RC atomics ordered with respect to each other using PC or SC?).

Confusingly, in https://stackoverflow.com/questions/16179938/c11-memory-order-acquire-and-memory-order-release-semantics , two people think that the C folks totally got the meaning of the RC memory model wrong; but looking at https://en.cppreference.com/w/cpp/atomic/memory_order , it looks like the wording they are complaining about is not there, so perhaps it was somehow fixed already?

Also, https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Consume_ordering states that Release-Consume ordering (dependency ordering) is implicit "On all mainstream CPUs other than DEC Alpha". Does this mean that it holds between all memory accesses even ones in code with no synchronization (my guess is, yes, it does mean that). Does RISC-V share this property?

So some other questions (todo read back over this section and add more):

(a little later): ok i researched some more and got some more info:

stop the presses: the spec chapter on the Atomics extension has some stuff to say here: "The AMOs were designed to implement the C11 and C++11 memory models efficiently. Although the FENCE R, RW instruction suffices to implement the acquire operation and FENCE RW, W suffices to implement release, both imply additional unnecessary ordering as compared to AMOs with the corresponding aq or rl bit set." and "To provide more efficient support for release consistency [5], each atomic instruction has two bits, aq and rl, used to specify additional memory ordering constraints as viewed by other RISC-V harts...If both bits are clear, no additional ordering constraints are imposed on the atomic memory operation. If only the aq bit is set, the atomic memory operation is treated as an acquire access, i.e., no following memory operations on this RISC-V hart can be observed to take place before the acquire memory operation. If only the rl bit is set, the atomic memory operation is treated as a release access, i.e., the release memory operation cannot be observed to take place before any earlier memory operations on this RISC-V hart. If both the aq and rl bits are set, the atomic memory operation is sequentially consistent and cannot be observed to happen before any earlier memory operations or after any later memory operations in the same RISC-V hart and to the same address domain."
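
As a note to myself, here's roughly how i read that correspondence in C11 terms (a sketch only; the boot_* names are hypothetical, and i haven't verified that the C fences are exactly as strong as the RISC-V ones):

    #include <stdatomic.h>

    /* My reading of the spec text quoted above:
       RISC-V "FENCE R,RW" ~ an acquire fence (earlier loads ordered before
                             later loads and stores);
       RISC-V "FENCE RW,W" ~ a release fence (earlier loads and stores ordered
                             before later stores). */
    static inline void boot_fence_acquire(void) {
        atomic_thread_fence(memory_order_acquire);
    }

    static inline void boot_fence_release(void) {
        atomic_thread_fence(memory_order_release);
    }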

mb also read relevant sections of "Exploring the Trisection of Software, Hardware, and ISA in Memory Model Design"

so if i'm right, it sounds like RVWMO is good, and FENCE-sq is good, and we need to add acquire_sc &mem and release_sc &mem.

---

mb should draft a post to a RISC-V memory model guru or email list to ask my questions.

Hello, i'm designing a VM for a higher-level language, and i'm trying to understand some of the RISC-V memory consistency model. The VM may be implemented in RISC-V but also in other platforms such as C, so i'm trying to come up with instructions for the VM that can be implemented on various platforms, while also understanding how to use those VM instructions to guarantee data race freedom and/or sequentially consistent execution of programs written on top of the VM.

My current idea (suggestions welcome!) is to have 5 classes of memory instructions in my VM:

My goal is to choose a set of memory instructions for my VM that can be described by a formal memory model, and can allow the expression of efficient algorithms in the future (i.e. to allow programs to be written on the VM without unnecessary synchronization), but then to write the initial implemention of the VM itself in a conservative (oversynchronized) manner (to save myself trouble during the initial implementation). I also want to understand how to easily write programs in the language of the VM in a conservative (oversynchronized) manner, again to save myself from having to debug concurrency bugs before i get far enough along to care about performance.

In other words I want to create VM instructions that can express weaker ordering, but for the time being I want to implement these in a conservative manner by actually providing a stronger ordering than the instructions guarantee.

I had a few questions about RISC-V and about how the RVWMO model relates to others.

1) Is there a guide somewhere for how to implement C/C++ atomics on RISC-V, and if not, is there a recommended/correct implementation/C/C++ compiler whose source code could serve as a tutorial?

2) In RISC-V, in order to blindly make any program sequentially consistent, is it sufficient to place a FENCE R,RW after each load and place a FENCE RW,W before each store?

3) If one was trying to implement a FENCE instruction with similar or stronger guarantees as RISC-V's FENCE (excluding the I/O and TSO functionality), how could one implement that in C or C++, using C atomics?

4) How does C/C++'s https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence, which has a memory_order argument, relate to RISC-V's FENCE, which has flags like rw,rw?

5) On a RISC-V platform, how should one implement C/C++'s atomic_thread_fence with memory_order=seq_cst?

6) If RVWMO is multi-copy atomic then why are RCpc annotations even mentioned in the draft RVWMO chapter of the spec? isn't the whole point of Processor Consistency that it is not multi-copy atomic?

7) (i guess this question isn't really RISC-V specific) On any platform whose only memory operations are: * non-synchronized LOAD/STORE instructions, and also * synchronizing RCpc LOAD-ACQUIRE and STORE-RELEASE instructions, if data races are defined as a pair of unsynchronized instructions at least one of which is a write, and, if a particular program can be shown to have sufficient use of synchronizing instructions to eliminate all data races (that is, to induce an ordering over every pair which would otherwise be a data race, under the formal memory model of the platform), then: is that program guaranteed to be sequentially consistent on that platform?

8) Does the addition of a CAS instruction to the platform change the answer to the previous question?

thanks, bayle

OK stop the presses again, there is an Appendix A in the draft spec that gives tables of how to implement the C atomics in terms of RISC-V instructions.

Two questions i had about the appendix were:

---

probably need load_acquire and store_release variants of load and store? so might have to get rid of 'bne' to make room :(. Alternately (probably better), have 1-argument fences:

load_acquire &mem store_release &mem
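
For the first option (load/store variants), a conservative C11 sketch of what those could map to (names hypothetical; assumes 32-bit aligned accesses):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Hypothetical Boot load_acquire / store_release, conservatively mapped
       to C11 acquire loads and release stores. */
    static inline int32_t boot_load_acquire(const _Atomic int32_t *mem) {
        return atomic_load_explicit(mem, memory_order_acquire);
    }

    static inline void boot_store_release(_Atomic int32_t *mem, int32_t v) {
        atomic_store_explicit(mem, v, memory_order_release);
    }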

---

should memory_domain be an integer (like an I/O channel descriptor) or a pointer (so we can point to a memory_domain descriptor in memory)? i guess it should be like an I/O channel descriptor. All of POSIX, Windows, Plan 9 appear to use integer file descriptors (well, Windows' "HANDLE" data type is an opaque 32-bit value but i think it's internally used as an integer index into an object table), so i guess we should make it an integer too.

---

7 of the 16 possible #fence_type combinations in FENCE seem to be meaningless, as they imply that either the prior flags or the subsequent flags are all empty (, r, w, rw, ,r ,w ,rw). Do these have any effect? If not, should we use them for something else?

---

i changed it back to say that there is undefined behavior ('UB') on arithmetic overflow. This is because C has UB on signed arithmetic overflow, and we want to be able to just directly compile to C without additional checks.

also we want to allow the implementation to replace our int32s with register-width (often 64-bit) quantities, which means that we can't guarantee wraparound arithmetic:

"TL;DR: C doesn't work very well if int!=register width, but (for backwards compat) int is 32-bit on all major 64-bit targets, and this causes quite hairy problems for code generation and optimization in some fairly common cases. The signed overflow UB exploitation is an attempt to work around this." [16]

---

the more information we put into registers that doesn't really need to be there, the fewer 'real' registers we have, which means the less stuff that we'd want to map into 'real' registers on, say, an x86_64 assembly implementation, which is nice because the implementation needs a few x86_64 registers for its own use.

In addition, it's nice because then each thread costs less memory, so we can have more Oot threads in total.

some more information that we COULD put into registers for this purpose would be:

PC in r15, interrupts (flags that say if an interrupt of 16 types are pending, as well as if interrupts of those 16 types are masked) in r14, misc. status register (for future use; stuff like what protection ring are we currently running in) in r13.

we could also cache the top two stack items instead of just TOS.

should also look at what else the erlang process control block has.

---

recall that the reason for just passing arguments on smallstack in the calling convention was that if you have caller saved registers anyways, you may as well use them for arguments.

---

apparently we want callee cleanup for tail calls, i've heard

---

A nice little machine might have:

What i'm suggesting here is replacing SMALLSTACKs with even smaller ones that fit wholly in registers. This makes them unusable for macrocompilation, but for ordinary execution it would be fast.

The reason I mention these is that it's nice to have stack wholly in regs on x86_64. If the stack is wholly in regs then you can do stack pushes/pops without touching main memory.

4 stack locs is minimum to have: return address, two args, and a place to put a return value from a subroutine (without clobbering arguments). This suggests that at least 3 stack locs should be cached in registers (so that you could push a return address and two arguments before calling a function).

Also, if we have at least 2 stack locs cached in registers, then we can do a SWAP on the stack by using a register SWAP operation; so we wouldn't need separate stack SWAP operations (otoh we can already do that by doing register SWAP on $S, i guess).

Since you can still do normal register ops on the 3 stack locs, this isn't really much worse than just having 8 registers; a register-oriented program could just ignore the stack and use those as registers (it's a little worse if the program does use the stack though, because then it can't use the two new stack registers as ordinary registers, because they'd get clobbered; so now there are only 3 really-registers instead of 5, which is a big difference). And we can still provide 16 registers as normal in the R2, R1 instruction encodings.

To push in this machine, you would write to S (R1). Let's say the stack locs are in registers R2,R3,R4. Pushing would have the effect of

and popping would have the effect of:

If we specify further that the value of R4 MIGHT be saved somewhere else when you push, then this could be extended to a partially-cached, in-memory stack without breaking anything (popping already has 'undef R4' so if R4 is getting some deeper stack item that's okay).
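
A sketch of my reading of those elided push/pop effects (illustration only, not a normative definition; R2 is treated as top-of-stack):

    #include <stdint.h>

    typedef struct { int32_t r2, r3, r4; } tinystack;  /* the 3 cached slots */

    static void push(tinystack *s, int32_t v) {
        /* the old value of R4 MIGHT also be spilled somewhere deeper */
        s->r4 = s->r3;
        s->r3 = s->r2;
        s->r2 = v;
    }

    static int32_t pop(tinystack *s) {
        int32_t v = s->r2;
        s->r2 = s->r3;
        s->r3 = s->r4;
        /* R4 becomes undefined; an implementation MAY refill it from a deeper,
           in-memory part of the stack */
        return v;
    }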

So we could specify that in Boot and BootX?, the stack works like this, and then in OVM (where we want bigger smallstacks for macrocompilation), we could expand smallstack to size 16 or 32. Then when Boot instructions are executed (including in OVM), we could just not spill the stack (which gives us no memory access delay on stack ops), and then when we are in OVM mode, we could spill it.

hmm, i kinda like that.. we keep some of the encoding density benefits of having a stack, while getting much faster execution.

otoh is it really that much faster? according to [17] and [18] and [19], L1 cache latency is not much worse than register; [20] says it is about 3x as slow though, but then talks about write-back and store-forwarding which reduce that. According to [21] there is about a 6-cycle latency for L1 cache. I guess that's significant (especially since to push or pop the stack we'd have to access L1 cache at least three times; once to read the stack pointer or depth, once to spill or unspill the stack, then once to write the new stack pointer or depth back). So we should still expect a significant speedup this way.

even better, when using macroinstructions, what if we specify that stack locations above the 'current stack frame' shouldn't be accessed by the macroinstruction? Now we can give each macroinstruction 3 fresh smallstack locations (and maybe we put the arguments in there too, and they return in there too) and they won't be allowed to access the caller's smallstack locations. As long as the macroinstructions never need to access the caller's stack, now we can do the macroinstruction thing without making smallstack bigger (and having to spill the stack to memory during execution, except at the point where we call/return from macroinstructions).

does this mean that macroinstructions are distinguished from ordinary instructions in that they can't take operands of R2, R3, R4? No, because remember, the VM is copying the operands into/out of the smallstack when it calls the macroinstruction anyway; so if, say, R3 is given as an argument operand, the macroinstruction will find a copy of the value of R3 in its smallstack, and if it's the destination operand, the result will be copied into it afterwards.

what this does mean is that the macroinstruction implementation itself needs another stack, into/from which the smallstack is being spilled upon macroinstruction call.

so does this mean that now writing OVMlow in Boot is harder? Well, what we can do is incorporate the other stack into Boot, and add explicit PUSH and POP instructions which are like pushing/popping via S, except that they do spill to the rest of the stack. So the S register pushes/pops the small small stack, and push and pop work on the smallstack. Of course these would be distinct from push and pop that would work on memory stacks.

hmm, i kinda like this!

otoh, if we do this, is the S register really that useful? Macroinstructions can only access 3 stack slots anyways, so why not just use them as registers always and have S spill to memory like it currently does? Is the moving-stuff-around that a stack provides that useful if we only have 3 slots? I guess for the 8-bit encoding it helps a lot to have an S, but does it help a lot to have a faster S with only 3 slots?

in a way, the main gain is just that, because 3 stack locs are cached instead of only 1, the macroinstructions can do their thing without expensive stack spills -- even if the meaning of S is unchanged. So it might be worth it to cache the top 3 stack slots even if we don't have a way to push/pop only to those slots.

but my tentative guess is that, yes, changing the meaning of S will be worth it too.

---

so:

ok i think it's not worth it to change the definition of S. The times you really want a stack is when there is something like function calling, anyways.

of course, even 3 regs isn't enough to hold the 4 operands in the 64-bit encoding. But presumably one of those will often be the destination register.

---

since $0 doesn't need to be stored, a Boot x86_64 implementation can store the PC there. But it probably wants some more registers, in order to store its own stackpointer, and a few temporary registers.

so i think two of the 3 GPRs are the ones that should probably go into memory: $6, $7 (for two temporaries) and &6, &7.

That gives the implementation 5 registers to work with, which should be enough for the PC, a stackpointer, and 3 temporaries.

---

Displacement mode, if we have one in BootX?, should be relative to r3, not tos.

Should we use a frame pointer for displacement mode instead of stack ptr? That would mean putting a FP in R3.

---

other name ideas: gemini biplane twin deuce janus bivalve binocular duplex duo bifocal bifold

---

i noticed that:

This all suggests to me that a strong/sequentially consistent (SC)/RW fence (presumably like Power's sync, ARMv7's dmb, RISC-V's FENCE RW,RW, RC11's fence with mode seq_cst) is the most common/important type of fence.

When you see that Boot has a FENCE instruction with modes like R,RW available, you start thinking about a bunch of hairy complexities like what the ramifications of that are, how it relates to C's fences with modes like seq_cst, etc. A Boot implementor could overlook the other types of fences and just upgrade them to a FENCE RW,RW, but before they do that they'll probably spend time trying to figure this other stuff out and get annoyed at how complicated it is, and worried that they don't understand it (at least, that's what i did); so let's not make them think.

So i'm removing the modes from the Boot FENCE instruction; there will only be one FENCE instruction, it will be RW and seq_cst.
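
So the conservative C mapping for the one remaining FENCE is just (a sketch; boot_fence is a hypothetical name):

    #include <stdatomic.h>

    /* Boot's single FENCE: orders reads and writes both ways, sequentially
       consistent. */
    static inline void boot_fence(void) {
        atomic_thread_fence(memory_order_seq_cst);
    }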

now i can remove the following notes from the spec, which were marked t o d o:

is this true: a FENCE rw,rw is like (?or mb, at least as strong as?) a C atomic_thread_fence with memory_order_seq_cst

do fences synchronize with other fences on the same memory_domain in an SC manner?

7 of the 16 possible #fence_type combinations in FENCE seem to be meaningless, as they imply that either the prior flags or the subsequent flags are all empty (, r, w, rw, ,r ,w ,rw). Do these have any effect? If not, should we use them for something else?

---

from my concordance, i see that "Note that the intersection of these C/C++ AMOs with the set of AMOs provided by RISC-V and LLVM is: swap, add, and, or, xor."

we already have swap (CAS). So i just now added:

add_amo ap_amo and_amo or_amo xor_amo

i also added sequentially consistent (SC) synchronization variants of these:

{add, ap, and, or, xor}_amo_sc

and also cat_sc

that's a lot of opcodes wasted on synchronization, though, i dunno if i like that. I think those guys could be built from the atomic load/store and fence and cas primitives anyways? Maybe this stuff should go in the OVM layer. Also, although Bridging the Gap between Programming Languages and Hardware Weak Memory Models has both CAS and FADD AMOs (figure 2. Programming language syntax), Repairing Sequential Consistency in C/C++11 appears to have only CAS.

Also, if the platform we are on has only CAS and doesn't support ll/sc, it's not immediately obvious to me how to implement add_amo without a runtime that reserves extra memory for a lock or somesuch. https://preshing.com/20150402/you-can-do-any-kind-of-atomic-read-modify-write-operation/ proposes a way but i think that would e.g. sometimes lose concurrent increments due to the ABA problem (e.g. if two threads are doing an atomic add at the same time, the shared variable might only be incremented once). Oh wait a minute, no it's fine; if another thread does an increment in between when the shared variable is fetched and the CAS, then the CAS fails the precondition; and if, in between the fetch and the CAS, one other thread increments, then another thread decrements, the precondition succeeds but that's fine because the semantics of atomic addition is fulfilled on the resulting state.
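
A sketch of that CAS-loop approach in C11 (my code, following the idea in the linked Preshing post; names hypothetical). The key point from the reasoning above: if another thread modifies the variable between the load and the CAS, the CAS fails and we retry, so no increment is lost:

    #include <stdatomic.h>
    #include <stdint.h>

    /* Atomic fetch-and-add built from compare-and-swap only. */
    static int32_t cas_fetch_add(_Atomic int32_t *p, int32_t delta) {
        int32_t old = atomic_load_explicit(p, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                   p, &old, old + delta,
                   memory_order_relaxed, memory_order_relaxed)) {
            /* on failure, 'old' is refreshed with the current value; retry */
        }
        return old;
    }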

ok in that case i guess it does make sense to have these in here. The CAS implementation of the relaxed memory order variants is so simple that we can leave it to the OVM level; but it may make sense to have the SC memory order variants, because although (i think?) those can be conservatively implemented by surrounding them with FENCE instructions (?is that correct?), some platforms (such as RISC-V) might offer more performant/fine-grained variants (that synchronize only those AMOs, rather than synchronizing everything like a fence).
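
For reference, the conservative 'surround it with fences' construction i'm describing would look something like this (a sketch; whether this is actually sufficient is exactly the open question above):

    #include <stdatomic.h>
    #include <stdint.h>

    /* Hypothetical add_amo_sc: a relaxed atomic add bracketed by full fences.
       On a CAS-only platform the middle line would instead be the CAS loop
       sketched earlier. */
    static int32_t add_amo_sc(_Atomic int32_t *p, int32_t delta) {
        atomic_thread_fence(memory_order_seq_cst);
        int32_t old = atomic_fetch_add_explicit(p, delta, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        return old;
    }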

---

should our ldrc and strc have RCpc or RCsc semantics?

"PLandDRF1allowforalloptimizationsofWOandRCsc[2,5."

2. S. V. Adve and M. D. Hill, A Unified Formalization of Four Shared-Memory Models, Computer Sciences Technical Report #1051, University of Wisconsin, Madison, September 1991. Submitted for Publication.

5. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta and J. Hennessy, Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors, Proc. 17th Annual Intl. Symp. on Computer Architecture, May 1990, 15-26.

" Ournewmemorymodel,calledPLpc3, extendsourpreviousframeworktoallowtheprogrammertoreasonwithSC,whileallowingtheadditionaloptimizationsofTSO,PSO,PCandRCpcaswell?.It achievesthisbyrequiringtheprogrammertoadditionallyidentifya commonclassofsynchronizationaccesseswhereoneprocesswaitsona locationuntila particularvalueis writtenbyanotherprocess. ... Wefurthercategorizethecompetingaccessesasloopornon-loop.Figure2 summarizesthiscategorization.Althoughourdefinitionsmayseemcom-plex,theintuitionis fairlysimpleandtheinformationwerequire,i.e.,whetheranaccessis competingornotandwhetherit is a loopaccessornot,is naturallyavailabletotheprogrammerinmostcases. ...

Definition4:LoopandNon?-LoopReads:A competingreadis aloopreadif (i)it is thefinalreadofa synchronizationloopconstructthatterminatestheconstruct,(ii)it competeswithatmostonewriteinanySCexecution,and(iii)if it competeswitha writeinanSCexecution,thenit returnsthevalueofthatwrite;i.e.,thewriteisnecessarytomakethesynchronizationloopconstructterminate.Acompetingreadthatis nota loopreadis anon-loopread.Definition5:LoopandNon?-LoopWrites:Acompetingwriteisaloopwriteif it competesonlywithloopreadsineverySCexecution.A competingwritethatis nota loopwriteis anon-loopwrite "

"However,sincemanycompileroptimizationsrequirethefullflexibilityofreorderingbothreadandwriteaccesses,webelievetheextragainfromthelimitedreorderingbetweencompetingaccessesasallowedbyPLpcwillbemoreprominentinhardwarethaninthecompiler."

i think the important parts are:

this suggests that we should just switch to RCsc, and furthermore that RCpc is hard to provide a DRF-SC like guarantee for.

hmm, actually these Armv8.3-A release notes say that C/C++ acquire/release/acq_rel memory orders are RCpc, not RCsc:

" Memory consistency model

The Armv8.0 support for release consistency is based around the “RCsc” (Release Consistency sequentially consistent) model described by Adve & Gharacholoo in [1], where the Acquire/Release instructions follow a sequentially consistent order with respect to each other. This is well aligned to the requirements of the C++11/C11 memory_order_seq_cst, which is the default ordering of atomics in C++11/C11.

Instructions are added as part of Armv8.3-A to support the weaker RCpc (Release Consistent processor consistent) model where it is permissible that a Store-Release followed by a Load-Acquire to a different address can be re-ordered. This model is supported by the use of memory_order_release/ memory_order_acquire /memory_order_acqrel in C++11/C11.

[1] Adve & Gharachorloo Shared Memory Consistency Models: A Tutorial " -- Armv8-A architecture: 2016 additions

---

just for future note, i always say that Oot doesn't care about performance, so why does Boot have non-atomic, relaxed, and release concurrency memory instructions? Wouldn't it be easier to only expose sequential consistency?

There are two answers. One is that Oot doesn't care about performance, but Boot does, because it's lower-level.

The second answer is that Oot is especially focused on concurrency, so i'm erring in the direction of having the Oot ecosystem support a wide variety of concurrency primitives.

---

i think the memory consistency model specification is getting too complicated and i should adopt something more like RVWMO with multi copy atomicity.

i think in fact, the implementation that would violate multi-copy-atomicity is not even a distant server, but rather write-back caches shared between some but not all clients of that server; now clients who share a write-back cache see each others' writes before everyone else does, b/c they see those writes before the server does! so there is sort of not even a single source of truth! i think multi-copy-atomicity can coexist with distant servers and network delays (as much as anything with a strong FENCE can, at least, and we will have that).

(not sure about what i just said there though...)

---

yeah, i think that's right; the multi-copy-atomicity constraint is that all other processors besides the one doing the write see the write at the 'same time'. But we're not talking clock time here, we're talking ordering vs. other events. So a big delay to reach the server is fine; all we need to prevent is the server somehow receiving the write, noting it and serving it to some clients, yet still serving the old value to other clients. One could imagine that some sort of inconsistent (in the CAP sense; CAP consistency is 'Every read receives the most recent write or an error') database would fail this. But i imagine that an inconsistent database could have other problems here (like not being able to properly implement FENCE), depending on what kind of inconsistency it was (that is, what the database's consistency model is).

anyhow, i started to sketch out a formal memory model and i quickly found it getting nearly as complicated as C++'s, which i think is too complicated. The only thing keeping me from using RISC-V's RVWMO model was my desire to keep it more flexible for relaxed implementations (eg ones without multi-copy atomicity), but i don't think this is worth all that complexity. And if we can just use RVWMO model then we have the huge benefits that the model was designed by experts, got lots of peer review, has already had related work done like the tables in the RISC-V spec telling you how to port memory ordering instructions for various other platforms to RISC-V and tooling like the excellent RMEM and herd which apparently has a RVWMO model [22], and has a community that we can go to with questions.

and after looking at this longer and looking at C++'s formal memory model, etc, i think the RISC-V one, as complicated as it is, is simpler than most others (maybe the Golang model is even simpler but not by much, and it's at a slightly higher level of abstraction). So multi-copy atomicity probably really is required for any sort of simplicity.

so i'm going to throw out my formal memory model sketch (actually, i'll copy it somewhere in case i change my mind) and go with RVWMO.

still have to decide on the discrepancies. For example, since we have the concept of non-atomic updates which RISC-V doesn't have except for mixed sizes, which it admits aren't really formalized well yet anyway, does their presence mess everything up? Or does our 'data races undefined' fix that? Currently, our 'data races undefined' constraint only requires no data races in every SC execution, but should we strengthen it to require no data races in every RVWMO execution? Probably, yes.

---

ok new questions to maybe ask the RVWMO folks:

Hi, the excellent RISC-V unprivileged spec, Appendex A, described how to implement C++ atomics in RISC-V, but i'm curious about the reverse problem.

i'm thinking about how one could implement RISC-V-style memory operations in C or C++ using C14/C++14 atomics, such that the new memory operations would provide the same guarantees as the corresponding RISC-V ones do; so that, if i used these new operations for all concurrency in some C++ program, that C++ program would obey all of the guarantees of the RVWMO memory model.

I'm particularly interested in how to map the following 9 RISC-V operations into C++ atomics (assume that all operations on shared memory are always 32-bit, aligned, and use only the new RISC-V style operations). As a start, i've provided the reverse of the mappings in Table A.6. This isn't necessarily the right answer, because maybe Table A.6 mapped some weaker C++ constructs to stronger RISC-V ones, but i need to map each RISC-V instruction to a C++ atomic which is at least as strong or stronger. If someone's already thought about this in code or in a paper, please point me to it.

So, in the mapping that i just gave, which items are wrong? For example, i assume that C++ memory_order_relaxed isn't multi-copy atomic, so does that mean that RISC-V lw shouldn't be mapped to C++ atomic_load(memory_order_relaxed) like i did above? The larger question is, in which items above is the C++ replacement on the right of the arrow weaker than the RISC-V instruction on the left of the arrow, taking into account that i want all of the guarantees of the RVWMO memory model to be upheld in the resulting C++ program?

thanks, b

---

save this other one for later:

1) in RISC-V atomics extension ('A'), the .aqrl mode of AMOs are specified to be sequentially consistent -- was this a choice, or a consequence of RVWMO? For example, in C++, there is memory_order_acq_rel, which is weaker than memory_order_seq_cst. Would it have been possible in RVWMO to make a mode of AMO which was both an acquire and a release, but not sequentially consistent (SC), or does the nature of RVWMO imply that anything which is both an acquire and a release is always also SC?

---

note to self: in that memory consistency order part in ootAssemblyNotes18.txt, the mnemonic that RISC-V was thinking of using for RCsc lr was lr.aqrl.

---

in my analysis of RISC-V Genealogy in the Concordance, here are 15 core instructions:

This is 15 instructions.

This suggests a simpler 16-bit encoding (swap out BEQ for BLT; and in the interpretation of ADDI's constant, have spots for INT32_SZ and PTR_SZ)... of course we want that darn encoding length format bit. We could add a MORE instruction and stuff the shifts, logicals, SUBs, and SCALL into 2-operand instructions under the MORE. Or, we could do what we did, and reduce the number of registers from 16 to 8, freeing up 3 bits, one of which is spent on encoding length format, with 2 bits left.

---

Yesterday I asked RISC-V isa-dev how it would be possible to write a program in C++ respecting RVWMO:

https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/SoJMEZSK700

--

Hi, the excellent RISC-V unprivileged spec, Appendex A, described how to implement C++ atomics in RISC-V, but i'm curious about the reverse problem.

i'm thinking about how one could implement RISC-V-style memory operations in C or C++ using C14/C++14 atomics, such that the new memory operations would provide the same guarantees as the corresponding RISC-V ones do; so that, if i used these new operations for all concurrency in some C++ program, that C++ program would obey all of the guarantees of the RVWMO memory model.

I'm particularly interested in how to map the following 9 RISC-V operations into C++ atomics (assume that all operations on shared memory are always 32-bit, aligned, and use only the new RISC-V style operations). If someone's already written a C++ library for this or thought about it on a mailing list or in a paper, please point me to it. As a start, i've provided below the reverse of the mappings in Table A.6. This isn't necessarily the right answer, because maybe Table A.6 mapped some weaker C++ constructs to stronger RISC-V ones, whereas i need to map each RISC-V instruction to a C++ atomic which is at least as strong or stronger.

(btw, should i have written lw.aqrl and sw.aqrl for the RCsc hypotheticals?)

So, in the mapping that i just gave, which items are wrong? For example, i assume that C++ memory_order_relaxed isn't multi-copy atomic, but i want my C++ program to obey RVWMO just like RISC-V does, so does that mean that RISC-V lw shouldn't be mapped to C++ atomic_load(memory_order_relaxed) like i did above? The larger question is, in which items above is the C++ replacement on the right of the arrow weaker than the RISC-V instruction on the left of the arrow, taking into account that i want all of the guarantees of the RVWMO memory model to be upheld in the resulting C++ program?

thanks, bayle

--

(WHY did i misspell 'Appendix' in the very first line?!? i should have used a spellchecker)

No replies yet. The more i think about it, the more i think that RVWMO's multi-copy-atomicity property can only be achieved (blindly/without knowing which program we are compiling) by using only memory_order_seq_cst in C++.

The RISC-V spec says: "The RVWMO memory model is defined in terms of the global memory order, a total ordering of the memory operations produced by all harts. In general, a multithreaded program has many different possible executions, with each execution having its own corresponding global memory order. The global memory order is defined over the primitive load and store operations generated by memory instructions."

So it seems to me that ALL memory operations participate in this global memory order.

How is having a global memory order different from sequential consistency? Lamport defined 'sequential consistency' as "... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."[1]

It seems therefore that RVWMO satisfies the first condition of 'sequential consistency' but not the second; as the spec says, "The global memory order for any given execution of a program respects some but not all of each hart’s program order.". Which is why the rest of the RVWMO model spec focuses on defining WHICH parts of program order are 'preserved' in the 'preserved program order' ("The subset of program order that must be respected by the global memory order is known as preserved program order.").

So, in RVWMO, ALL memory operations, not just ones with RCsc annotations, not just synchronization ops, are in a global memory order.

How do different processors ever observe things in different orders then? I think it's because the 'Load Value Axiom' allows a read to read EITHER the last write before the read in the Global Memory Order, OR the last write before the read in Program Order.

Now let's look at C++. I don't have the actual standard, but here's a draft: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf

The 'memory_order's are defined in clause 29.3, page 1145. 29.3.3 on page 1146:

" There shall be a single total order S on all memory_order_seq_cst operations, consistent with the “happensbefore” order and modification orders for all affected locations, such that each memory_order_seq_cst operation B that loads a value from an atomic object M observes one of the following values: —(3.1)the result of the last modificationAofMthat precedesBinS, if it exists, or —(3.2)ifAexists, the result of some modification ofMthat is notmemory_order_seq_cstand that does nothappen beforeA, or —(3.3)ifAdoes not exist, the result of some modification ofMthat is notmemory_order_seq_cst "

and 29.3.8 says:

" Note: memory_order_seq_cst ensures sequential consistency only for a program that is free of data racesand uses exclusivelymemory_order_seq_cstoperations. Any use of weaker ordering will invalidate thisguarantee unless extreme care is used. In particular,memory_order_seq_cstfences ensure a total orderonly for the fences themselves. Fences cannot, in general, be used to restore sequential consistency for atomicoperations with weaker ordering specifications.— end note] "

And i don't see anything about a single total order for the other non-seq_cst memory_orders. There are a bunch of statements about how seq_cst fences order atomic operations even if they are non-seq_cst however; but the note at 29.3.8 says that it's hard to make use of that to order everything else. How could this be?

The first answer to [23] suggests that the problem is that "observing thread can simply observe in a different order, and will not matter what fences you are using in the observed thread", that is, if the observing thread has a weaker memory order then the seq_cst fence will order writes but another observing thread may in some cases disregard this ordering.

https://plv.mpi-sws.org/scfix/paper.pdf also goes into this in Section 2.2, "Second Problem: SC Fences are too weak". They say that "In general, one might expect that inserting an SC fence between every two instructions restores sequential consistency. This holds for hardware memory models, such as x86-TSO, Power, and ARM, for programs with aligned word-sized accesses (for their analogue of SC fences), but holds neither in the original C11 model nor in its strengthening [5]...". They give a litmus test 'RWC+sync':

Thread 1: x :=_rlx 1
Thread 2: a := x_rlx 1; fence_sc; b := y_rlx 0
Thread 3: y :=_rlx 1; fence_sc; c := x_rlx 0

They say that this is allowed by the C11 standard (and they propose a fix; did later versions of C++ accept their fix? http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0668r1.html refers to it, was P0668R1 adopted?).
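
For concreteness, here is (what i believe is) the same litmus test written with C11 atomics; per that paper, the outcome a==1 && b==0 && c==0 is allowed by the original C11 rules even though both observer threads contain seq_cst fences (variable and function names are mine):

    #include <stdatomic.h>

    _Atomic int x, y;
    int a, b, c;

    void thread1(void) { atomic_store_explicit(&x, 1, memory_order_relaxed); }

    void thread2(void) {
        a = atomic_load_explicit(&x, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        b = atomic_load_explicit(&y, memory_order_relaxed);
    }

    void thread3(void) {
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        c = atomic_load_explicit(&x, memory_order_relaxed);
    }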

So, it sounds like it's not easy to see how to satisfy RVWMO (which demands a global order over all memory operations) if any memory operation at all is not a seq_cst C++ atomic.

And memory_order_seq_cst implies that the memory operation so labelled is both an acquire and a release. Unfortunately, i guess C++ doesn't provide a way to demand a global total order (which i guess might be another way of saying MCA?) without also ordering each and every memory operation. In this way, RVWMO is 'weaker' (it allows you to demand a global total order without also demanding ordering every memory operation), but in another way, C++ is weaker (it allows you to not demand a global total order); overall, C++ is weaker, since operations using the C++ relaxed memory model are weaker than anything in RVWMO.

wait, the above paragraph is INCORRECT; C++ seq_cst memory_order actually means " A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order (see Sequentially-consistent ordering below) " [24]. So you CAN have an acquire without a release in RCsc in C++!

The bit above about needing to use only memory_order_seq_cst for "SC for DRF" to apply is also important to me.

So, if i'm right, what should we do?

a) use RVWMO (with inefficient C++ compilation for weaker operations)
b) use C++ (with its complicated memory consistency model)
c) use sequential consistency (because it's the 'least upper bound' of RVWMO and C++)
d) mostly use RVWMO but with a carve-out for 'relaxed' ordering (exclude those from the global order, but guarantee that fences around them do order them)
e) something else (like RVWMO with carve-outs for both acq/rel and relaxed)

An argument against (a) and (c) is that they don't allow us to take advantage of truly relaxed ordering (which could be good for parallel reduce, which we intend to use a lot in Oot).

An argument for (c) is that the fact that we are having this discussion is even more evidence that this stuff is too complicated, so we need to just go with the simplest thing, which is SC. Higher-level programs will want SC anyways, the non-SC stuff was just for concurrency libraries.

An argument against (d) is that now i, a non-expert without the time to prove correctness theorems, am fiddling with the semantics, and i'll probably break them.

---

Aside: now that i know a little more i understand the RC11 paper a little better.

One aside to the aside is that they keep talking about acyclicity conditions; they explain that the reason is that acyclicity of the union of a bunch of relations is equivalent to the existence of a total order containing each of those relations.

Now, here are the things that the RC11 paper tries to fix, and what i think of it:

IRIW-acq-sc:

Thread 1: x :=_sc 1
Thread 2: a := x_acq 1; c := y_sc 0
Thread 3: b := y_acq 1; d := x_sc 0
Thread 4: y :=_sc 1

Z6.U:

Thread 1: x :=_sc 1; y :=_rel 1
Thread 2: b := FAI(y)_sc 1; c := y_rlx 3
Thread 3: y :=_sc 3; a := x_sc 0

their fix imo makes things more complicated rather than more simple; apparently the C11 standard requires "that hb (on SC events) be included in S" (hb is happens-before, and S is the global total order induced by sequential consistency); they want to weaken this by adding an extra condition to hb edges before they need to be included in S: "provided that the hb-path between the two events either starts and ends with sb edges, or starts and ends with accesses to the same".

(sb is sequenced-before, the program order; rf is reads-from and mo is modification-order)

The final model is pretty short but pretty dense (section 3.2).

I disagree with their first fix (making IRIW-acq-sc and Z6.U allowed) but i agree with the others; that is, i agree with their strengthening and disagree with their weakening. The resulting model, while easier to read than the C++ standard, is still complicated, however.

p0668r1 makes the point that my preference of leaving IRIW-acq-sc and Z6.U disallowed at the cost of worse efficiency on Power also leads to less efficiency on GPUs:

"Nvidia GPUs have a memory model similar to Power, and share the same issues, probably with larger cost differences. For very large numbers of cores, it is natural to share store buffers between threads, which may make stores visible to different threads in inconsistent orders. This lack of "multi-copy atomicity" is also the core distinguishing property of the Power and ARM memory models. We suspect that other GPUs are also affected, but cannot say that definitively."

---

http://nil.csail.mit.edu/6.824/2016/notes/gomem.pdf says that the intent of the read axioms in the Golang memory model was to be "like DRF-SC without allowing compilers to ruin programmers’ lives."

i like that! (for Oot, not for Boot, because Boot must be easily compilable to C++)

The aforementioned rules are: " A read r is allowed to observe a write w to location v if:

A read r is therefore guaranteed to observe w if:

---

interestingly here Daniel Lustig, as an Nvidia guy, praises non-multi-copy-atomicity: http://mail.openjdk.java.net/pipermail/jmm-dev/2018-May.txt

in contrast to the talk https://content.riscv.org/wp-content/uploads/2018/05/14.25-15.00-RISCVMemoryModelTutorial.pdf where he seems supportive of multi-copy-atomicity (but i guess, as Chair, he was just describing the decisions reached)

the thesis https://www.repository.cam.ac.uk/bitstream/handle/1810/292229/Top.pdf?sequence=1&isAllowed=y gives more insight into RISC-V's decision-making process for multi-copy atomicity:

" Earlier versions of the RISC-V ISA manual had a relaxed non-multicopy-atomic concurrencymodel [120]. In the process of checking the C11 compilation scheme for RISC-V, Trippel et al.[120]identified several issues in this earlier memory model: it had no cumulative barriers, didnot enforce ordering resulting from dependencies, and allowed read-read coherence violations[120]. To address these shortcomings the RISC-V Memory Model Task Group was formed [26].Shaked Flur, Peter Sewell, and the author joined this Task Group in August 2017 and havesince been active participants. At that point the group was discussing various design choices for ahypothetical multicopy atomic relaxed model enforcing most syntactic dependencies, with draftaxiomatic and operational formalisations, but also the question of whether RISC-V should adoptthe TSO memory model. The Task Group’s work took place in online mail discussions and weeklyphone conference meetings, and involved debating various semantics choices: whether RISC-Vshould allow load-store re-ordering, the TSO question, and various details of a potential relaxedmodel, including same-address ordering choices, dependency questions, the semantics of certainbarriers and release/acquire, compilation schemes for C/C++, etc.After much discussion the Task Group has converged on a proposal for the RISC-V memory modelthat closely follows that of MCA ARMv8, but also incorporates some changes motivated by thedeadlock issues in the operational modelling of ARMv8’s load/store exclusive instructions describedin Section 6.3.10 (which similarly affect atomic memory operations). The RISC-V architecture alsohas an optional Ztso extension that identifies CPUs implementing the simpler TSO memory model.The group’s chair Daniel Lustig presented the proposed concurrency architecture at the RISC-Vworkshop in Barcelona, 7-10 May 2018. After the group’s informal internal voting process in July2018 that voted in favour of the proposal the model was officially ratified in late September 2018.

[120] C. Trippel, Y. A. Manerkar, D. Lustig, M. Pellauer, and M. Martonosi. TriCheck: Memory model verification at the trisection of software, hardware, and ISA. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2017, Xi’an, China, April 8-12, 2017, pages 119–133, 2017. doi: 10.1145/3037697.3037719. URL https://doi.org/10.1145/3037697.3037719. "

" Multicopy atomicityThe question of whether RISC-V should have a non-multicopy-atomicmodel was mostly settled at the time we joined the Task Group. Although non-multicopy-atomicitywas deemed to be potentially beneficial for highly-parallel many-core systems, and “minion cores”(small cores to which the main core offloads certain I/O intensive tasks, see for example [36]),the complexity non-multicopy-atomicity would likely have added to the memory model wasconsidered too great. Consequently, the RISC-V model is multicopy atomic. "

and here was some slides during which MCA for RISC-V was a "Major debate!":

https://riscv.org/wp-content/uploads/2017/05/Wed1000am-Memory-Model-Lustig.pdf

so sometime in between 05/2017 and 05/2018 the issue was settled. The Tricheck paper was submitted to arxiv 26 Aug 2016.

So sometime between 26 aug 2016 and 05/2017 the MCA issue became a 'major debate', and then was settled before the end of 05/2018.

ah, https://docs.google.com/viewer?a=v&pid=forums&srcid=MDQwMTcyODgwMjc3MjQxMjA0NzcBMDMwNDk4NTA2OTU5OTIyNTQ3MzcBd0x3TklFcERBd0FKATAuMQFncm91cHMucmlzY3Yub3JnAXYy&authuser=6

shows that the debate was resolved by Nov. 28, 2017

(interestingly, it also says 'ld.aqrl and sd.aqrl means RCsc, not fully fenced')

the memory model working group mailing list is at:

https://lists.riscv.org/g/tech-memory-model

the archives are closed except for group members. I wonder if they would like me to join, or only active, committed people.

also, here was the announcement about the media attention w/r/t the Tricheck paper's findings and the formation of the memory model task group: https://riscv.org/2017/04/risc-v-memory-consistency-model/

announcement of the memory model committee: https://groups.google.com/a/groups.riscv.org/forum/#!searchin/isa-dev/Memory$20model|sort:date/isa-dev/Oxm_IvfYItY/RsqlbaBoAAAJ

stuff they wanted to cover:

" - Decide whether global store (multi-copy) atomicity is beneficial/acceptable/unacceptable. This fundamentally affects the design of the rest of the model, and hence will be the top technical priority to start

(interesting email list platform btw; https://groups.io/)

---

here's the tricheck paper; now that i know more i can appreciate it better:

http://mrmgroup.cs.princeton.edu/papers/ctrippel_ASPLOS17.pdf

it analyzes the pre-MCA version of RISC-V and goes into detail about many of the questions i had, such as, do fences sync cross-thread? globally?

the early RISC-V draft version looks similar to my draft ideas! it was relaxed, not MCA, and had no dependency ordering.

they discuss how, in order to implement C++ acq/rel atomics with fences, you have to have some sort of 'cumulative' fence, which the original version of RISC-V did not have. This is b/c acq/rel semantics are TRANSITIVE, that is, section 5.1.1 says "...C11 release-acquire synchronization is required to be transitive, ordering both accesses before a release in program order and accesses that were observed by the releasing core prior to the release.". The old RISC-V draft didn't have cumulative fences.

Specifically, to enforce only acq/rel orderings you need cumulative lightweight fences, and to enforce seq cst you need cumulative heavyweight fences, as defined in 2.3.3:

" Compiler mappings for well-known nMCA architectures such as Power and ARMv7 enforce all cumulative ordering requirements of the C11 release-acquire pair on the release side, leaving the acquire side implementation non-cumulative. In this case, the cumulative fence on the release side would require that reads and writes in the predecessor set be ordered with writes in the successor set, and that reads in the predecessor set be ordered with reads in the successor set (i.e., cumulative lightweight fence). Power mappings implement release operations using a similar cumulative lightweight fence (lwsync). With all cu- mulative orderings being enforced on the release side, acquire operations only need to locally order reads before the fence with reads and writes after the fence. Power and ARMv7 can implement acquire operations using non-cumulative fences (e.g., ctrlisync and ctrlisb, respectively 5 ). ... posed mappings for the RISC-V ISA in Section 4. C11 also supports sequentially consistent (SC) atomics. An SC load is an acquire operation and an SC store is a release operation, and there must also be a total order on all SC atomic operations that is respected by all cores. As such, the program in Figure 4 must forbid the listed outcome, as there is no total order of SC operations that would allow it. At the architecture level, cumulative lightweight fences as described for release-acquire synchronization are not sufficient to implement the required ordering for this program. Even if a cumulative lightweight fence was placed between each pair of loads on T2 and T3, neither T2 nor T3 reads from writes after the fences, so the writes observed before the fences need not be propagated to other cores. Instead, fences used to implement C11 SC atomics must be cumulative fences that order reads and writes before the fence with reads and writes after the fence (i.e., cumulative heavyweight fence). Power and ARMv7 use cumulative heavyweight fences (sync and dmb, respectively) to implement C11 SC atomics. "

they also note that many projects such as Linux require some way to enforce dependency ordering.

they recommended adding dependency ordering and lightweight and heavyweight cumulative fences (and similarly, making AMO releases cumulative), forbidding load-load reordering on the same address, adding AMO modes that allow sequential consistency with only acq or only rel rather than only .aqrl being SC for AMOs, and weakening the RISC-V draft's acquire/release synchronization so that releases only synchronize with acquires (in the draft, even a relaxed read of a previous release would synchronize with that release).

---

so it sounds like what happened is: the original RISC-V draft, probably involving Lustig, supported memory orders of {relaxed, acquire, release, SC} and was not MCA; TriCheck? pointed out these problems with the early draft and made proposals to fix them without changing the nature of the draft too much; RISC-V formed the memory model consistency committee to fix it; i bet they figured that when they had tried to wing it without an axiomatic model before, they made mistakes, so maybe a formal model was needed; they probably then tried to make a formal axiomatic model incorporating TriCheck?'s fixes (or maybe the TriCheck? guys had already made one), found out that that model was very complicated and confusing, and then RISC-V's ultimate response was to say, "the formal memory model required to support non-MCA and define fence cumulativity is too complicated, so instead let's require MCA and now everything's simple". Which is very different from what RISC-V had originally tried to do (it doesn't support GPU-type things as well, due to the requirement for a global ordering).

so, they started out with similar goals as me, went through a similar process to what i'm going thru, and ultimately decided that simplicity in the formal model was more important than relaxedness, and chose MCA. Which i guess is an argument for me to choose MCA too.

Also note that ARM didn't use to have MCA, but they added it with v8.

otoh i'm focused on massive parallelism in a way that they are not.

transitive acquire/release with cumulative fences does sound complicated; i can see replacing that with a global order to simplify things (so i guess i like RCsc over RCpc -- not sure if that's the same thing but it sounds like it). But relaxed ordering does not seem too complicated.

So perhaps there is a way to start with the RVWMO memory consistency model and then add in the concept of relaxed memory accesses which are not globally ordered; the alternative (d) above:

d) mostly use RVWMO but with a carve-out for 'relaxed' ordering (exclude those from the global order, but guarantee that fences around them do order them)

---

So i'd have:

Note that, as the 'Strengthen SC fences' part of P0668r1 points out, we want "a program with only memory_order_relaxed accesses and memory_order_seq_cst fences between each pair" to have sequential consistency.
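to be concrete, the kind of program that clause is about is something like this store-buffering sketch (my own example, not from the paper): only relaxed accesses, with a seq_cst fence between each pair; under the strengthened SC fences we'd want r1 == 0 && r2 == 0 to be forbidden, i.e. the whole thing behaves as if sequentially consistent:

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;

void t1() {
    x.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_relaxed);
}

void t2() {
    y.store(1, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_relaxed);
}

int main() {
    std::thread a(t1), b(t2);
    a.join(); b.join();
    // under the strengthened rule, r1 == 0 && r2 == 0 should be impossible here
    return 0;
}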

i don't think i'm up for making a formal model of this; that could entail something like adopting the long formulas in section 2.2 of https://plv.mpi-sws.org/scfix/paper.pdf , although perhaps it could be simpler. But informally, hopefully most things are pretty clear:

that doesn't sound so crazy, does it? After all, it seems like it's just the RCsc model, which had both synchronizing operations that are ordered by SC, and ordinary reads/writes that are only ordered by the synchronizing operations. And if it's just RCsc, then we get SC for DRF, right? Although i guess RCsc may not have included relaxed atomics, so maybe we only immediately get SC for DRF if the program has no relaxed atomics.

I bet there is another SC for DRF theorem if you place strong conditions on the relaxed atomics. What i'm thinking of is:

If this is enough to give us a new SC for DRF theorem, which i bet it is, then this allows us to use relaxed atomic ops to implement parallel reduce. later: looks like section 3.3.2 of http://rsim.cs.illinois.edu/Pubs/msinclair-thesis.pdf covers this. So just say we're using DRFrlx. Note that the DRFrlx theorem is somewhat complicated but it's not a part of our spec; it's a theorem of it; the compiler only needs to know that these guys are 'relaxed'. So i think we're good. The requirements for DRFrlx are just (from ibid 3.3.7 page 70 pdf page 85):

"a heterogeneous system obeys DRFrlx if it is DRF1 compliant, it enforces happens-before consistency and per-location SC for atomics, and it constrains commutative, non-ordering, quantum, and speculative operation completion/propagation in the same way as data operations."

not quite sure what 'DRF1 compliant' means but i think it's this, from section 3.2.2: "DRF1 Model: A system obeys the DRF1 memory model if and only if the result of every execution of a DRF1 program on the system is the result of an SC execution of the program."

happens-before consistency and per-location SC are things that we want to provide anyway:

" Happens-Before Consistency: A load L must always return the value of a store S to the same memory location M in the execution. It must not be the case that L −−→ S or that there exists another store S' to M such that S −−→ S' −−→ L.

Per-Location SC: There is a total order, T loc , on all operations to a given memory location M such that T loc is consistent with happens-before-1, and that a load L returns the value of the last store S to M that is ordered before L by T loc . "

also note that "Mechanism for distinguishing memory operations: Data-race-free-1 requires that data operations can be distinguished from atomic operations, and that paired atomics can be distinguished from unpaired atomics. We reuse existing C++ support to distinguish data and atomic operations, and we add a new annotation to distinguish paired and unpaired operations, similar to how C++ distinguishes SC atomics and relaxed atomics [142].". But i don't think we have any paired atomics; i think that would be something like lr/sc.

so i think DRFrlx is the "SC for DRF with relaxed" theorem i wanted (and then some!)

TODO put the above solution into the Boot spec, and mention the DRFrlx theorem (and state that we guarantee each of its premises).

next day... buuut... i guess the other option is just to adopt C++'s memory model, plus the fixes from RC11 that i liked. Keep RCpc as it seems more like quantumness in the physical world ('synchronization'/knowledge transitively propagates/contaminates). They already defined relaxed and unsynced stuff so all we have to do is import. Offer both RCpc and RCsc. Ignore 'consume'.

So then we'd have 3 levels of memory_orders for things:

what do we lose here? We lose the ability to annotate something with the RVWMO level of memory_order, which i think may be less than C++ SC but more than C++ relaxed (and maybe more than C++ acq/rel)? Because RVWMO at least has MCA, so even a RISC-V program containing only unannotated loads and stores, or only acq/rel stores and loads, will have a global order of globally visible writes. So we lose the ability to say, "this relaxed load or store doesn't induce any particular ordering, but there is SOME global ordering that it participates in". I'm not so sure about C++ acq/rel, because C++ memory_order_seq_cst is like memory_order_acquire or memory_order_release or both (depending on the operation), plus the constraint of the existence of a total order -- but wait, maybe that is a little different, because RVWMO globally orders writes only when they become visible to threads other than the thread doing the write.

hmm... risc-v tso annotates all loads/stores with RCpc, suggesting that it turns them into acquires/releases. And i think that the reordering TSO allows but SC forbids is store-load reordering, which i guess corresponds to those annotations being RCpc rather than RCsc; RCpc, i think, allows a later acquire (load) to be moved before a previous release (store). And i think C++ acquire/release is RCpc. So i think, if you were going to translate a given RISC-V instruction to a C++ one with acquire/release memory_order, and then you upgraded it to seq_cst, the reordering opportunity that you lose might be exactly store/load. Which i think squares with something i may have read about RVWMO, about how a thread can read its own writes sooner than the global store might see them; i think this might be called 'early' reading of a write, and i think it might correspond to store/load reordering.

oh right -- this might be exactly it; because the RVWMO 'load value axiom' allows a read to read from a write which is unordered with the read in the global memory order, but which precedes the read in program order -- i guess that corresponds to store/load reordering, because the transition from program order to global order takes a store that was before a load and deletes that edge?
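here's the store-buffering (Dekker-style) litmus in C++ terms, just to pin down which reordering we're talking about (my own sketch): with seq_cst on every access, the r1 == 0 && r2 == 0 outcome is forbidden, but with release stores and acquire loads (the TSO-ish, RCpc-flavored annotation) it is allowed, because the store can still be reordered after -- or forwarded early to -- the later load:

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;

void t1() {
    x.store(1, std::memory_order_release);  // with seq_cst here and below, 0/0 is forbidden
    r1 = y.load(std::memory_order_acquire);
}

void t2() {
    y.store(1, std::memory_order_release);
    r2 = x.load(std::memory_order_acquire);
}

int main() {
    std::thread a(t1), b(t2);
    a.join(); b.join();
    // with release/acquire as written, r1 == 0 && r2 == 0 is an allowed outcome
    return 0;
}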

this suggests that what we are missing in C++ is, first, a 'consume+global total order' ordering which has a total order and dependencies (so, mb a 'consume' + total order); and also a 'TSO' ordering which is between acq/rel and SC, and which is like acq/rel + a global store total order, or like SC but with store/load reordering.

otoh since the RISC-V proposed default ld.aq and st.rl are RCsc (only the ones used for TSO were going to be RCpc; Lustig hinted to me that they weren't even considering adding those as separate instructions (which seems strange to me, b/c that would allow implementing TSO on hardware which isn't always TSO without adding a 'TSO mode' register; but otoh i guess that would add a bunch of complexity)), i think those can't reorder a load past a previous release, so in fact RISC-V RCsc loads/stores are equal to C++ memory_order_seq_cst loads/stores, and we don't lose any opportunities for reordering by considering them as such?

btw i still dont understand what those RVWMO slides meant about 'architecturally visible speculation' with load-load reordering with 'the same store'. If both loads come from the same store, how can you tell that the loads are reordered? Maybe they meant the same memory location, not the same store, and the idea is that the loads are to the same location, and some other thread stored to the location in between?

---

https://www.cs.tau.ac.il/~orilahav/seminar18/Rats.pdf shows some other possible uses of relaxed atomics:

Relaxed Atomic Category: Application
Unpaired: Work Queue
Commutative: Event Counter
Non-Ordering: Flags
Quantum: Split Counter, Reference Counter
Speculative: Seqlocks
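for instance, the 'Event Counter' row is just the usual relaxed counter (a toy C++ sketch of mine): increments commute and nothing else is published through the counter, so relaxed is enough, and the exact total only needs to be correct once the threads are joined:

#include <atomic>
#include <thread>
#include <vector>

std::atomic<long> events{0};   // event counter: increments commute

void worker() {
    for (int i = 0; i < 1000; ++i)
        events.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::vector<std::thread> ts;
    for (int i = 0; i < 4; ++i) ts.emplace_back(worker);
    for (auto& t : ts) t.join();
    // after the joins, events.load() is exactly 4000; during the run, other
    // threads would only see some monotonically growing approximation
    return events.load(std::memory_order_relaxed) == 4000 ? 0 : 1;
}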

i read http://rsim.cs.illinois.edu/Pubs/msinclair-thesis.pdf section 3.3 (which was great)

---

draft reply to email thread:

Thank you, that is very helpful -- I totally missed the dependency ordering issue!

I will post a followup question to the tech-memory-model list (I didn't know about that list when I posted my original question here).

Hello all,

This is a followup to the thread at https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/SoJMEZSK700 ; for context, I'm wondering how to compile each RISC-V memory operation to C/C++ atomics code. The RISC-V unprivileged spec Appendix A Table A.6 describes how to compile C++ atomics to RISC-V, but this is the reverse problem. That is, for each RISC-V memory operation, I'm looking to replace it with a sequence of C/C++ atomics operations, such that this translation provides guarantees at least as strong as the original RISC-V operation. This would allow me to reason about the resulting C/C++ program using the RVWMO memory model, which is simpler than the C/C++ memory model. Assume that in the resulting C/C++ program, all operations on shared memory are 32-bit and aligned.

Consider how to map RISC-V lw and sw (ordinary RISC-V loads and stores, without any RCsc or RCpc annotation).

It seems to me that: (a) RISC-V forbids the IRIW litmus test because RISC-V has the multi-copy atomicity (MCA) property, and (b) the weakest C/C++ atomics memory_order that forbids IRIW is memory_order_seq_cst

Are (a) and (b) correct?

If so, it seems to me that RISC-V lw must be mapped to C/C++ atomic_load (memory_order_seq_cst) and RISC-V sw must be mapped to C/C++ atomic_store (memory_order_seq_cst). In other words, it seems to me that C/C++ doesn't provide any memory order weaker than sequential consistency which is able to represent any RISC-V memory operations?

For reference, the IRIW litmus test:

T1:        T2:        T3:        T4:
x := 1     a := x     b := y     y := 1
           c := y     d := x

(the outcome in question: a = 1, b = 1, c = 0, d = 0)

thanks, b

My reasoning centers on pending updates to C++'s memory model and IRIW. According to https://en.wikipedia.org/wiki/C%2B%2B20 , P0668r4 was voted into C++20 at the fall meeting in November 2018 (San Diego). P0668r4 (a discussion it seems you were involved in) is here:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r4.html

In the section 'Power, ARM, and GPU implementability', it suggests that 'Alternative 2: Fix the standard' is what they are going to do, following Lahav et al.

---

I think the problem with c++ memory order consume (we know there's a problem b/c the word is that many compilers aren't bothering to implement it other than upgrading it to 'acquire', which is strictly stronger) may be that, similar to full Program Counter-based indirect control (which has the problem that, when you compile your assembly-like program into another assembly language, the sizes of various parts of the program change, so it's hard to guarantee that dynamically computed program counters will still be correct if they do arithmetic on the program counter), this feature (memory_order_consume) simply is not suitable for a high level language that will be compiled to something else. In the case of memory_order_consume the reason is that you don't want to keep track of the dependency ordering through compiler optimization transformations. It's probably easier to just keep track of release/acquire.
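the pattern consume was designed for is pointer publication, where the ordering comes along for free via the address dependency (which is exactly what RVWMO's dependency ordering gives on ordinary loads). A toy sketch (mine):

#include <atomic>

struct Node { int payload; };

std::atomic<Node*> head{nullptr};

void publish(Node* n) {
    n->payload = 42;
    head.store(n, std::memory_order_release);
}

int reader() {
    // the dereference depends on the loaded address, so consume is (in theory)
    // enough to order it after the load; in practice most compilers just
    // upgrade consume to acquire
    Node* n = head.load(std::memory_order_consume);
    return n ? n->payload : -1;
}

int main() {
    static Node n;
    publish(&n);
    return reader() == 42 ? 0 : 1;
}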

---

So:

then the only thing that we can't represent in C++ is RISC-V ordinary loads and stores. These are atomic, so in C++ we'd represent them as atomics. We'd like to represent these as C++ memory_order_relaxed; however, there seem to be at least two semantic guarantees that RISC-V ordinary loads and stores give that C++ memory_order_relaxed does not give:

And if we're not interested in C++ consume, then there are 4 or 5 memory_orders left to be interested in:

How i see this is:

Oot is more focused on massive parallelism, so we might care the most about relaxed and RCpc. And of course we should have seq_cst. Shall we throw in some RVWMO memory_models too? Not sure... 3 sounds like the sweet spot, 5 might be too complicated.

---

another idea for the Boot calling convention:

use the smallstacks, but ALSO have registers for passing more arguments in a block of memory (an int register for how many extra arguments there are, and a ptr register to point to the memory block). So, when there is space on the smallstacks for passing arguments, you can use those, and otherwise you can use the memory block.

all registers are callee-save.

This has the advantage that the macroops idea can just be expressed using this calling convention, but this calling convention can also express calling external functions that have more arguments than the smallstacks can hold (or, if some of the smallstacks are in use, it also can still express calls).

the 'memory block' can still be a frame on an external stack.
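as a purely hypothetical sketch (the names and sizes are made up, not Boot's actual layout), the shape of the convention from the callee's side would be something like: a few smallstack argument slots, plus a count register and a pointer register for the overflow block:

#include <cstdint>
#include <cstddef>

// hypothetical numbers/names, just to illustrate the shape of the convention
constexpr std::size_t SMALLSTACK_ARG_SLOTS = 4;

struct CallArgs {
    std::uint32_t smallstack[SMALLSTACK_ARG_SLOTS]; // args passed on the smallstacks
    std::uint32_t extra_count;                      // int register: # of overflow args
    const std::uint32_t* extra_args;                // ptr register: overflow memory block
};

// the callee fetches argument i from a smallstack slot if it fits, else from the block
inline std::uint32_t get_arg(const CallArgs& a, std::size_t i) {
    return i < SMALLSTACK_ARG_SLOTS ? a.smallstack[i]
                                    : a.extra_args[i - SMALLSTACK_ARG_SLOTS];
}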

but this would seem to have an issue: callee-save is thought to be bad for tail calls, because if registers are callee-save, then you have to restore before return, so how do you do tail calls? But in this particular case it's doable, just inefficient; here, if you are doing a tail call, you can just do your restore before the tail call, and then pass your arguments to the next call on the stack -- but that's way inefficient, because if you are doing a self-loop via tail call, you end up saving and restoring all the registers each time through the loop. hmmm.

mb this suggests two calling conventions:

also, i think the idea that callee-save is incompatible with tail calling is incorrect/deceptive; as long as the tail-calling function doesn't actually use the callee-save registers, then there is nothing to restore and nothing to worry about. So it's inefficient in that tail-callers can use fewer registers than others, but that's just an efficiency tradeoff detail.

hmm, so we are kind of back where we started:

I think just having rv is a better solution than adding registers to pass extra arguments to an s-like convention.

Also, recall that we want it to be easy for higher layers to ensure that the size and layout of the stack are statically predictable, so we want signatures for S for the macro-instruction cases, and we want separate instructions for those signatures, which due to exponential explosion is only really tenable for something small like 2 inputs and one output.

So, i guess, keep everything the same as it is now.

---

i've been making a series of posts to the RISC-V mailing lists asking for help in clarifying the relationship between RVWMO and the C/C++ memory model.

See the threads https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/SoJMEZSK700

and https://lists.riscv.org/g/tech-memory-model/topic/32530032#1866

Daniel Lustig pointed out that RVWMO demands dependency ordering at all times, which C++ relaxed does not, so C++ consume is necessary for the translation of loads. He also notes that most C++ compilers upgrade consume to acquire (presumably b/c consume is hard to implement).

Note that http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r4.html is an apparently-accepted change to the C++20 memory model that would apparently follow https://plv.mpi-sws.org/scfix/paper.pdf section 2.1, amongst other things. It sounds to me like that paper is saying that the litmus test IRIW-acq-sc from their second page is forbidden by C++14, but they are proposing a change that would make it acceptable (see the first sentence in the paper's section 2.1, and see section "Alternative 2: Fix the standard" in http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r4.html ). This is different from, but seems very similar to, the IRIW.acq litmus test (IRIW with the first load in each thread being a load-acquire) that i give in https://lists.riscv.org/g/tech-memory-model/message/1867 , so i'm guessing that even if C++14 forbids the one i gave, C++20 probably will not, in which case the following translation from RVWMO/RISC-V to C/C++ is not strong enough on the C/C++ side:

lw    -> atomic_load(memory_order_consume)
lw.aq -> atomic_load(memory_order_seq_cst)
sw    -> atomic_store(memory_order_relaxed)

(the litmus test is, in RMEM format:

" RISCV IRIW
(* IRIW *)
 P0           | P1             | P2             | P3           ;
 addi x5,x0,1 | lw.aq x5,0(x7) | lw.aq x6,0(x8) | addi x5,x0,1 ;
 sw x5,0(x7)  | lw x6,0(x8)    | lw x5,0(x7)    | sw x5,0(x8)  ;
exists (1:x5=1 /\ 1:x6=0 /\ 2:x5=0 /\ 2:x6=1) " )
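for concreteness, here is what the candidate mapping above turns that litmus test into on the C++ side, as far as i can tell (the variable names and scaffolding are mine); the question is whether C++, especially post-P0668, forbids the outcome r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0 the way RVWMO does:

#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void p0() { x.store(1, std::memory_order_relaxed); }    // sw
void p1() {
    r1 = x.load(std::memory_order_seq_cst);             // lw.aq
    r2 = y.load(std::memory_order_consume);             // lw
}
void p2() {
    r3 = y.load(std::memory_order_seq_cst);             // lw.aq
    r4 = x.load(std::memory_order_consume);             // lw
}
void p3() { y.store(1, std::memory_order_relaxed); }    // sw

int main() {
    std::thread a(p0), b(p1), c(p2), d(p3);
    a.join(); b.join(); c.join(); d.join();
    return 0;
}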

so it's still looking to me like there might not be any translation of RVWMO memory operations into C++ with a memory_order weaker than seq_cst.

Which still suggests to me that Boot may need to ditch RVWMO and just support the memory_orders in the C++ memory model, except consume. We'd still have a question whether we should ALSO support the RVWMO memory model or not.

If, as it seems, SC may be required in C++ to implement some relaxed accesses in RVWMO, then we certainly can't leave C++ out. Even if it turns out that you only need to enforce dependency ordering by using consume, then since consumes are usually upgraded to acquires by C++ compilers we probably want to include C++ relaxed in addition anyhow, in which case just include C++.

One part of me says, yeah if it's different then support RVWMO also, then ppl can write RVWMO programs and reason about them more simply (is there even an RMEM-like program for C++?). Another part of me says, no, if you mix C++ and something else then when faced with a program that mixes them you have to understand 2 things and how they relate.

oo here's one mb:

http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/index.html

OK here's my conclusions:

" I conclude that it looks like any compiler mapping programs from RISC-V to C++ in a context-independent manner must map ordinary, unannotated RISC-V loads and stores such that at least one of lw,sw is mapped to acq_rel, or both of them are mapped to seq_cst. " -- me at https://lists.riscv.org/g/tech-memory-model/message/1868

note: according to demsky.eecs.uci.edu/publications/c11modelcheck.ps , cppmem lacks support for fences so i can't easily check those (is there another tool like cppmem that has fences?)

---

if we want 3 fences rather than 1, then mb:

FENCE r,rw, FENCE rw,w, and FENCE rw,rw. Those seem to come up a lot.
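fwiw those three roughly line up with the C++ fence flavors (going in the reverse direction of the spec's Table A.6 mapping; a rough correspondence, not a claim of exact equivalence):

#include <atomic>

// FENCE r,rw   ~ acquire fence
// FENCE rw,w   ~ release fence
// FENCE rw,rw  ~ full (seq_cst-ish) fence
void acquire_fence() { std::atomic_thread_fence(std::memory_order_acquire); }
void release_fence() { std::atomic_thread_fence(std::memory_order_release); }
void full_fence()    { std::atomic_thread_fence(std::memory_order_seq_cst); }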

---