proj-oot-ootAssemblyNotes25

(June 30 2019)

So i'm mostly happy with Boot right now (even though it's not fully implemented or specified and what is implemented and specified is not quite in agreement with my latest plans esp. regarding redoing the jump label stuff to make stuff compilable; i do plan to fix that sometime soon), and i've stopped focusing on it and moved up to thinking about OVM.

I still have two issues/decisions to make about potentially changing Boot though. I'm trying to think if we could make things even simpler:

Both of these potential changes come out of the same line of thinking:

Now that i'm thinking about OVM, Boot's function as an easy-to-implement portability layer is coming into focus. If people want something really easy to implement, they will implement Boot, not BootX. And if they want to do more work to get better interop or performance, they will implement OVMlow, not BootX. So really, the purpose of BootX is some sort of project-internal efficiency or readability thing, not something that porters will really be concerned with. Except, right now if they want to add floating point functionality, or 64-bit int functionality, or many of the syscalls, they have to switch to BootX. Which is why i'm thinking of (a) making Boot an all-inclusive, one-stop-shop for porters; so that if they implement Boot, they're done; and (b) making Boot even simpler by switching to the 32-bit representation.

In essence, 'Boot' would be its own layer, and BootX (although still separately specified and probably separately useful for other projects) would, from the point of view of Oot porters, be thought of as 'part of that OVM layer stuff that we don't have to think about'.

Otoh, there's also something to be said for the current setup:

Either way though, i'm pretty happy with Boot. Which is why i'm thinking more about OVM.

---

For source code without any atomics/synchronization/fences (for example "a = 3; b = a + 2; print(b)"), i'd like to compile this code in the most obvious way on each platform (for example, on many platforms the previous example could compile to placing a constant value into a location, adding another constant value to that location, and then making a function call with the location as argument, with no fencing/memory barriers/synchronization at all). But if my language's memory ordering consistency model is stronger than that of some target platform, then in order to deliver the guarantees of my language's memory ordering consistency model, in some cases i may have to insert some sort of synchronization into the compiled code on that target platform.

To avoid having to think about that when porting to a new platform, if possible, i'd like to choose a sufficiently weak memory ordering consistency model for my language, so that on all of the likely target platforms the model's guarantees are delivered without me having to further understand the target platform's memory consistency model, at least for source code without synchronization. (in fact, the previous sentence is actually a little optimistic; the pragmatic reality is that i won't understand the target platform's memory consistency model very well and so i will just compile all unsynchronized code 'in the obvious way' without any thought at all, regardless of whether or not that fits the guarantees of the language's model; i'm just trying to choose a memory consistency model for the language that justifies this behavior, so that users of the language aren't surprised. Of course i could just say "the ordering of events (as observed by any thread except the thread that generated the events) is completely undefined in the absence of explicit synchronization" but that seems a bit much.).

Some major target platforms are RISC-V, C, ARM, x86-64.

So i figure that what i should do is make my language's memory consistency model something close to the strongest thing that is still weak enough such that meeting the guarantees of any of those major target platforms' memory consistency models implies that the guarantees of my language's memory consistency model are met.

So which one is weakest? Rather, are any of those platforms' memory ordering consistency models weaker/stronger than any of the others for code without atomics, fences, memory barriers, or synchronization? And more precisely: given an arbitrary piece of unsynchronized code (perhaps barring pathologically platform-specific code), if the same code were translated onto each of these platforms, then certain guarantees on ordering would apply on each platform depending on which memory ordering consistency model that platform offers -- are there any pairs of platforms in this list such that one member of the pair will always satisfy not just its own guarantees, but also all of the guarantees promised by the other member of the pair?

eh, i think that's not QUITE my question, although it mostly is.

First, i'm willing to accept a not-completely-relaxed memory model provided that it seems like it's something that "everyone's doing" (that is, it's common enough that if i end up porting to a platform with a weaker model then i'll probably know, because the fact that this platform has such a weak model will be something so unusual that people comment on it).

[1] suggests that ARM has Release Consistency (RC), and that link and also [2] suggest that x86 has something 'close to TSO'. And i sorta recall from the RISC-V discussion that TSO is stronger than RC. And https://github.com/riscv/riscv-isa-manual/blob/master/src/memory.tex says that RISC-V's memory model is called RVWMO and is a variant of release consistency.

So that suggests that, yeah, RISC-V's memory model is the weakish one that i seek.

The current RISC-V spec draft ( https://github.com/riscv/riscv-isa-manual/releases/download/draft-20190626-9f0e234/riscv-spec.pdf ) says:

"although the ISA does not currently contain native load-acquire or store-release instructions, nor RCpc variants thereof, the RVWMO model itself is designed to be forwards-compatible with the potential addition of any or all of the above into the ISA in a future extension. "

so is it really even a RC model?

The RISC-V FENCE instruction talks about whether reads or writes are being ordered, but not about acquire vs release.

Which C11/C++11 memory order corresponds to RC? You'd think it would be using memory_order_acquire and memory_order_release, or mb memory_order_acq_rel.

Is a RISC-V FENCE a C11 memory_order_acq_rel?

The RISC-V memory model also lets you choose between the PC (processor consistency) and SC (sequential consistency) variants of RC (that is, are the RC atomics ordered with respect to each other using PC or SC?).

Confusingly, in https://stackoverflow.com/questions/16179938/c11-memory-order-acquire-and-memory-order-release-semantics , two people think that the C folks totally got the meaning of the RC memory model wrong; but looking at https://en.cppreference.com/w/cpp/atomic/memory_order , it looks like the wording they are complaining about is not there, so perhaps it was somehow fixed already?

Also, https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Consume_ordering states that Release-Consume ordering (dependency ordering) is implicit "On all mainstream CPUs other than DEC Alpha". Does this mean that it holds between all memory accesses even ones in code with no synchronization (my guess is, yes, it does mean that). Does RISC-V share this property?

So some other questions (todo read back over this section and add more):

(a little later): ok i researched some more and got some more info:

stop the presses: the spec chapter on the Atomics extension has some stuff to say here: "The AMOs were designed to implement the C11 and C++11 memory models efficiently. Although the FENCE R, RW instruction suffices to implement the acquire operation and FENCE RW, W suffices to implement release, both imply additional unnecessary ordering as compared to AMOs with the corresponding aq or rl bit set." and "To provide more efficient support for release consistency [5], each atomic instruction has two bits, aq and rl, used to specify additional memory ordering constraints as viewed by other RISC-V harts...If both bits are clear, no additional ordering constraints are imposed on the atomic memory operation. If only the aq bit is set, the atomic memory operation is treated as an acquire access, i.e., no following memory operations on this RISC-V hart can be observed to take place before the acquire memory operation. If only the rl bit is set, the atomic memory operation is treated as a release access, i.e., the release memory operation cannot be observed to take place before any earlier memory operations on this RISC-V hart. If both the aq and rl bits are set, the atomic memory operation is sequentially consistent and cannot be observed to happen before any earlier memory operations or after any later memory operations in the same RISC-V hart and to the same address domain."

mb also read relevant sections of "Exploring the Trisection of Software, Hardware, and ISA in Memory Model Design"

so if i'm right, it sounds like RVWMO is good, and FENCE-sq is good, and we need to add acquire_sc &mem and release_sc &mem.

---

mb should draft a post to a RISC-V memory model guru or email list to ask my questions.

Hello, i'm designing a VM for a higher-level language, and i'm trying to understand some of the RISC-V memory consistency model. The VM may be implemented in RISC-V but also in other platforms such as C, so i'm trying to come up with instructions for the VM that can be implemented on various platforms, while also understanding how to use those VM instructions to guarantee data race freedom and/or sequentially consistent execution of programs written on top of the VM.

My current idea (suggestions welcome!) is to have 5 classes of memory instructions in my VM:

My goal is to choose a set of memory instructions for my VM that can be described by a formal memory model, and can allow the expression of efficient algorithms in the future (i.e. to allow programs to be written on the VM without unnecessary synchronization), but then to write the initial implementation of the VM itself in a conservative (oversynchronized) manner (to save myself trouble during the initial implementation). I also want to understand how to easily write programs in the language of the VM in a conservative (oversynchronized) manner, again to save myself from having to debug concurrency bugs before i get far enough along to care about performance.

In other words I want to create VM instructions that can express weaker ordering, but for the time being I want to implement these in a conservative manner by actually providing a stronger ordering than the instructions guarantee.

I had a few questions about RISC-V and about how the RVWMO model relates to others.

1) Is there a guide somewhere for how to implement C/C++ atomics on RISC-V, and if not, is there a recommended/correct implementation/C/C++ compiler whose source code could serve as a tutorial?

2) In RISC-V, in order to blindly make any program sequentially consistent, is it sufficient to place a FENCE R,RW after each load and place a FENCE RW,W before each store?

3) If one was trying to implement a FENCE instruction with similar or stronger guarantees as RISC-V's FENCE (excluding the I/O and TSO functionality), how could one implement that in C or C++, using C atomics?

4) How does C/C++'s https://en.cppreference.com/w/cpp/atomic/atomic_thread_fence, which has a memory_order argument, relate to RISC-V's FENCE, which has flags like rw,rw?

5) On a RISC-V platform, how should one implement C/C++'s atomic_thread_fence with memory_order=seq_cst?

6) If RVWMO is multi-copy atomic then why are RCpc annotations even mentioned in the draft RVWMO chapter of the spec? isn't the whole point of Processor Consistency that it is not multi-copy atomic?

7) (i guess this question isn't really RISC-V specific) On any platform whose only memory operations are: * non-synchronized LOAD/STORE instructions, and also * synchronizing RCpc LOAD-ACQUIRE and STORE-RELEASE instructions, if data races are defined as a pair of unsynchronized instructions at least one of which is a write, and, if a particular program can be shown to have sufficient use of synchronizing instructions to eliminate all data races (that is, to induce an ordering over every pair which would otherwise be a data race, under the formal memory model of the platform), then: is that program guaranteed to be sequentially consistent on that platform?

8) Does the addition of a CAS instruction to the platform change the answer to the previous question?

thanks, bayle

OK stop the presses again, there is an Appendix A in the draft spec that gives tables of how to implement the C atomics in terms of RISC-V instructions.

Two questions i had about the appendix were:

---

probably need load_acquire and store_release variants of load and store? so might have to get rid of 'bne' to make room :(. Alternately (probably better), have 1-argument fences:

load_acquire &mem
store_release &mem
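
A rough sketch of what these would mean on a C/C++ host (my own guess at the mapping, nothing official; assuming 32-bit cells held in std::atomic):

    #include <atomic>
    #include <cstdint>

    // load_acquire &mem: an acquire load of a 32-bit cell
    inline uint32_t load_acquire(std::atomic<uint32_t>* mem) {
        return mem->load(std::memory_order_acquire);
    }

    // store_release &mem: a release store of a 32-bit cell
    inline void store_release(std::atomic<uint32_t>* mem, uint32_t value) {
        mem->store(value, std::memory_order_release);
    }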

---

should memory_domain be an integer (like an I/O channel descriptor) or a pointer (so we can point to a memory_domain descriptor in memory)? i guess it should be like an I/O channel descriptor. All of POSIX, Windows, Plan 9 appear to use integer file descriptors (well, Windows' "HANDLE" data type is an opaque 32-bit value but i think it's internally used as an integer index into an object table), so i guess we should make it an integer too.

---

7 of the 16 possible #fence_type combinations in FENCE seem to be meaningless, as they imply that either the prior flags or the subsequent flags are all empty (, r, w, rw, ,r ,w ,rw). Do these have any effect? If not, should we use them for something else?

---

i changed it back to say that there is undefined behavior ('UB') on arithmetic overflow. This is because C has UB on signed arithmetic overflow, and we want to be able to just directly compile to C without additional checks.

also we want to allow the implementation to replace our int32s with register-width (often 64-bit) quantities, which means that we can't guarantee wraparound arithmetic:

"TL;DR: C doesn't work very well if int!=register width, but (for backwards compat) int is 32-bit on all major 64-bit targets, and this causes quite hairy problems for code generation and optimization in some fairly common cases. The signed overflow UB exploitation is an attempt to work around this." [16]

---

the more information we put into registers that doesn't really need to be there, the fewer 'real' registers we have, which means the less stuff that we'd want to map into 'real' registers on, say, an x86_64 assembly implementation, which is nice because the implementation needs a few x86_64 registers for its own use.

In addition, it's nice because then each thread costs less memory, so we can have more Oot threads in total.

some more information that we COULD put into registers for this purpose would be:

PC in r15, interrupts (flags that say if an interrupt of 16 types are pending, as well as if interrupts of those 16 types are masked) in r14, misc. status register (for future use; stuff like what protection ring are we currently running in) in r13.

we could also cache the top two stack items instead of just TOS.

should also look at what else the erlang process control block has.

---

recall that the reason for just passing arguments on smallstack in the calling convention was that if you have caller saved registers anyways, you may as well use them for arguments.

---

apparently we want callee cleanup for tail calls, i've heard

---

A nice little machine might have:

What i'm suggesting here is replacing SMALLSTACKs with even smaller ones that fit wholly in registers. This makes them unusable for macrocompilation but for ordinary execution it would be fast.

The reason I mention these is that it's nice to have stack wholly in regs on x86_64. If the stack is wholly in regs then you can do stack pushes/pops without touching main memory.

4 stack locs is minimum to have: return address, two args, and a place to put a return value from a subroutine (without clobbering arguments). This suggests that at least 3 stack locs should be cached in registers (so that you could push a return address and two arguments before calling a function).

Also, if we have at least 2 stack locs cached in registers, then we can do a SWAP on the stack by using a register SWAP operation; so we wouldn't need separate stack SWAP operations (otoh we can already do that by doing register SWAP on $S, i guess).

Since you can still do normal register ops on the 3 stack locs, this isn't really much worse than just having 8 registers; a register-oriented program could just ignore the stack and use those as registers (it's a little worse if the program does use the stack though, because then it can't use the two new stack registers as ordinary registers, because they'd get clobbered; so now there are only 3 really-registers instead of 5, which is a big difference). And we can still provide 16 registers as normal in the R2, R1 instruction encodings.

To push in this machine, you would write to S (R1). Let's say the stack locs are in registers R2,R3,R4, with R2 as the top of stack. Pushing would have the effect of: R4 := R3, R3 := R2, R2 := the pushed value,

and popping would have the effect of: the popped value is read from R2, R2 := R3, R3 := R4, undef R4.

If we specify further that the value of R4 MIGHT be saved somewhere else when you push, then this could be extended to a partially-cached, in-memory stack without breaking anything (popping already has 'undef R4' so if R4 is getting some deeper stack item that's okay).
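
Here's a sketch of my reading of this scheme (illustrative C++, with R2 taken as top-of-stack):

    #include <cstdint>

    // Three stack slots cached entirely in registers; r2 is top-of-stack.
    struct RegStack {
        uint32_t r2, r3, r4;
    };

    // Push (write to S): shift the slots down. The old value of r4 MAY be
    // spilled somewhere else, which is what allows extending this to a
    // partially-cached, in-memory stack later.
    void push(RegStack& s, uint32_t value) {
        // optional: spill(s.r4);
        s.r4 = s.r3;
        s.r3 = s.r2;
        s.r2 = value;
    }

    // Pop (read from S): shift the slots up; r4 is left undefined (or refilled
    // from the in-memory part of the stack in the partially-cached variant).
    uint32_t pop(RegStack& s) {
        uint32_t value = s.r2;
        s.r2 = s.r3;
        s.r3 = s.r4;
        // s.r4 is now undefined
        return value;
    }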

So we could specify that in Boot and BootX, the stack works like this, and then in OVM (where we want bigger smallstacks for macrocompilation), we could expand smallstack to size 16 or 32. Then when Boot instructions are executed (including in OVM), we could just not spill the stack (which gives us no memory access delay on stack ops), and then when we are in OVM mode, we could spill.

hmm, i kinda like that.. we keep some of the encoding density benefits of having a stack, while getting much faster execution.

otoh is it really that much faster? according to [17] and [18] and [19], L1 cache latency is not much worse than register; [20] says it is about 3x as slow though, but then talks about write-back and store-forwarding which reduce that. According to [21] there is about a 6-cycle latency for L1 cache. I guess that's significant (especially since to push or pop the stack we'd have to access L1 cache at least three times; once to read the stack pointer or depth, once to spill or unspill the stack, then once to write the new stack pointer or depth back). So we should still expect a significant speedup this way.
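
To make the three L1 touches concrete (illustrative struct, not the actual Boot layout):

    #include <cstdint>

    struct MemStack {
        uint32_t depth;
        uint32_t slots[256];
    };

    // A memory-backed push touches L1 at least three times, vs. zero extra
    // memory accesses for a stack held entirely in registers.
    void push_mem(MemStack* s, uint32_t value) {
        uint32_t d = s->depth;   // 1: read the stack depth/pointer
        s->slots[d] = value;     // 2: spill the pushed value
        s->depth = d + 1;        // 3: write the new depth back
    }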

even better, when using macroinstructions, what if we specify that stack locations above the 'current stack frame' shouldn't be accessed by the macroinstruction? Now we can give each macroinstruction 3 fresh smallstack locations (and maybe we put the arguments in there too, and they return in there too) and they won't be allowed to access the caller's smallstack locations. As long as the macroinstructions never need to access the caller's stack, now we can do the macroinstruction thing without making smallstack bigger (and having to spill the stack to memory during execution, except at the point where we call/return from macroinstructions).

does this mean that macroinstructions are distinguished from ordinary instructions in that they can't take operands of R2, R3, R4? No, because remember, the VM is copying the operands into/out of the smallstack when it calls the macroinstruction anyway; so if, say, R3 is given as an argument operand, the macroinstruction will find a copy of the value of R3 in its smallstack, and if it's the destination operand, the result will be copied into it afterwards.

what this does mean is that the macroinstruction implementation itself needs another stack, into/from which the smallstack is being spilled upon macroinstruction call.

so does this mean that now writing OVMlow in Boot is harder? Well, what we can do is incorporate the other stack into Boot, and add explicit PUSH and POP instructions which are like pushing/popping via S, except that they do spill to the rest of the stack. So the S register pushes/pops the small small stack, and push and pop work on the smallstack. Of course these would be distinct from push and pop that would work on memory stacks.

hmm, i kinda like this!

otoh, if we do this, is the S register really that useful? Macroinstructions can only access 3 stack slots anyways, so why not just use them as registers always and have S spill to memory like it currently does? Is the moving-stuff-around that a stack provides that useful if we only have 3 slots? I guess for the 8-bit encoding it helps a lot to have an S, but does it help a lot to have a faster S with only 3 slots?

in a way, the main gain is just that, because 3 stack locs are cached instead of only 1, the macroinstructions can do their thing without expensive stack spills -- even if the meaning of S is unchanged. So it might be worth it to cache the top 3 stack slots even if we don't have a way to push/pop only to those slots.

but my tentative guess is that, yes, changing the meaning of S will be worth it too.

---

so:

ok i think it's not worth it to change the definition of S. The times you really want a stack is when there is something like function calling, anyways.

of course, even 3 regs isn't enough to hold the 4 operands in the 64-bit encoding. But presumably one of those will often be the destination register.

---

since $0 doesn't need to be stored, a Boot x86_64 implementation can store the PC there. But it probably wants some more registers, in order to store its own stackpointer, and a few temporary registers.

so i think two of the 3 GPRs are the ones that should probably go into memory: $6, $7 (for two temporaries) and &6, &7.

That gives the implementation 5 registers to work with, which should be enough for the PC, a stackpointer, and 3 temporaries.

---

Displacement mode, if we have one in BootX, should be relative to r3, not tos.

Should we use a frame pointer for displacement mode instead of stack ptr? That would mean putting a FP in R3.

---

other name ideas: gemini biplane twin deuce janus bivalve binocular duplex duo bifocal bifold

---

i noticed that:

This all suggests to me that a strong/sequentially consistent (SC)/RW fence (presumably like Power's sync, ARMv7's dmb, RISC-V's FENCE RW,RW, RC11's fence with mode seq_cst) is the most common/important type of fence.

When you see that Boot has a FENCE instruction with modes like R,RW available, you start thinking about a bunch of hairy complexities like what the ramifications of that are, how it relates to C's fences with modes like seq_cst, etc. A Boot implementor could overlook the other types of fences and just upgrade them to a FENCE RW,RW, but before they do that they'll probably spend time trying to figure this other stuff out and get annoyed at how complicated it is, and worried that they don't understand it (at least, that's what i did); so let's not make them think.

So i'm removing the modes to the Boot FENCE instruction; there will only be one FENCE instruction, it will be RW and seq_cst.
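
On a C/C++ host, a conservative implementation would then just be the strongest fence C++ offers (whether that is exactly equivalent to a RISC-V FENCE RW,RW is one of the questions removed below, but it is at least as conservative as anything C++ provides; the name here is just illustrative):

    #include <atomic>

    // the single Boot FENCE, implemented maximally conservatively on a C++ host
    inline void boot_fence() {
        std::atomic_thread_fence(std::memory_order_seq_cst);
    }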

now i can remove the following notes from the spec, which were marked t o d o:

is this true: a FENCE rw,rw is like (?or mb, at least as strong as?) a C atomic_thread_fence with memory_order_seq_cst

do fences synchronize with other fences on the same memory_domain in an SC manner?

7 of the 16 possible #fence_type combinations in FENCE seem to be meaningless, as they imply that either the prior flags or the subsequent flags are all empty (, r, w, rw, ,r ,w ,rw). Do these have any effect? If not, should we use them for something else?

---

from my concordance, i see that "Note that the intersection of these C/C++ AMOs with the set of AMOs provided by RISC-V and LLVM is: swap, add, and, or, xor."

we already have swap (CAS). So i just now added:

add_amo ap_amo and_amo or_amo xor_amo

i also added sequentially consistent (SC) synchronization variants of these:

{add, ap, and, or, xor}_amo_sc

and also cat_sc

that's a lot of opcodes wasted on synchronization, though, i dunno if i like that. I think those guys could be built from the atomic load/store and fence and cas primitives anyways? Maybe this stuff should go in the OVM layer. Also, although Bridging the Gap between Programming Languages and Hardware Weak Memory Models has both CAS and FADD AMOs (figure 2. Programming language syntax), Repairing Sequential Consistency in C/C++11 appears to have only CAS.

Also, if the platform we are on has only CAS and doesn't support ll/sc, it's not immediately obvious to me how to implement add_amo without a runtime that reserves extra memory for a lock or somesuch. https://preshing.com/20150402/you-can-do-any-kind-of-atomic-read-modify-write-operation/ proposes a way but i think that would e.g. sometimes lose concurrent increments due to the ABA problem (e.g. if two threads are doing an atomic add at the same time, the shared variable might only be incremented once). Oh wait a minute, no it's fine; if another thread does an increment in between when the shared variable is fetched and the CAS, then the CAS fails the precondition; and if, in between the fetch and the CAS, one other thread increments, then another thread decrements, the precondition succeeds but that's fine because the semantics of atomic addition is fulfilled on the resulting state.
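
Concretely, the CAS-only implementation that the above argument is about would look something like this (C++ just for illustration; this is the relaxed-memory-order variant):

    #include <atomic>
    #include <cstdint>

    // add_amo built from a plain CAS loop: reload and retry until the CAS
    // succeeds. If other threads modify the location in between, the CAS fails
    // and we retry with the fresh value, so no concurrent updates are lost.
    void add_amo_via_cas(std::atomic<int32_t>& mem, int32_t addend) {
        int32_t expected = mem.load(std::memory_order_relaxed);
        while (!mem.compare_exchange_weak(expected, expected + addend,
                                          std::memory_order_relaxed,
                                          std::memory_order_relaxed)) {
            // compare_exchange_weak refreshes `expected` on failure; just retry
        }
    }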

ok in that case i guess it does make sense to have these in here. The CAS implementation of the relaxed memory order variants is so simple that we can leave it to the OVM level; but it may make sense to have the SC memory order variants, because although (i think?) those can be conservatively implemented by surrounding them with FENCE instructions (?is that correct?), some platforms (such as RISC-V) might offer more performant/fine-grained variants (that synchronize only those AMOs, rather than synchronizing everything like a fence).

---

should our ldrc and strc have RCpc or RCsc semantics?

"PLandDRF1allowforalloptimizationsofWOandRCsc[2,5."

2. S. V. Adve and M. D. Hill, A Unified Formalization of Four Shared-Memory Models, Computer Sciences Technical Report #1051, University of Wisconsin, Madison, September 1991. Submitted for Publication.

5. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta and J. Hennessy, Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors, Proc. 17th Annual Intl. Symp. on Computer Architecture, May 1990, 15-26.

" Ournewmemorymodel,calledPLpc3, extendsourpreviousframeworktoallowtheprogrammertoreasonwithSC,whileallowingtheadditionaloptimizationsofTSO,PSO,PCandRCpcaswell?.It achievesthisbyrequiringtheprogrammertoadditionallyidentifya commonclassofsynchronizationaccesseswhereoneprocesswaitsona locationuntila particularvalueis writtenbyanotherprocess. ... Wefurthercategorizethecompetingaccessesasloopornon-loop.Figure2 summarizesthiscategorization.Althoughourdefinitionsmayseemcom-plex,theintuitionis fairlysimpleandtheinformationwerequire,i.e.,whetheranaccessis competingornotandwhetherit is a loopaccessornot,is naturallyavailabletotheprogrammerinmostcases. ...

Definition 4: Loop and Non-Loop Reads: A competing read is a loop read if (i) it is the final read of a synchronization loop construct that terminates the construct, (ii) it competes with at most one write in any SC execution, and (iii) if it competes with a write in an SC execution, then it returns the value of that write; i.e., the write is necessary to make the synchronization loop construct terminate. A competing read that is not a loop read is a non-loop read. Definition 5: Loop and Non-Loop Writes: A competing write is a loop write if it competes only with loop reads in every SC execution. A competing write that is not a loop write is a non-loop write "

"However,sincemanycompileroptimizationsrequirethefullflexibilityofreorderingbothreadandwriteaccesses,webelievetheextragainfromthelimitedreorderingbetweencompetingaccessesasallowedbyPLpcwillbemoreprominentinhardwarethaninthecompiler."

kind of hard to read without spaces but i think the important parts are:

this suggests that we should just switch to RCsc, and furthermore that RCpc is hard to provide a DRF-SC like guarantee for.

hmm, actually these Armv8.3-A release notes say that C/C++ acquire/release/acq_rel memory orders are RCpc, not RCsc:

" Memory consistency model

The Armv8.0 support for release consistency is based around the “RCsc” (Release Consistency sequentially consistent) model described by Adve & Gharacholoo in [1], where the Acquire/Release instructions follow a sequentially consistent order with respect to each other. This is well aligned to the requirements of the C++11/C11 memory_order_seq_cst, which is the default ordering of atomics in C++11/C11.

Instructions are added as part of Armv8.3-A to support the weaker RCpc (Release Consistent processor consistent) model where it is permissible that a Store-Release followed by a Load-Acquire to a different address can be re-ordered. This model is supported by the use of memory_order_release/ memory_order_acquire /memory_order_acqrel in C++11/C11.

[1] Adve & Gharachorloo Shared Memory Consistency Models: A Tutorial " -- Armv8-A architecture: 2016 additions

---

just for future note, i always say that Oot doesn't care about performance, so why does Boot have non-atomic, relaxed, and release-consistency memory instructions? Wouldn't it be easier to only expose sequential consistency?

There are two answers. One is that Oot doesn't care about performance, but Boot does, because it's lower-level.

The second answer is that Oot is especially focused on concurrency, so i'm erring in the direction of making the Oot ecosystem support a wide variety of concurrency primitives.

---

i think the memory consistency model specification is getting too complicated and i should adopt something more like RVWMO with multi copy atomicity.

i think in fact, the implementation that would violate multi-copy atomicity is not even a distant server, but rather write-back caches shared between some but not all clients of that server; now clients who share a write-back cache see each others' writes before everyone else, b/c they see those writes before the server does! so there is sort of not even a single source of truth! i think multi-copy-atomicity can coexist with distant servers and network delays (as much as anything with a strong FENCE can, at least, and we will have that).

(not sure about what i just said there though...)

---

yeah, i think that's right; the multi-copy-atomicity constraint is that all other processors besides the one doing the write see the write at the 'same time'. But we're not talking clock time here, we're talking ordering vs. other events. So a big delay to reach the server is fine; all we need to prevent is the server somehow receiving the write, noting it and serving it to some clients, yet still serving the old value to other clients. One could imagine that some sort of inconsistent (in the CAP sense; CAP consistency is 'Every read receives the most recent write or an error') database would fail this. But i imagine that an inconsistent database could have other problems here (like not being able to properly implement FENCE), depending on what kind of inconsistency it was (that is, what the database's consistency model is).

anyhow, i started to sketch out a formal memory model and i quickly found it getting nearly as complicated as C++'s, which i think is too complicated. The only thing keeping me from using RISC-V's RVWMO model was my desire to keep it more flexible for relaxed implementations (eg ones without multi-copy atomicity), but i don't think this is worth all that complexity. And if we can just use RVWMO model then we have the huge benefits that the model was designed by experts, got lots of peer review, has already had related work done like the tables in the RISC-V spec telling you how to port memory ordering instructions for various other platforms to RISC-V and tooling like the excellent RMEM and herd which apparently has a RVWMO model [22], and has a community that we can go to with questions.

and after looking at this longer and looking at C++'s formal memory model, etc, i think the RISC-V one, as complicated as it is, is simpler than most others (maybe the Golang model is even simpler but not by much, and it's at a slightly higher level of abstraction). So multi-copy atomicity probably really is required for any sort of simplicity.

so i'm going to throw out my formal memory model sketch (actually, i'll copy it somewhere in case i change my mind) and go with RVWMO.

still have to decide on the discrepancies. For example, since we have the concept of non-atomic updates which RISC-V doesn't have except for mixed sizes, which it admits aren't really formalized well yet anyway, does their presence mess everything up? Or does our 'data races undefined' fix that? Currently, our 'data races undefined' constraint only requires no data races in every SC execution, but should we strengthen it to require no data races in every RVWMO execution? Probably, yes.

---

ok new questions to maybe ask the RVWMO folks:

Hi, the excellent RISC-V unprivileged spec, Appendex A, described how to implement C++ atomics in RISC-V, but i'm curious about the reverse problem.

i'm thinking about how one could implement RISC-V-style memory operations in C or C++ using C14/C++14 atomics, such that the new memory operations would provide the same guarantees as the corresponding RISC-V ones do; so that, if i used these new operations for all concurrency in some C++ program, that C++ program would obey all of the guarantees of the RVWMO memory model.

I'm particularly interested in how to map the following 9 RISC-V operations into C++ atomics (assume that all operations on shared memory are always 32-bit, aligned, and use only the new RISC-V style operations). As a start, i've provided the reverse of the mappings in Table A.6. This isn't necessarily the right answer, because maybe Table A.6 mapped some weaker C++ constructs to stronger RISC-V ones, but i need to map each RISC-V instruction to a C++ atomic which is at least as strong or stronger. If someone's already thought about this in code or in a paper, please point me to it.

So, in the mapping that i just gave, which items are wrong? For example, i assume that C++ memory_order_relaxed isn't multi-copy atomic, so does that mean that RISC-V lw shouldn't be mapped to C++ atomic_load(memory_order_relaxed) like i did above? The larger question is, in which items above is the C++ replacement on the right of the arrow weaker than the RISC-V instruction on the left of the arrow, taking into account that i want all of the guarantees of the RVWMO memory model to be upheld in the resulting C++ program?

thanks, b

---

save this other one for later:

1) in RISC-V atomics extension ('A'), the .aqrl mode of AMOs are specified to be sequentially consistent -- was this a choice, or a consequence of RVWMO? For example, in C++, there is memory_order_acq_rel, which is weaker than memory_order_seq_cst. Would it have been possible in RVWMO to make a mode of AMO which was both an acquire and a release, but not sequentially consistent (SC), or does the nature of RVWMO imply that anything which is both an acquire and a release is always also SC?

---

note to self: in that memory consistency order part in ootAssemblyNotes18.txt, the mnemonic that RISC-V was thinking of using for RCsc lr was lr.aqrl.

---

in my analysis of RISC-V Genealogy in the Concordance, here are 15 core instructions:

This is 15 instructions.

This suggests a simpler 16-bit encoding (swap out BEQ for BLT; and in the interpretation of ADDI's constant, have spots for INT32_SZ and PTR_SZ)... of course we want that darn encoding length format bit. We could add a MORE instruction and stuff the shifts, logicals, SUBs, and SCALL into 2-operand instructions under the MORE. Or, we could do what we did, and reduce the number of registers from 16 to 8, freeing up 3 bits, one of which is spent on encoding length format, with 2 bits left.

---

Yesterday I asked RISC-V isa-dev how it would be possible to write a program in C++ respecting RVWMO:

https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/SoJMEZSK700

--

Hi, the excellent RISC-V unprivileged spec, Appendex A, described how to implement C++ atomics in RISC-V, but i'm curious about the reverse problem.

i'm thinking about how one could implement RISC-V-style memory operations in C or C++ using C14/C++14 atomics, such that the new memory operations would provide the same guarantees as the corresponding RISC-V ones do; so that, if i used these new operations for all concurrency in some C++ program, that C++ program would obey all of the guarantees of the RVWMO memory model.

I'm particularly interested in how to map the following 9 RISC-V operations into C++ atomics (assume that all operations on shared memory are always 32-bit, aligned, and use only the new RISC-V style operations). If someone's already written a C++ library for this or thought about it on a mailing list or in a paper, please point me to it. As a start, i've provided below the reverse of the mappings in Table A.6. This isn't necessarily the right answer, because maybe Table A.6 mapped some weaker C++ constructs to stronger RISC-V ones, whereas i need to map each RISC-V instruction to a C++ atomic which is at least as strong or stronger.

(btw, should i have written lw.aqrl and sw.aqrl for the RCsc hypotheticals?)

So, in the mapping that i just gave, which items are wrong? For example, i assume that C++ memory_order_relaxed isn't multi-copy atomic, but i want my C++ program to obey RVWMO just like RISC-V does, so does that mean that RISC-V lw shouldn't be mapped to C++ atomic_load(memory_order_relaxed) like i did above? The larger question is, in which items above is the C++ replacement on the right of the arrow weaker than the RISC-V instruction on the left of the arrow, taking into account that i want all of the guarantees of the RVWMO memory model to be upheld in the resulting C++ program?

thanks, bayle

--

(WHY did i misspell 'Appendix' in the very first line?!? i should have used a spellchecker)

No replies yet. The more i think about it, the more i think that RVWMO's multi-copy-atomicity property can only be achieved (blindly/without knowing which program we are compiling) by using only memory_order_seq_cst in C++.

The RISC-V spec says: "The RVWMO memory model is defined in terms of the global memory order, a total ordering of the memory operations produced by all harts. In general, a multithreaded program has many different possible executions, with each execution having its own corresponding global memory order. The global memory order is defined over the primitive load and store operations generated by memory instructions."

So it seems to me that ALL memory operations participate in this global memory order.

How is having a global memory order different from sequential consistency? Lamport defined 'sequential consistency' as "... the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program."[1]

It seems therefore that RVWMO satisfies the first condition of 'sequential consistency' but not the second; as the spec says, "The global memory order for any given execution of a program respects some but not all of each hart’s program order.". Which is why the rest of the RVWMO model spec focuses on defining WHICH parts of program order are 'preserved' in the 'preserved program order' ("The subset of program order that must be respected by the global memory order is known as preserved program order.").

So, in RVWMO, ALL memory operations, not just ones with RCsc annotations, not just synchronization ops, are in a global memory order.

How do different processors ever observe things in different orders then? I think it's because the 'Load Value Axiom' allows a read to read EITHER the last write before the read in the Global Memory Order, OR the last write before the read in Program Order.

Now let's look at C++. I don't have the actual standard, but here's a draft: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf

The 'memory_order's are defined in clause 29.3, page 1145. 29.3.3 on page 1146:

" There shall be a single total order S on all memory_order_seq_cst operations, consistent with the “happensbefore” order and modification orders for all affected locations, such that each memory_order_seq_cst operation B that loads a value from an atomic object M observes one of the following values: —(3.1)the result of the last modificationAofMthat precedesBinS, if it exists, or —(3.2)ifAexists, the result of some modification ofMthat is notmemory_order_seq_cstand that does nothappen beforeA, or —(3.3)ifAdoes not exist, the result of some modification ofMthat is notmemory_order_seq_cst "

and 29.3.8 says:

" Note: memory_order_seq_cst ensures sequential consistency only for a program that is free of data racesand uses exclusivelymemory_order_seq_cstoperations. Any use of weaker ordering will invalidate thisguarantee unless extreme care is used. In particular,memory_order_seq_cstfences ensure a total orderonly for the fences themselves. Fences cannot, in general, be used to restore sequential consistency for atomicoperations with weaker ordering specifications.— end note] "

And i don't see anything about a single total order for the other non-seq_cst memory_orders. There are a bunch of statements about how seq_cst fences order atomic operations even if they are non-seq_cst however; but the note at 29.3.8 says that it's hard to make use of that to order everything else. How could this be?

The first answer to [23] suggests that the problem is that "observing thread can simply observe in a different order, and will not matter what fences you are using in the observed thread", that is, if the observing thread has a weaker memory order then the seq_cst fence will order writes but another observing thread may in some cases disregard this ordering.

https://plv.mpi-sws.org/scfix/paper.pdf also goes into this in Section 2.2, "Second Problem: SC Fences are too weak". They say that "In general, one might expect that inserting an SC fence between every two instructions restores sequential consistency. This holds for hardware memory models, such as x86-TSO, Power, and ARM, for programs with aligned word-sized accesses (for their analogue of SC fences), but holds neither in the original C11 model nor in its strengthening [5]...". They give a litmus test 'RWC+sync':

Thread 1: x :=_rlx 1
Thread 2: a := x_rlx 1; fence_sc; b := y_rlx 0
Thread 3: y :=_rlx 1; fence_sc; c := x_rlx 0
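
In C++ terms (my own transcription of the litmus test above, not code from the paper), that is roughly:

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int a, b, c;

    void t1() { x.store(1, std::memory_order_relaxed); }

    void t2() {
        a = x.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        b = y.load(std::memory_order_relaxed);
    }

    void t3() {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        c = x.load(std::memory_order_relaxed);
    }

    // the outcome in question is a == 1 && b == 0 && c == 0, which you would
    // expect the two seq_cst fences to rule out

    int main() {
        std::thread th1(t1), th2(t2), th3(t3);
        th1.join(); th2.join(); th3.join();
        return 0;
    }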

They say that this is allowed by the C11 standard (and they propose a fix; did later versions of C++ accept their fix? http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0668r1.html refers to it, was P0668R1 adopted?).

So, it sounds like it's not easy to see how to satisfy RVWMO (which demands a global order over all memory operations) if any memory operation at all is not a seq_cst C++ atomic.

And memory_order_seq_cst implies that the memory operation so labelled is both an acquire and a release. Unfortunately, i guess C++ doesn't provide a way to demand a global total order (which i guess might be another way of saying MCA?) without also ordering each and every memory operation. In this way, RVWMO is 'weaker' (it allows you to demand a global total order without also demanding ordering every memory operation), but in another way, C++ is weaker (it allows you to not demand a global total order); overall, C++ is weaker, since operations using the C++ relaxed memory model are weaker than anything in RVWMO. wait, the above paragraph is INCORRECT; C++ seq_cst memory_order actually means " A load operation with this memory order performs an acquire operation, a store performs a release operation, and read-modify-write performs both an acquire operation and a release operation, plus a single total order exists in which all threads observe all modifications in the same order (see Sequentially-consistent ordering below) " [24]. So you CAN have an acquire without a release in RCsc in C++!

The bit above about needing to use only memory_order_seq_cst for "SC for DRF" to apply is also important to me.

So, if i'm right, what should we do?

a) use RVWMO (with inefficient C++ compilation for weaker operations)
b) use C++ (with its complicated memory consistency model)
c) use sequential consistency (because it's the 'least upper bound' of RVWMO and C++)
d) mostly use RVWMO but with a carve-out for 'relaxed' ordering (exclude those from the global order, but guarantee that fences around them do order them)
e) something else (like RVWMO with carve-outs for both acq/rel and relaxed)

An argument against (a) and (c) is that they don't allow us to take advantage of truly relaxed ordering (which could be good for parallel reduce, which we intend to use a lot in Oot).

An argument for (c) is that the fact that we are having this discussion is even more evidence that this stuff is too complicated, so we need to just go with the simplest thing, which is SC. Higher-level programs will want SC anyways, the non-SC stuff was just for concurrency libraries.

An argument against (d) is that now i, a non-expert without the time to prove correctness theorems, am fiddling with the semantics, and i'll probably break them.

---

Aside: now that i know a little more i understand the RC11 paper a little better.

One aside to the aside is that they keep talking about acyclicity conditions; they explain that the reason is that acyclicity of a partial order that is the union of a bunch of relations is equivalent to the existence of a total order containing each of those relations.

Now, here are the things that the RC11 paper tries to fix, and what i think of it:

IRIW-acq-sc:
Thread 1: x :=_sc 1
Thread 2: a := x_acq 1; c := y_sc 0
Thread 3: b := y_acq 1; d := x_sc 0
Thread 4: y :=_sc 1

Z6.U:
Thread 1: x :=_sc 1; y :=_rel 1
Thread 2: b := FAI(y)_sc 1; c := y_rlx 3
Thread 3: y :=_sc 3; a := x_sc 0

their fix imo makes things more complicated rather than more simple; apparently the C11 standard requires "that hb (on SC events) be included in S" (hb is happens-before, and S is the global total order induced by sequential consistency); they want to weaken this by adding an extra condition to hb edges before they need to be included in S: "provided that the hb-path between the two events either starts and ends with sb edges, or starts and ends with accesses to the same".

(sb is sequenced-before, the program order; rf is reads-from and mo is modification-order)

The final model is pretty short but pretty dense (section 3.2).

I disagree with their first fix (making IRIW-acq-sc and Z6.U allowed) but i agree with the others; that is, i agree with their strengthening and disagree with their weakening. The resulting model, while easier to read than the C++ standard, is still complicated, however.

p0668r1 makes the point that my preference of leaving IRIW-acq-sc and Z6.U disallowed at the cost of worse efficiency on Power also leads to less efficiency on GPUs:

"Nvidia GPUs have a memory model similar to Power, and share the same issues, probably with larger cost differences. For very large numbers of cores, it is natural to share store buffers between threads, which may make stores visible to different threads in inconsistent orders. This lack of "multi-copy atomicity" is also the core distinguishing property of the Power and ARM memory models. We suspect that other GPUs are also affected, but cannot say that definitively."

---

http://nil.csail.mit.edu/6.824/2016/notes/gomem.pdf says that the intent of the read axioms in the Golang memory model was to be "like DRF-SC without allowing compilers to ruin programmers’ lives."

i like that! (for Oot, not for Boot, because Boot must be easily compilable to C++)

The aforementioned rules are: " A read r is allowed to observe a write w to location v if:

A read r is therefore guaranteed to observe w if:

---

interestingly here Daniel Lustig, as an Nvidia guy, praises non-multi-copy-atomicity: http://mail.openjdk.java.net/pipermail/jmm-dev/2018-May.txt

in contrast to the talk https://content.riscv.org/wp-content/uploads/2018/05/14.25-15.00-RISCVMemoryModelTutorial.pdf where he seems supportive of multi-copy-atomicity (but i guess, as Chair, he was just describing the decisions reached)

the thesis https://www.repository.cam.ac.uk/bitstream/handle/1810/292229/Top.pdf?sequence=1&isAllowed=y gives more insight into RISC-V's decision-making process for multi-copy atomicity:

" Earlier versions of the RISC-V ISA manual had a relaxed non-multicopy-atomic concurrencymodel [120]. In the process of checking the C11 compilation scheme for RISC-V, Trippel et al.[120]identified several issues in this earlier memory model: it had no cumulative barriers, didnot enforce ordering resulting from dependencies, and allowed read-read coherence violations[120]. To address these shortcomings the RISC-V Memory Model Task Group was formed [26].Shaked Flur, Peter Sewell, and the author joined this Task Group in August 2017 and havesince been active participants. At that point the group was discussing various design choices for ahypothetical multicopy atomic relaxed model enforcing most syntactic dependencies, with draftaxiomatic and operational formalisations, but also the question of whether RISC-V should adoptthe TSO memory model. The Task Group’s work took place in online mail discussions and weeklyphone conference meetings, and involved debating various semantics choices: whether RISC-Vshould allow load-store re-ordering, the TSO question, and various details of a potential relaxedmodel, including same-address ordering choices, dependency questions, the semantics of certainbarriers and release/acquire, compilation schemes for C/C++, etc.After much discussion the Task Group has converged on a proposal for the RISC-V memory modelthat closely follows that of MCA ARMv8, but also incorporates some changes motivated by thedeadlock issues in the operational modelling of ARMv8’s load/store exclusive instructions describedin Section 6.3.10 (which similarly affect atomic memory operations). The RISC-V architecture alsohas an optional Ztso extension that identifies CPUs implementing the simpler TSO memory model.The group’s chair Daniel Lustig presented the proposed concurrency architecture at the RISC-Vworkshop in Barcelona, 7-10 May 2018. After the group’s informal internal voting process in July2018 that voted in favour of the proposal the model was officially ratified in late September 2018.

[120] C. Trippel, Y. A. Manerkar, D. Lustig, M. Pellauer, and M. Martonosi. TriCheck: Memory model verification at the trisection of software, hardware, and ISA. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2017, Xi'an, China, April 8-12, 2017, pages 119–133, 2017. doi: 10.1145/3037697.3037719. URL https://doi.org/10.1145/3037697.3037719. "

" Multicopy atomicityThe question of whether RISC-V should have a non-multicopy-atomicmodel was mostly settled at the time we joined the Task Group. Although non-multicopy-atomicitywas deemed to be potentially beneficial for highly-parallel many-core systems, and “minion cores”(small cores to which the main core offloads certain I/O intensive tasks, see for example [36]),the complexity non-multicopy-atomicity would likely have added to the memory model wasconsidered too great. Consequently, the RISC-V model is multicopy atomic. "

and here was some slides during which MCA for RISC-V was a "Major debate!":

https://riscv.org/wp-content/uploads/2017/05/Wed1000am-Memory-Model-Lustig.pdf

so sometime in between 05/2017 and 05/2018 the issue was settled. The Tricheck paper was submitted to arxiv 26 Aug 2016.

So sometime between 26 aug 2016 and 05/2017 the MCA issue became a 'major debate', and then was settled before the end of 05/2018.

ah, https://docs.google.com/viewer?a=v&pid=forums&srcid=MDQwMTcyODgwMjc3MjQxMjA0NzcBMDMwNDk4NTA2OTU5OTIyNTQ3MzcBd0x3TklFcERBd0FKATAuMQFncm91cHMucmlzY3Yub3JnAXYy&authuser=6

shows that the debate was resolved by Nov. 28, 2017

(interestingly, it also says 'ld.aqrl and sd.aqrl means RCsc, not fully fenced')

the memory model working group mailing list is at:

https://lists.riscv.org/g/tech-memory-model

the archives are closed except for group members. I wonder if they would like me to join, or only active, committed people.

also, here was the announcement about the media attention w/r/t the Tricheck paper's findings and the formation of the memory model task group: https://riscv.org/2017/04/risc-v-memory-consistency-model/

announcement of the memory model committee: https://groups.google.com/a/groups.riscv.org/forum/#!searchin/isa-dev/Memory$20model%7Csort:date/isa-dev/Oxm_IvfYItY/RsqlbaBoAAAJ

stuff they wanted to cover:

" - Decide whether global store (multi-copy) atomicity is beneficial/acceptable/unacceptable. This fundamentally affects the design of the rest of the model, and hence will be the top technical priority to start

(interesting email list platform btw; https://groups.io/)

---

here's the tricheck paper; now that i know more i can appreciate it better:

http://mrmgroup.cs.princeton.edu/papers/ctrippel_ASPLOS17.pdf

it analyzes the pre-MCA version of RISC-V and goes into detail about many of the questions i had, such as, do fences sync cross-thread? globally?

the early RISC-V draft version looks similar to my draft ideas! it was relaxed, not MCA, and had no dependency ordering.

they discuss how, in order to implement C++ acq/rel atomics with fences, you have to have some sort of 'cumulative' fence, which the original version of RISC-V did not have. This is b/c acq/rel semantics are TRANSITIVE, that is, section 5.1.1 says "...C11 release-acquire synchronization is required to be transitive, ordering both accesses before a release in program order and accesses that were observed by the releasing core prior to the release.". The old RISC-V draft didn't have cumulative fences.

Specifically, to enforce only acq/rel orderings you need cumulative lightweight fences, and to enforce seq cst you need cumulative heavyweight fences, as defined in 2.3.3:

" Compiler mappings for well-known nMCA architectures such as Power and ARMv7 enforce all cumulative ordering requirements of the C11 release-acquire pair on the release side, leaving the acquire side implementation non-cumulative. In this case, the cumulative fence on the release side would require that reads and writes in the predecessor set be ordered with writes in the successor set, and that reads in the predecessor set be ordered with reads in the successor set (i.e., cumulative lightweight fence). Power mappings implement release operations using a similar cumulative lightweight fence (lwsync). With all cu- mulative orderings being enforced on the release side, acquire operations only need to locally order reads before the fence with reads and writes after the fence. Power and ARMv7 can implement acquire operations using non-cumulative fences (e.g., ctrlisync and ctrlisb, respectively 5 ). ... posed mappings for the RISC-V ISA in Section 4. C11 also supports sequentially consistent (SC) atomics. An SC load is an acquire operation and an SC store is a release operation, and there must also be a total order on all SC atomic operations that is respected by all cores. As such, the program in Figure 4 must forbid the listed outcome, as there is no total order of SC operations that would allow it. At the architecture level, cumulative lightweight fences as described for release-acquire synchronization are not sufficient to implement the required ordering for this program. Even if a cumulative lightweight fence was placed between each pair of loads on T2 and T3, neither T2 nor T3 reads from writes after the fences, so the writes observed before the fences need not be propagated to other cores. Instead, fences used to implement C11 SC atomics must be cumulative fences that order reads and writes before the fence with reads and writes after the fence (i.e., cumulative heavyweight fence). Power and ARMv7 use cumulative heavyweight fences (sync and dmb, respectively) to implement C11 SC atomics. "

they also note that many projects such as Linux require some way to enforce dependency ordering.

they recommended adding dependency ordering and lightweight and heavyweight cumulative fences (and similarly, making AMO releases cumulative), forbidding load-load reordering on the same address, adding AMO modes that allow sequential consistency with only acq or only rel rather than only .aqrl being SC for AMOs, and weakening the RISC-V draft's acquire/release synchronization so that releases only synchronize with acquires (in the draft, even a relaxed read of a previous acquire would synchronize with that acquire).

---

so it sounds like what happened is that the original RISC-V draft, probably involving Lustig, supported memory orders of {relaxed, acquire, release, SC} and was not MCA; TriCheck pointed out these problems with the early RISC-V draft and made proposals to fix them without changing the nature of the draft too much; RISC-V formed the memory model consistency committee to fix it; i bet they thought that when they had tried to wing it without an axiomatic model before, they made mistakes, so maybe the formal model is needed; they probably then tried to make a formal axiomatic model with TriCheck's fixes (or maybe the TriCheck guys had already made one), found out that that model was very complicated and confusing, and then RISC-V's ultimate response was to say, "the formal memory model required to support non-MCA and define fence cumulativity is too complicated, so instead let's require MCA and now everything's simple". Which is very different from what RISC-V had originally tried to do (as it doesn't support GPU-type things as well due to the requirement for a global ordering).

so, they started out with similar goals as me, went through a similar process to what i'm going thru, and ultimately decided that simplicity in the formal model was more important than relaxedness, and chose MCA. Which i guess is an argument for me to choose MCA too.

Also note that ARM didn't use to have MCA, but they added it with v8.

otoh i'm focused on massive parallelism in a way that they are not.

transitive acquire/release with cumulative fences does sound complicated; i can see replacing that with a global order to simplify things (so i guess i like RCsc over RCpc -- not sure if that's the same thing but it sounds like it). But relaxed ordering does not seem too complicated.

So perhaps there is a way to start with the RVWMO memory consistency model and then add in the concept of relaxed memory accesses which are not globally ordered; the alternative (d) above:

d) mostly use RVWMO but with a carve-out for 'relaxed' ordering (exclude those from the global order, but guarantee that fences around them do order them)

So i'd have:

Note that, as the 'Strengthen SC fences' part of P0668R1 points out, we want "a program with only memory_order_relaxed accesses and memory_order_seq_cst fences between each pair" to have sequential consistency.

i don't think i'm up for making a formal model of this; it would entail

TODO implement that.

---