proj-oot-ootAssemblyNotes26

here's some discussion on RCsc vs RCpc for RISC-V:

https://lists.riscv.org/g/tech-memory-model/message/1123

---

someone else noted https://lists.riscv.org/g/tech-memory-model/message/1164 that, in general, you can have SC-annotated loads and stores that are not acquires and releases (and ARMv8 has them).

---

OK i finished skimming through the tech-memory-model RISC-V list. It's a shame that the archives aren't publicly visible, it's a goldmine that i'm sure would be useful to memory model students and other professional memory model people of the future, but without public visibility they won't get Internet Archived and they'll probably get lost sometime.

Those guys are super-experts who can compose partial orders and relate that to operational microarchitectural details in their sleep. And yet a number of issues came up at every step of the way that they had to work out. Which makes me, if anything, even more hesitant to muck around with this stuff than i was.

anyway here's my current takeaway w/r/t what kind of model we need for Boot:

---

btw the FENCE instruction may be troublesome: not only does CppMem not support it, but ThreadSanitizer doesn't, and doesn't plan to:

" > Using ifdef's is in principle possible, though improving the tool to better handle standalone fences has its own value I believe.

Not handling standalone acquire/release fences is somewhat intentional. The problem is that they have global effect on thread synchronization. Namely, a release fence effectively turns all subsequent relaxed stores into release stores, and synchronization operations are costly to model as compared to relaxed atomic operations. An acquire fence is even worse because it turns all preceding relaxed loads into acquire loads, which means that when we handle all relaxed loads as not-yet-materialized acquire loads just in case the thread will ever execute an acquire fence later.

Even if we implement handling of acquire/release fences in tsan as an optional feature, chances are that we won't be able to enable this mode on our projects because they are large in all respects (to put it mildly).

Combining memory ordering constraints with memory access also leads to more readable code. Which is usually a better trade-off for everything except the most performance-critical code (e.g. TBB).

But, yes, it's incomplete modelling of C/C++ memory model. We just happened to get away that far (checking 100+MLOC) without stand-alone fences.

> If not that, I think we would need to better understand what exactly cannot be handled well by the tool, in order to recognize these patterns in the code and find out how to change it.

As far as I can tell that's only stand-alone acquire/release fences. seq_cst fences should work as long as they are not used also as acquire/release fences, e.g. in the test.cpp above we use seq_cst fence but also annotate memory accesses with acquire/release as necessary. " -- [1]
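to make the tradeoff the TSan devs are describing concrete, here's a minimal sketch (mine, not theirs) of the two styles: a release store with the ordering fused to the access, versus a relaxed store promoted by a standalone release fence. Both publish correctly; TSan only models the first well.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> payload{0};
std::atomic<int> ready{0};

// Style 1: ordering fused with the access (what TSan models well).
void publish_fused() {
    payload.store(42, std::memory_order_relaxed);
    ready.store(1, std::memory_order_release);
}

// Style 2: standalone fence (what TSan declines to model): the release
// fence effectively turns the subsequent relaxed store into a release.
void publish_fenced() {
    payload.store(42, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(1, std::memory_order_relaxed);
}

// Either way, an acquire load of 'ready' that sees 1 also sees payload == 42.
int consume() {
    while (ready.load(std::memory_order_acquire) == 0) { }
    return payload.load(std::memory_order_relaxed);
}
```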

also, Peter Sewell pointed me to section 6.3 of https://www.cs.kent.ac.uk/people/staff/mjb211/docs/toc.pdf , which shows (among other things) that the C memory model can be simplified if you don't use fences.

i have two use-cases for fences:

Perhaps there are other ways to do each of these. I should ask Sewell and Batty.

---

OpenCL has a (large) subset of C11 memory ops:

https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/atomicFunctions.html

https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/memory_order.html

they leave out memory_order_consume, but don't change too much else imo

---

apparently "(po U rf) acyclic" is the common way to try to rule out OOTA, e.g. [2]. Why don't ppl like that?
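to make "(po U rf) acyclic" concrete, here's a toy check (encoding and names mine) over the candidate thin-air execution of the standard two-thread load-buffering example: each thread reads one variable and stores what it read to the other. Program order gives edges within each thread, reads-from gives edges across threads, and the speculative execution where each load reads the other thread's store forms exactly the cycle the axiom forbids.

```cpp
#include <functional>
#include <vector>

// Events of the candidate OOTA execution (r1 = r2 = 42):
//   0: T1 load x    1: T1 store y    2: T2 load y    3: T2 store x
// po: 0->1, 2->3    rf: 1->2 (store y feeds load y), 3->0 (store x feeds load x)
bool po_union_rf_has_cycle() {
    std::vector<std::vector<int>> adj{{1}, {2}, {3}, {0}};
    std::vector<int> state(4, 0); // 0 = unvisited, 1 = on DFS stack, 2 = done
    std::function<bool(int)> dfs = [&](int v) {
        state[v] = 1;
        for (int w : adj[v]) {
            if (state[w] == 1) return true;           // back edge: cycle
            if (state[w] == 0 && dfs(w)) return true;
        }
        state[v] = 2;
        return false;
    };
    for (int v = 0; v < 4; ++v)
        if (state[v] == 0 && dfs(v)) return true;
    return false;
}
```

a model that requires po U rf to be acyclic simply declares this execution inconsistent, which is why it rules out OOTA.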

https://escholarship.org/content/qt2vm546k1/qt2vm546k1.pdf suggests that there are two main ways to rule out OOTA, enforcing dependency ordering, and enforcing load-store ordering. I guess i see why ppl don't want to do either of those.

dependency ordering is hard to compile and i guess if load-store reordering is so important to the RISC-V guys then we don't want to rule it out.

" Cost of Forbidding Load-Store ReorderingUnlike? the dependency-preserving approach, forbidding load-store reordering to avoid out-of-thin-air behaviors only affects relaxed atomics in C/C++11. Hence, for example, theload-store-order-preserving approach should impose no overhead on the SPEC CPU2006benchmarks because they do not use any C/C++ relaxed atomics. We believe that re-laxed atomics will primarily appear in concurrent data structure code, while most otherprogram code would not be affected since they would likely use other primitives that pro-vide stronger semantics, e.g., locks and atomics withmemoryorderseqcst. "

ok, from the same link, apparently an OOTA example is:

Thread 1: r1 = x; y = r1;

Thread 2: r2 = y; x = r2;

where everything is memory_order_relaxed. In C/C++ apparently r1=r2=42 is allowed.
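here's the example written out in actual C++11 atomics, with a driver (names mine). No real implementation exhibits the thin-air outcome: starting from x = y = 0, every store writes a value that some load actually read, so in practice both loads return 0.

```cpp
#include <atomic>
#include <thread>
#include <utility>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() { r1 = x.load(std::memory_order_relaxed); y.store(r1, std::memory_order_relaxed); }
void thread2() { r2 = y.load(std::memory_order_relaxed); x.store(r2, std::memory_order_relaxed); }

// C/C++11 formally allows r1 == r2 == 42 here; no hardware produces it.
std::pair<int,int> run_litmus() {
    x.store(0); y.store(0);
    std::thread a(thread1), b(thread2);
    a.join(); b.join();
    return {r1, r2};
}
```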

Actually i don't see anything wrong here. My take is that it's the programmer's fault that a cycle is created. This program should be declared illegal.

i guess in a situation like OVMhigh or Java, where you don't want to forge pointers, you might worry about it though. but i think the 'must read from a value that was placed there after EVENT' restriction solves that.

---

if load-store reordering is so important to the RISC-V guys, and it only affects relaxed atomics in C/C++11, then i guess we do want relaxed atomics?

no, that doesn't follow; load-store reordering can be used for non-atomic accesses in any case.

---

http://altair.cs.oswego.edu/pipermail/memory-model-design/2018-July/000089.html is skeptical of cost/benefit ratio for quantum atomics:

" My intuition is also that quantum atomics are too weak for as a general purpose memory_order_relaxed replacement. We commonly write code that does an initial relaxed load of an atomic value (e.g a Java lock-word) parses the results, and then proceeds accordingly. I don't think we really want to reason about what would happen if we read a random value, and about all the nonsensical control paths that might trigger. "

---

"Sequential consistency is known for its simplicity [26], and in-deed, any C11 or OpenCL? program usingexclusivelySC atomicswould enjoy a simple interleaving semantics. However, when com-bined with the more relaxed memory orders that C11 and OpenCLalso? provide, the semantics of SC atomics becomes highly complex,and it is this complexity that we tackle in this paper

https://arxiv.org/pdf/1503.07073.pdf

Overhauling SC Atomics in C11 and OpenCL. Mark Batty (University of Kent, UK, m.j.batty@kent.ac.uk), Alastair F. Donaldson (Imperial College London, UK, alastair.donaldson@imperial.ac.uk), John Wickerson

---

https://arxiv.org/pdf/1503.07073.pdf provides some suggestions to simplify the formalization of the C/C++ memory consistency order model

---

this paper mentions my question of how to deal with mallocing and freeing atomics:

https://www.cs.kent.ac.uk/people/staff/mjb211/docs/the_problem_of_programming_language_concurrency_semantics.pdf

their issue is a little different and more insurmountable than mine, but they don't have a great answer for theirs.

---

i'm not sure anymore that it's a good idea to leave out RCpc:

C++ now has a function std::reduce, which requires that the reduction operator is both associative and commutative. Since this was a key motivation for my use of relaxed atomics, let's see which memory_order it uses.

as of a few months ago, it wasn't implemented yet in either GCC or Clang [3] , but it is in:

https://github.com/intel/parallelstl

in that, it seems to be implemented here:

https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/include/tbb/parallel_reduce.h

the key part is probably near the top, starting at line 42:

    //! Task type used to combine the partial results of parallel_reduce.
    /** @ingroup algorithms */
    template<typename Body>
    class finish_reduce: public flag_task {
        //! Pointer to body, or NULL if the left child has not yet finished.
        bool has_right_zombie;
        const reduction_context my_context;
        Body* my_body;
        aligned_space<Body> zombie_space;
        finish_reduce( reduction_context context_ ) :
            has_right_zombie(false), // TODO: substitute by flag_task::child_stolen?
            my_context(context_),
            my_body(NULL)
        {
        }
        ~finish_reduce() {
            if( has_right_zombie )
                zombie_space.begin()->~Body();
        }
        task* execute() __TBB_override {
            if( has_right_zombie ) {
                // Right child was stolen.
                Body* s = zombie_space.begin();
                my_body->join( *s );
                // Body::join() won't be called if canceled. Defer destruction to destructor
            }
            if( my_context==left_child )
                itt_store_word_with_release( static_cast<finish_reduce*>(parent())->my_body, my_body );
            return NULL;
        }
        template<typename Range,typename Body_, typename Partitioner>
        friend class start_reduce;
    };

you can see one thing that looks like an atomic here, 'itt_store_word_with_release', which looks like it atomically stores the result of the reduction. Elsewhere in the file there is also 'itt_load_word_with_acquire', and there is also some sort of allocation stuff and task stuff which may be concurrency library calls. So let's find the implementation of 'itt_store_word_with_release' and see what memory_order it is. It appears to be from:

 tbb/src/tbbmalloc/Customize.h https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/src/tbbmalloc/Customize.h
    template <typename T>
    inline void itt_store_word_with_release(T& dst, T src) {
    #if TBB_USE_THREADING_TOOLS
        call_itt_notify(releasing, &dst);
    #endif /* TBB_USE_THREADING_TOOLS */
        FencedStore(*(intptr_t*)&dst, src);
    }

so it's really FencedStore. FencedStore appears to be from

 tbb/src/tbbmalloc/Synchronize.h https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/src/tbbmalloc/Synchronize.h

    inline void FencedStore( volatile intptr_t &location, intptr_t value ) {
        __TBB_store_with_release(location, value);
    }

so it's really __TBB_store_with_release. __TBB_store_with_release appears to be from

 tbb/include/tbb/tbb_machine.h https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/include/tbb/tbb_machine.h

    template<typename T, typename V>
    inline void __TBB_store_with_release(volatile T& location, V value) {
        machine_load_store<T,sizeof(T)>::store_with_release( location, T(value) );
    }

so it's really machine_load_store::store_with_release. This appears to be defined in the same file, but also in other files. In this file, we have a few defns:

    #if __TBB_USE_GENERIC_HALF_FENCED_LOAD_STORE
    /* Fenced operations use volatile qualifier to prevent compiler from optimizing
       them out, and on architectures with weak memory ordering to induce compiler
       to generate code with appropriate acquire/release semantics. On architectures
       like IA32, Intel64 (and likely Sparc TSO) volatile has no effect on code gen,
       and consistency helpers serve as a compiler fence (the latter being true for
       IA64/gcc as well to fix a bug in some gcc versions). This code assumes that
       the generated instructions will operate atomically, which typically requires
       a type that can be moved in a single instruction, cooperation from the
       compiler for effective use of such an instruction, and appropriate alignment
       of the data. */
    template <typename T, size_t S>
    struct machine_load_store {
        static T load_with_acquire ( const volatile T& location ) {
            T to_return = location;
            __TBB_acquire_consistency_helper();
            return to_return;
        }
        static void store_with_release ( volatile T &location, T value ) {
            __TBB_release_consistency_helper();
            location = value;
        }
    };

    // in general, plain load and store of 32bit compiler is not atomic for 64bit types
    #if __TBB_WORDSIZE==4 && __TBB_64BIT_ATOMICS
    template <typename T>
    struct machine_load_store<T,8> {
        static T load_with_acquire ( const volatile T& location ) {
            return (T)__TBB_machine_load8( (const volatile void*)&location );
        }
        static void store_with_release ( volatile T& location, T value ) {
            __TBB_machine_store8( (volatile void*)&location, (int64_t)value );
        }
    };
    #endif /* __TBB_WORDSIZE==4 && __TBB_64BIT_ATOMICS */
    #endif /* __TBB_USE_GENERIC_HALF_FENCED_LOAD_STORE */

what is __TBB_release_consistency_helper? But first, one of the other files that defines this is

 tbb/include/tbb/machine/gcc_generic.h https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/include/tbb/machine/gcc_generic.h

    template <typename T, size_t S>
    struct machine_load_store {
        static T load_with_acquire ( const volatile T& location ) {
            return __TBB_machine_atomic_load<T, __ATOMIC_ACQUIRE>(location);
        }
        static void store_with_release ( volatile T &location, T value ) {
            __TBB_machine_atomic_store<T, __ATOMIC_RELEASE>(location, value);
        }
    };

aha! this appears to be from GCC's atomics, which are modeled after C/C++11, and this appears to be the memory_order_release: http://www.rowleydownload.co.uk/arm/documentation/gnu/gcc/_005f_005fatomic-Builtins.html
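for reference, the GCC/Clang __atomic builtins that the TBB code above bottoms out in can be used directly; a trivial sketch (variable and function names mine):

```cpp
// Direct use of the builtins __TBB_machine_atomic_store/load expand to.
long cell = 0;

void builtin_release_store(long v) {
    __atomic_store_n(&cell, v, __ATOMIC_RELEASE);    // memory_order_release
}

long builtin_acquire_load() {
    return __atomic_load_n(&cell, __ATOMIC_ACQUIRE); // memory_order_acquire
}
```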

ok, back to __TBB_release_consistency_helper. This is defined in many files, including tbb/include/tbb/machine/gcc_generic.h . There's an #if on the version of GCC. We'll look at the later version:

  1. else __TBB_GCC_VERSION >= 40700; use __atomic_* builtins available since gcc 4.7
  2. define __TBB_compiler_fence() __asm__ __volatile__("": : :"memory") Acquire and release fence intrinsics in GCC might miss compiler fence. Adding it at both sides of an intrinsic, as we do not know what reordering can be made.
  3. define __TBB_acquire_consistency_helper() __TBB_compiler_fence(); __atomic_thread_fence(__ATOMIC_ACQUIRE); __TBB_compiler_fence()
  4. define __TBB_release_consistency_helper() __TBB_compiler_fence(); __atomic_thread_fence(__ATOMIC_RELEASE); __TBB_compiler_fence()

So, yeah, again it's memory_order_release.

So, the use-case i most care about, parallel_reduce, appears to use memory_order_release and memory_order_acquire. I believe that's RCpc (not RCsc, not relaxed). The one memory order that i was hoping to leave out.

And it makes sense, because Relaxed doesn't sync at all, and SC is global; RCpc is local and yet provides some synchronization, so i can see how it could be useful.

---

so can i leave out relaxed? let's look at http://rsim.cs.illinois.edu/Pubs/msinclair-thesis.pdf again, section 3.3, 'Relaxed Atomic Use Cases and DRFrlx Model'.

let's see.. Unpaired Atomics Use Case :: Work Queue sounds useful. Here one thread is checking if another thread has put anything in its mailbox. Until there is something in the mailbox, there is no need to sync.

The given Commutative Atomics use case is an event counter, which i'm not sure i care about, although i still feel like it could be useful for parallel reduce. The Non-Ordering Atomics use case is similar to the previous one in some ways, although i guess it's different, and i guess i care about it.

Speculative Atomics uses sequence numbers and hence is either not usable in all situations or is not 100% deterministically safe (what if the sequence number wraps around...). So i don't really care for this case.

The quantum case may be a bit too far, as pointed out in http://altair.cs.oswego.edu/pipermail/memory-model-design/2018-July/000089.html and quoted above. Also, it requires significantly complicating the formal model.

So it looks like, if i don't need speculative or quantum, i can stick with DRFrlx Formal Definition (Version 2) on page 61 after the section on non-ordering.

So, looks like, unsurprisingly, Relaxed may be useful.

---

So i don't think i want to give up much of C/C++'s memory_consistency_model. So i have:

lw: non-atomic load
lw.a: atomic load (SC). C atomic_load(memory_order_seq_cst) (RCsc)
lw.a.rc: atomic load (release consistency). C atomic_load(memory_order_acquire) (RCpc)
lw.a.rlx: atomic load (relaxed). C atomic_load(memory_order_relaxed)
... and the matching stores ...
cas: C atomic_compare_exchange_weak(memory_order_seq_cst)
cas.rc: C atomic_compare_exchange_weak(memory_order_acq_rel)
cas.rlx: C atomic_compare_exchange_weak(memory_order_relaxed)
and, not sure if i need this next one, but possibly:
fence(memory_domain): C atomic_thread_fence(memory_order_seq_cst) (strengthened as per http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html , section 'Strengthen SC fences') with an extra memory_domain argument indicating which shared memory domain to sync

(that's 12 instructions in total)

---

actually https://en.wikipedia.org/wiki/Release_consistency claims that RCpc is sufficient for DRF? http://pages.cs.wisc.edu/~markhill/theses/sarita_adve.pdf seems to me to disagree but i'm not sure i'm understanding correctly; he says "two flavors of release consistency (RCsc and RCpc)" and "A variant of release consistency where special accesses are sequentially consistent is also proposed [GLL90]. This model is abbreviated as RCsc, while the model with processor consistent special accesses is abbreviated as RCpc. Gharachorloo et al. identify a set of software constraints for which a system that obeys RCsc appears sequentially consistent. Programs that obey these constraints are called properly labeled (PL) programs."

---

this thesis deals with giving a DRF-like result for RCpc (see the part about 'PLpc2'):

http://pages.cs.wisc.edu/~markhill/theses/sarita_adve.pdf

it'd be good but imo complicated. i guess DRFrlx probably shares the same problem.

in both cases, maybe what they are saying is that, if you analyze your program carefully, even when synchronizing instructions are not always SC, you can often prove that your program is still SC. And they provide some theorems to prove this in the common cases. But the theorems' premises are sufficiently complicated that i think unless there is compiler/linter/automated reasoning support, or unless there was a serious demand for correctness in their application, most programmers would not bother.

so these are more of an analysis tool for later than something we really need right now, although they are useful to inform our design choices.

---

i'm kind of running out of time for this issue.

i think we'd better stick to a subset of the C/C++ memory consistency model rather than try to roll our own.

i think the C/C++ memory model formalization is way too complicated but i don't have time to learn how to work on it myself, and so i can't tell if there are any design choices i could make here that would help simplify. And my project isn't big enough to be worth the experts' attention. If we get traction later then maybe we can get someone to come in and simplify everything.

as to which memory_orders, i think we need SC. And i think we don't want consume (or dependencies in general). And i think we do need some sort of local synchronization. So, in C/C++, that forces us to have seq_cst, acquire, release. And i don't see how Relaxed is that much more complicated; we already have to support non-atomic accesses, this is just like those but atomic.

And although the tooling and theories seem to think that fences are not worth the complication they bring, i don't really see how we can deal with free'ing a block of memory upon which ordinary (e.g. relaxed) accesses were performed in the past unless we use a fence.

Perhaps a fence is just like an acquire (SC) to a dummy location followed immediately by a release (SC). C/C++ SC fences don't seem to be that simple, because 'Strengthen SC fences' in [4] suggests that they are defined independently of that? We should put this in the documentation in case it's useful.

Wait, [5] says that their fix is to 'Model SC-fences as acquire/release atomic updates of a distinguished fence location.'. But these are the same people whose paper was the basis of 'Strengthen SC fences' in [6]. The RC11 paper says "Among these, Lahav et al. [17] proposed strengthening the semantics of SC fences in a different way from the way we do here, by treating them as read-modify-writes to a distinguished location. That strengthening, however, was considered in the restricted setting of only release/acquire accesses, and does not directly scale to the full set of C11 access modes. In fact, for the fragment containing only SC fences and release/acquire accesses, RC11-consistency is equivalent to RA-consistency that treats SC fences as RMWs to a distinguished location". So they are saying it's not as simple as that?
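for what it's worth, here's my sketch of what that encoding looks like in C++ terms (the distinguished cell and names are mine, and per the quote this is only claimed faithful in the release/acquire fragment, not full C11):

```cpp
#include <atomic>

std::atomic<int> fence_cell{0}; // the distinguished "fence location"

// Each 'SC fence' becomes an acq_rel RMW on the same cell. Because every
// fence hits one location, coherence puts all of them in a single total
// order, which is what makes SC fences globally ordered in this encoding.
void sc_fence_as_rmw() {
    fence_cell.fetch_add(1, std::memory_order_acq_rel);
}
```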

as for OOTA and UB, i think that we just say:

---

" In the case of C/C++11, Batty et al.[2011]established the correctness of a mapping to x86-TSO, while Batty et al.[2012] addressed the mapping to POWER and ARMv7. However, the correctness claims of Batty et al.[2012] weresubsequently found to be incorrect [Lahav et al.2017; Manerkar et al.2016], as they mishandledthe combination of sequentially consistent accesses with weaker accesses. Lahav et al.[2017]developed RC11, a repaired version of C/C++11, and established (by pen-and-paper proof) thecorrectness of the suggested compilation schemes to x86-TSO, POWER and ARMv7." -- http://plv.mpi-sws.org/imm/paper.pdf

---

this seems to be a method for computing whether a program with release/acquire satisfies a DRF theorem:

[7]

---

a comment near the bottom of https://lists.riscv.org/g/tech-memory-model/message/1123 appears to be saying that if you have RCpc lw.aq and sw.rl, and if you have a fence 'which orders all earlier release stores with all later acquire loads' then you can make those RCpc instructions into RCsc ones by adding that fence.

(note: "roach motel reordering" seems to be the term for when stuff outside of an acquire..release critical section moves into it. This can happen even with RCsc i thin. The other thing that can happen with RCpc that i think is prevented with RCsc is acquire..release..acquire..release and the second acquire moves before the first release)

hmm.. that only costs one opcode (for the fence) instead of 2 (for lw.aqsc and sw.rlsc). But more generally, is it better to have a fence or instructions for this? The fence can't help but order ALL releases before it against ALL acquires after it, whereas RCsc is more specific? But what happens if you have "sw.rl; lw.aqsc" anyways? Can the sw.rl move past the lw.aqsc because it is not SC, or does the 'sc' in the lw.aqsc block it? I dunno, but at least with the fence idea, i know i wouldn't even be asking.
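in C11 terms, the pattern under discussion might look like this sketch (mine; a seq_cst fence is the closest C11 analogue i know of to a fence ordering earlier releases against later acquires): a release store, then the fence, then an acquire load, so the load can no longer move before the store.

```cpp
#include <atomic>

std::atomic<int> a{0}, b{0};

// Without the fence, an RCpc model would allow the acquire load of 'b'
// to be reordered before the release store to 'a'; the fence forbids it.
int release_fence_acquire() {
    a.store(1, std::memory_order_release);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return b.load(std::memory_order_acquire);
}
```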

another question i have... if with an acquire 'All writes in other threads that release the same atomic variable are visible in the current thread', then i assume that applies to the same thread too, and that acquire..release..acquire..release can't overlap even with RCpc if both of the critical sections (that is, the 'acquire..release's) are acquiring/releasing the same memory location (and indeed, by coherence (i think), all writes to each single memory location are totally ordered anyway). So they're only overlapping if those are different variables. So, again, it seems that the difference between RCpc and RCsc is that RCsc requires global ordering: global over both threads and over memory locations.

---

btw https://lore.kernel.org/lkml/20180705183836.GA3175@andrea/ suggests that the term RCsc is not even well-defined!

also, multiple times i've seen someone say in the RISC-V WG that the performance difference between RCpc and RCsc is small. That goes against my intuition that RCsc is a global sync operation and RCpc (assuming NON-cumulativity) is local between the threads concerned. But maybe my intuition is wrong? Or maybe the 'small performance difference' is in the context of RISC-V where multi-copy atomicity (a global notion of ordering) is already assumed?

also, remember that everyone but me is thinking of up to like 50 CPUs, where i'm thinking of a million.

---

old text in boot_reference that i am about to remove and rewrite:

"

Formal memory consistency model

put this somewhere: Perhaps a fence is just like an acquire (SC) to a dummy location followed immediately by a release (SC). C/C++ SC fences don't seem to be that simple, because 'Strengthen SC fences' in [8]