proj-oot-ootAssemblyNotes26

here's some discussion on RCsc vs RCpc for RISC-V:

https://lists.riscv.org/g/tech-memory-model/message/1123

---

someone else noted https://lists.riscv.org/g/tech-memory-model/message/1164 that, in general, you can have SC-annotated loads and stores that are not acquires and releases (and ARMv8 has them).

---

OK i finished skimming through the tech-memory-model RISC-V list. It's a shame that the archives aren't publicly visible, it's a goldmine that i'm sure would be useful to memory model students and other professional memory model people of the future, but without public visibility they won't get Internet Archived and they'll probably get lost sometime.

Those guys are super-experts who can compose partial orders and relate that to operational microarchitectural details in their sleep. And yet a number of issues came up at every step of the way that they had to work out. Which makes me, if anything, even more hesitant to muck around with this stuff than i was.

anyway here's my current takeaway w/r/t what kind of model we need for Boot:

---

btw the FENCE instruction may be troublesome: not only does CppMem not support it, but ThreadSanitizer doesn't, and doesn't plan to:

" > Using ifdef's is in principle possible, though improving the tool to better handle standalone fences has its own value I believe.

Not handling standalone acquire/release fences is somewhat intentional. The problem is that they have global effect on thread synchronization. Namely, a release fence effectively turns all subsequent relaxed stores into release stores, and synchronization operations are costly to model as compared to relaxed atomic operations. An acquire fence is even worse because it turns all preceding relaxed loads into acquire loads, which means that when we handle all relaxed loads as not-yet-materialized acquire loads just in case the thread will ever execute an acquire fence later.

Even if we implement handling of acquire/release fences in tsan as an optional feature, chances are that we won't be able to enable this mode on our projects because they are large in all respects (to put it mildly).

Combining memory ordering constraints with memory access also leads to more readable code. Which is usually a better trade-off for everything except the most performance-critical code (e.g. TBB).

But, yes, it's incomplete modelling of C/C++ memory model. We just happened to get away that far (checking 100+MLOC) without stand-alone fences.

> If not that, I think we would need to better understand what exactly cannot be handled well by the tool, in order to recognize these patterns in the code and find out how to change it.

As far as I can tell that's only stand-alone acquire/release fences. seq_cst fences should work as long as they are not used also as acquire/release fences, e.g. in the test.cpp above we use seq_cst fence but also annotate memory accesses with acquire/release as necessary. " -- [1]
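to make the tradeoff the TSan devs are describing concrete, here's a minimal sketch (mine, not theirs) of the two styles: a release store with the ordering fused to the access, versus a relaxed store promoted by a standalone release fence. Both publish correctly; TSan only models the first well.

```cpp
#include <atomic>
#include <thread>

std::atomic<int> payload{0};
std::atomic<int> ready{0};

// Style 1: ordering fused with the access (what TSan models well).
void publish_fused() {
    payload.store(42, std::memory_order_relaxed);
    ready.store(1, std::memory_order_release);
}

// Style 2: standalone fence (what TSan declines to model): the release
// fence effectively turns the subsequent relaxed store into a release.
void publish_fenced() {
    payload.store(42, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(1, std::memory_order_relaxed);
}

// Either way, an acquire load of 'ready' that sees 1 also sees payload == 42.
int consume() {
    while (ready.load(std::memory_order_acquire) == 0) { }
    return payload.load(std::memory_order_relaxed);
}
```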

also, Peter Sewell pointed me to section 6.3 of https://www.cs.kent.ac.uk/people/staff/mjb211/docs/toc.pdf , which shows (among other things) that the C memory model can be simplified if you don't use fences.

i have two use-cases for fences:

Perhaps there are other ways to do each of these. I should ask Sewell and Batty.

---

OpenCL has a (large) subset of C11 memory ops:

https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/atomicFunctions.html

https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/memory_order.html

they leave out memory_order_consume, but don't change too much else imo

---

apparently "(po U rf) acyclic" is the common way to try to rule out OOTA, e.g. [2]. Why don't ppl like that?
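to make "(po U rf) acyclic" concrete, here's a toy check (encoding and names mine) over the candidate thin-air execution of the standard two-thread load-buffering example: each thread reads one variable and stores what it read to the other. Program order gives edges within each thread, reads-from gives edges across threads, and the speculative execution where each load reads the other thread's store forms exactly the cycle the axiom forbids.

```cpp
#include <functional>
#include <vector>

// Events of the candidate OOTA execution (r1 = r2 = 42):
//   0: T1 load x    1: T1 store y    2: T2 load y    3: T2 store x
// po: 0->1, 2->3    rf: 1->2 (store y feeds load y), 3->0 (store x feeds load x)
bool po_union_rf_has_cycle() {
    std::vector<std::vector<int>> adj{{1}, {2}, {3}, {0}};
    std::vector<int> state(4, 0); // 0 = unvisited, 1 = on DFS stack, 2 = done
    std::function<bool(int)> dfs = [&](int v) {
        state[v] = 1;
        for (int w : adj[v]) {
            if (state[w] == 1) return true;           // back edge: cycle
            if (state[w] == 0 && dfs(w)) return true;
        }
        state[v] = 2;
        return false;
    };
    for (int v = 0; v < 4; ++v)
        if (state[v] == 0 && dfs(v)) return true;
    return false;
}
```

a model that requires po U rf to be acyclic simply declares this execution inconsistent, which is why it rules out OOTA.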

https://escholarship.org/content/qt2vm546k1/qt2vm546k1.pdf suggests that there are two main ways to rule out OOTA, enforcing dependency ordering, and enforcing load-store ordering. I guess i see why ppl don't want to do either of those.

dependency ordering is hard to compile and i guess if load-store reordering is so important to the RISC-V guys then we don't want to rule it out.

" Cost of Forbidding Load-Store ReorderingUnlike? the dependency-preserving approach, forbidding load-store reordering to avoid out-of-thin-air behaviors only affects relaxed atomics in C/C++11. Hence, for example, theload-store-order-preserving approach should impose no overhead on the SPEC CPU2006benchmarks because they do not use any C/C++ relaxed atomics. We believe that re-laxed atomics will primarily appear in concurrent data structure code, while most otherprogram code would not be affected since they would likely use other primitives that pro-vide stronger semantics, e.g., locks and atomics withmemoryorderseqcst. "

ok, from the same link, apparently an OOTA example is:

Thread 1: r1 = x; y = r1;

Thread 2: r2 = y; x = r2;

where everything is memory_order_relaxed. In C/C++ apparently r1=r2=42 is allowed.
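here's the example written out in actual C++11 atomics, with a driver (names mine). No real implementation exhibits the thin-air outcome: starting from x = y = 0, every store writes a value that some load actually read, so in practice both loads return 0.

```cpp
#include <atomic>
#include <thread>
#include <utility>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() { r1 = x.load(std::memory_order_relaxed); y.store(r1, std::memory_order_relaxed); }
void thread2() { r2 = y.load(std::memory_order_relaxed); x.store(r2, std::memory_order_relaxed); }

// C/C++11 formally allows r1 == r2 == 42 here; no hardware produces it.
std::pair<int,int> run_litmus() {
    x.store(0); y.store(0);
    std::thread a(thread1), b(thread2);
    a.join(); b.join();
    return {r1, r2};
}
```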

Actually i don't see anything wrong here. My take is that it's the programmer's fault that a cycle is created. This program should be declared illegal.

i guess in a situation like OVMhigh or Java, where you don't want to forge pointers, you might worry about it though. but i think the 'must read from a value that was placed there after EVENT' restriction solves that.

---

if load-store reordering is so important to the RISC-V guys, and it only affects relaxed atomics in C/C++11, then i guess we do want relaxed atomics?

no, that doesn't follow; load-store reordering can be used for non-atomic accesses in any case.

---

http://altair.cs.oswego.edu/pipermail/memory-model-design/2018-July/000089.html is skeptical of cost/benefit ratio for quantum atomics:

" My intuition is also that quantum atomics are too weak for as a general purpose memory_order_relaxed replacement. We commonly write code that does an initial relaxed load of an atomic value (e.g a Java lock-word) parses the results, and then proceeds accordingly. I don't think we really want to reason about what would happen if we read a random value, and about all the nonsensical control paths that might trigger. "

---

"Sequential consistency is known for its simplicity [26], and in-deed, any C11 or OpenCL? program usingexclusivelySC atomicswould enjoy a simple interleaving semantics. However, when com-bined with the more relaxed memory orders that C11 and OpenCLalso? provide, the semantics of SC atomics becomes highly complex,and it is this complexity that we tackle in this paper

https://arxiv.org/pdf/1503.07073.pdf

Overhauling SC Atomics in C11 and OpenCL. Mark Batty (University of Kent, UK, m.j.batty@kent.ac.uk), Alastair F. Donaldson (Imperial College London, UK, alastair.donaldson@imperial.ac.uk), John Wickerson

---

https://arxiv.org/pdf/1503.07073.pdf provides some suggestions to simplify the formalization of the C/C++ memory consistency order model

---

this paper mentions my question of how to deal with mallocing and freeing atomics:

https://www.cs.kent.ac.uk/people/staff/mjb211/docs/the_problem_of_programming_language_concurrency_semantics.pdf

their issue is a little different and more insurmountable than mine, but they don't have a great answer for theirs.

---

i'm not sure anymore that it's a good idea to leave out RCpc:

C++ now has a function std::reduce, which requires that the reduction operator is both associative and commutative. Since this was a key motivation for my use of relaxed atomics, let's see which memory_order it uses.

as of a few months ago, it wasn't implemented yet in either GCC or Clang [3] , but it is in:

https://github.com/intel/parallelstl

in that, it seems to be implemented here:

https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/include/tbb/parallel_reduce.h

the key part is probably near the top, starting at line 42:

    //! Task type used to combine the partial results of parallel_reduce.
    /** @ingroup algorithms */
    template<typename Body>
    class finish_reduce: public flag_task {
        //! Pointer to body, or NULL if the left child has not yet finished.
        bool has_right_zombie;
        const reduction_context my_context;
        Body* my_body;
        aligned_space<Body> zombie_space;
        finish_reduce( reduction_context context_ ) :
            has_right_zombie(false), // TODO: substitute by flag_task::child_stolen?
            my_context(context_),
            my_body(NULL)
        {
        }
        ~finish_reduce() {
            if( has_right_zombie )
                zombie_space.begin()->~Body();
        }
        task* execute() __TBB_override {
            if( has_right_zombie ) {
                // Right child was stolen.
                Body* s = zombie_space.begin();
                my_body->join( *s );
                // Body::join() won't be called if canceled. Defer destruction to destructor
            }
            if( my_context==left_child )
                itt_store_word_with_release( static_cast<finish_reduce*>(parent())->my_body, my_body );
            return NULL;
        }
        template<typename Range,typename Body_, typename Partitioner>
        friend class start_reduce;
    };

you can see one thing that looks like an atomic here, 'itt_store_word_with_release', which looks like it atomically stores the result of the reduction. Elsewhere in the file there is also 'itt_load_word_with_acquire', and there is also some sort of allocation stuff and task stuff which may be concurrency library calls. So let's find the implementation of 'itt_store_word_with_release' and see what memory_order it is. It appears to be from:

 tbb/src/tbbmalloc/Customize.h https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/src/tbbmalloc/Customize.h
    template <typename T>
    inline void itt_store_word_with_release(T& dst, T src) {
    #if TBB_USE_THREADING_TOOLS
        call_itt_notify(releasing, &dst);
    #endif /* TBB_USE_THREADING_TOOLS */
        FencedStore(*(intptr_t*)&dst, src);
    }

so it's really FencedStore. FencedStore appears to be from

 tbb/src/tbbmalloc/Synchronize.h https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/src/tbbmalloc/Synchronize.h

    inline void FencedStore( volatile intptr_t &location, intptr_t value ) {
        __TBB_store_with_release(location, value);
    }

so it's really __TBB_store_with_release. __TBB_store_with_release appears to be from

 tbb/include/tbb/tbb_machine.h https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/include/tbb/tbb_machine.h

    template<typename T, typename V>
    inline void __TBB_store_with_release(volatile T& location, V value) {
        machine_load_store<T,sizeof(T)>::store_with_release( location, T(value) );
    }

so it's really machine_load_store::store_with_release. This appears to be defined in the same file, but also in other files. In this file, we have a few defns:

    #if __TBB_USE_GENERIC_HALF_FENCED_LOAD_STORE
    /* Fenced operations use volatile qualifier to prevent compiler from optimizing
       them out, and on architectures with weak memory ordering to induce compiler
       to generate code with appropriate acquire/release semantics. On architectures
       like IA32, Intel64 (and likely Sparc TSO) volatile has no effect on code gen,
       and consistency helpers serve as a compiler fence (the latter being true for
       IA64/gcc as well to fix a bug in some gcc versions). This code assumes that
       the generated instructions will operate atomically, which typically requires
       a type that can be moved in a single instruction, cooperation from the
       compiler for effective use of such an instruction, and appropriate alignment
       of the data. */
    template <typename T, size_t S>
    struct machine_load_store {
        static T load_with_acquire ( const volatile T& location ) {
            T to_return = location;
            __TBB_acquire_consistency_helper();
            return to_return;
        }
        static void store_with_release ( volatile T &location, T value ) {
            __TBB_release_consistency_helper();
            location = value;
        }
    };

    // in general, plain load and store of 32bit compiler is not atomic for 64bit types
    #if __TBB_WORDSIZE==4 && __TBB_64BIT_ATOMICS
    template <typename T>
    struct machine_load_store<T,8> {
        static T load_with_acquire ( const volatile T& location ) {
            return (T)__TBB_machine_load8( (const volatile void*)&location );
        }
        static void store_with_release ( volatile T& location, T value ) {
            __TBB_machine_store8( (volatile void*)&location, (int64_t)value );
        }
    };
    #endif /* __TBB_WORDSIZE==4 && __TBB_64BIT_ATOMICS */
    #endif /* __TBB_USE_GENERIC_HALF_FENCED_LOAD_STORE */

what is __TBB_release_consistency_helper? But first, one of the other files that defines this is

 tbb/include/tbb/machine/gcc_generic.h https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/include/tbb/machine/gcc_generic.h

    template <typename T, size_t S>
    struct machine_load_store {
        static T load_with_acquire ( const volatile T& location ) {
            return __TBB_machine_atomic_load<T, __ATOMIC_ACQUIRE>(location);
        }
        static void store_with_release ( volatile T &location, T value ) {
            __TBB_machine_atomic_store<T, __ATOMIC_RELEASE>(location, value);
        }
    };

aha! this appears to be from GCC's atomics, which are modeled after C/C++11, and this appears to be the memory_order_release: http://www.rowleydownload.co.uk/arm/documentation/gnu/gcc/_005f_005fatomic-Builtins.html
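for reference, the GCC/Clang __atomic builtins that the TBB code above bottoms out in can be used directly; a trivial sketch (variable and function names mine):

```cpp
// Direct use of the builtins __TBB_machine_atomic_store/load expand to.
long cell = 0;

void builtin_release_store(long v) {
    __atomic_store_n(&cell, v, __ATOMIC_RELEASE);    // memory_order_release
}

long builtin_acquire_load() {
    return __atomic_load_n(&cell, __ATOMIC_ACQUIRE); // memory_order_acquire
}
```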

ok, back to __TBB_release_consistency_helper. This is defined in many files, including tbb/include/tbb/machine/gcc_generic.h . There's an #if on the version of GCC. We'll look at the later version:

  1. else __TBB_GCC_VERSION >= 40700; use __atomic_* builtins available since gcc 4.7
  2. define __TBB_compiler_fence() __asm__ __volatile__("": : :"memory") Acquire and release fence intrinsics in GCC might miss compiler fence. Adding it at both sides of an intrinsic, as we do not know what reordering can be made.
  3. define __TBB_acquire_consistency_helper() __TBB_compiler_fence(); __atomic_thread_fence(__ATOMIC_ACQUIRE); __TBB_compiler_fence()
  4. define __TBB_release_consistency_helper() __TBB_compiler_fence(); __atomic_thread_fence(__ATOMIC_RELEASE); __TBB_compiler_fence()

So, yeah, again it's memory_order_release.

So, the use-case i most care about, parallel_reduce, appears to use memory_order_release and memory_order_acquire. I believe that's RCpc (not RCsc, not relaxed). The one memory order that i was hoping to leave out.

And it makes sense, because Relaxed doesn't sync at all, and SC is global; RCpc is local and yet provides some synchronization, so i can see how it could be useful.

---

so can i leave out relaxed? let's look at http://rsim.cs.illinois.edu/Pubs/msinclair-thesis.pdf again, section 3.3, 'Relaxed Atomic Use Cases and DRFrlx Model'.

let's see.. Unpaired Atomics Use Case :: Work Queue sounds useful. Here one thread is checking if another thread has put anything in its mailbox. Until there is something in the mailbox, there is no need to sync.

The given Commutative Atomics use case is an event counter, which i'm not sure i care about, although i still feel like it could be useful for parallel reduce. The Non-Ordering Atomics use case is similar to the previous one in some ways, although i guess it's different, and i guess i care about it.

Speculative Atomics uses sequence numbers and hence is either not usable in all situations or is not 100% deterministically safe (what if the sequence number wraps around...). So i don't really care for this case.

The quantum case may be a bit too far, as pointed out in http://altair.cs.oswego.edu/pipermail/memory-model-design/2018-July/000089.html and quoted above. Also, it requires significantly complicating the formal model.

So it looks like, if i don't need speculative or quantum, i can stick with DRFrlx Formal Definition (Version 2) on page 61 after the section on non-ordering.

So, looks like, unsurprisingly, Relaxed may be useful.

---

So i don't think i want to give up much of C/C++'s memory_consistency_model. So i have:

lw: non-atomic load
lw.a: atomic load (SC). C atomic_load(memory_order_seq_cst) (RCsc)
lw.a.rc: atomic load (release consistency). C atomic_load(memory_order_acquire) (RCpc)
lw.a.rlx: atomic load (relaxed). C atomic_load(memory_order_relaxed)
... and the matching stores ...
cas: C atomic_compare_exchange_weak(memory_order_seq_cst)
cas.rc: C atomic_compare_exchange_weak(memory_order_acq_rel)
cas.rlx: C atomic_compare_exchange_weak(memory_order_relaxed)
and, not sure if i need this next one, but possibly:
fence(memory_domain): C atomic_thread_fence(memory_order_seq_cst) (strengthened as per http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html , section 'Strengthen SC fences') with an extra memory_domain argument indicating which shared memory domain to sync

(that's 12 instructions in total)

---

actually https://en.wikipedia.org/wiki/Release_consistency claims that RCpc is sufficient for DRF? http://pages.cs.wisc.edu/~markhill/theses/sarita_adve.pdf seems to me to disagree but i'm not sure i'm understanding correctly; he says "two flavors of release consistency (RCsc and RCpc)" and "A variant of release consistency where special accesses are sequentially consistent is also proposed [GLL90]. This model is abbreviated as RCsc, while the model with processor consistent special accesses is abbreviated as RCpc. Gharachorloo et al. identify a set of software constraints for which a system that obeys RCsc appears sequentially consistent. Programs that obey these constraints are called properly labeled (PL) programs."

---

this thesis deals with giving a DRF-like result for RCpc (see the part about 'PLpc2'):

http://pages.cs.wisc.edu/~markhill/theses/sarita_adve.pdf

it'd be good but imo complicated. i guess DRFrlx probably shares the same problem.

in both cases, maybe what they are saying is that, if you analyze your program carefully, even when synchronizing instructions are not always SC, you can often prove that your program is still SC. And they provide some theorems to prove this in the common cases. But the theorems' premises are sufficiently complicated that i think unless there is compiler/linter/automated reasoning support, or unless there was a serious demand for correctness in their application, most programmers would not bother.

so these are more of an analysis tool for later than something we really need right now, although they are useful to inform our design choices.

---

i'm kind of running out of time for this issue.

i think we'd better stick to a subset of the C/C++ memory consistency model rather than try to roll our own.

i think the C/C++ memory model formalization is way too complicated but i don't have time to learn how to work on it myself, and so i can't tell if there are any design choices i could make here that would help simplify. And my project isn't big enough to be worth the experts' attention. If we get traction later then maybe we can get someone to come in and simplify everything.

as to which memory_orders, i think we need SC. And i think we don't want consume (or dependencies in general). And i think we do need some sort of local synchronization. So, in C/C++, that forces us to have seq_cst, acquire, release. And i don't see how Relaxed is that much more complicated; we already have to support non-atomic accesses, this is just like those but atomic.

And although the tooling and theories seem to think that fences are not worth the complication they bring, i don't really see how we can deal with free'ing a block of memory upon which ordinary (e.g. relaxed) accesses were performed in the past unless we use a fence.

Perhaps a fence is just like an acquire (SC) to a dummy location followed immediately by a release (SC). C/C++ SC fences don't seem to be that simple, because 'Strengthen SC fences' in [4] suggests that they are defined independently of that? We should put this in the documentation in case it's useful.

Wait, [5] says that their fix is to 'Model SC-fences as acquire/release atomic updates of a distinguished fence location.'. But these are the same people whose paper was the basis of 'Strengthen SC fences' in [6]. The RC11 paper says "Among these, Lahav et al. [17] proposed strengthening the semantics of SC fences in a different way from the way we do here, by treating them as read-modify-writes to a distinguished location. That strengthening, however, was considered in the restricted setting of only release/acquire accesses, and does not directly scale to the full set of C11 access modes. In fact, for the fragment containing only SC fences and release/acquire accesses, RC11-consistency is equivalent to RA-consistency that treats SC fences as RMWs to a distinguished location". So they are saying it's not as simple as that?
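for what it's worth, here's my sketch of what that encoding looks like in C++ terms (the distinguished cell and names are mine, and per the quote this is only claimed faithful in the release/acquire fragment, not full C11):

```cpp
#include <atomic>

std::atomic<int> fence_cell{0}; // the distinguished "fence location"

// Each 'SC fence' becomes an acq_rel RMW on the same cell. Because every
// fence hits one location, coherence puts all of them in a single total
// order, which is what makes SC fences globally ordered in this encoding.
void sc_fence_as_rmw() {
    fence_cell.fetch_add(1, std::memory_order_acq_rel);
}
```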

as for OOTA and UB, i think that we just say:

---

" In the case of C/C++11, Batty et al.[2011]established the correctness of a mapping to x86-TSO, while Batty et al.[2012] addressed the mapping to POWER and ARMv7. However, the correctness claims of Batty et al.[2012] weresubsequently found to be incorrect [Lahav et al.2017; Manerkar et al.2016], as they mishandledthe combination of sequentially consistent accesses with weaker accesses. Lahav et al.[2017]developed RC11, a repaired version of C/C++11, and established (by pen-and-paper proof) thecorrectness of the suggested compilation schemes to x86-TSO, POWER and ARMv7." -- http://plv.mpi-sws.org/imm/paper.pdf

---

this seems to be a method for computing whether a program with release/acquire satisfies a DRF theorem:

[7]

---

a comment near the bottom of https://lists.riscv.org/g/tech-memory-model/message/1123 appears to be saying that if you have RCpc lw.aq and sw.rl, and if you have a fence 'which orders all earlier release stores with all later acquire loads' then you can make those RCpc instructions into RCsc ones by adding that fence.

(note: "roach motel reordering" seems to be the term for when stuff outside of an acquire..release critical section moves into it. This can happen even with RCsc i thin. The other thing that can happen with RCpc that i think is prevented with RCsc is acquire..release..acquire..release and the second acquire moves before the first release)

hmm.. that only costs one opcode (for the fence) instead of 2 (for lw.aqsc and sw.rlsc). But more generally, is it better to have a fence or instructions for this? The fence can't help but order ALL releases before it against ALL acquires after it, whereas RCsc is more specific? But what happens if you have "sw.rl; lw.aqsc" anyways? Can the sw.rl move past the lw.aqsc because it is not SC, or does the 'sc' in the lw.aqsc block it? I dunno, but at least with the fence idea, i know i wouldn't even be asking.
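in C11 terms, the pattern under discussion might look like this sketch (mine; a seq_cst fence is the closest C11 analogue i know of to a fence ordering earlier releases against later acquires): a release store, then the fence, then an acquire load, so the load can no longer move before the store.

```cpp
#include <atomic>

std::atomic<int> a{0}, b{0};

// Without the fence, an RCpc model would allow the acquire load of 'b'
// to be reordered before the release store to 'a'; the fence forbids it.
int release_fence_acquire() {
    a.store(1, std::memory_order_release);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return b.load(std::memory_order_acquire);
}
```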

another question i have... if with an acquire 'All writes in other threads that release the same atomic variable are visible in the current thread', then i assume that applies to the same thread too, and that acquire..release..acquire..release can't overlap even with RCpc if both of the critical sections (that is, the 'acquire..release's) are acquiring/releasing the same memory location (and indeed, by coherence (i think), all writes to each single memory location are totally ordered anyway). So they're only overlapping if those are different variables. So, again, it seems that the difference between RCpc and RCsc is that RCsc requires global ordering: global over both threads and over memory locations.

---

btw https://lore.kernel.org/lkml/20180705183836.GA3175@andrea/ suggests that the term RCsc is not even well-defined!

also, multiple times i've seen someone say in the RISC-V WG that the performance difference between RCpc and RCsc is small. That goes against my intuition that RCsc is a global sync operation and RCpc (assuming NON-cumulativity) is local between the threads concerned. But maybe my intuition is wrong? Or maybe the 'small performance difference' is in the context of RISC-V where multi-copy atomicity (a global notion of ordering) is already assumed?

also, remember that everyone but me is thinking of up to like 50 CPUs, where i'm thinking of a million.

---

old text in boot_reference that i am about to remove and rewrite:

"

Formal memory consistency model

put this somewhere: Perhaps a fence is just like an acquire (SC) to a dummy location followed immediately by a release (SC). C/C++ SC fences don't seem to be that simple, because 'Strengthen SC fences' in [8]