here's some discussion on RCsc vs RCpc for RISC-V:
https://lists.riscv.org/g/tech-memory-model/message/1123
---
someone else noted (https://lists.riscv.org/g/tech-memory-model/message/1164) that, in general, you can have SC-annotated loads and stores that are not acquires and releases (and ARMv8 does have them).
---
OK i finished skimming through the tech-memory-model RISC-V list. It's a shame that the archives aren't publicly visible, it's a goldmine that i'm sure would be useful to memory model students and other professional memory model people of the future, but without public visibility they won't get Internet Archived and they'll probably get lost sometime.
Those guys are super-experts who can compose partial orders and relate that to operational microarchitectural details in their sleep. And yet a number of issues came up at every step of the way that they had to work out. Which makes me, if anything, even more hesitant to muck around with this stuff than i was.
anyway here's my current takeaway w/r/t what kind of model we need for Boot:
---
btw the FENCE instruction may be troublesome: not only does CppMem not support it, but ThreadSanitizer doesn't either, and doesn't plan to:
" > Using ifdef's is in principle possible, though improving the tool to better handle standalone fences has its own value I believe.
Not handling standalone acquire/release fences is somewhat intentional. The problem is that they have a global effect on thread synchronization. Namely, a release fence effectively turns all subsequent relaxed stores into release stores, and synchronization operations are costly to model as compared to relaxed atomic operations. An acquire fence is even worse because it turns all preceding relaxed loads into acquire loads, which means that we have to handle all relaxed loads as not-yet-materialized acquire loads, just in case the thread ever executes an acquire fence later.
Even if we implement handling of acquire/release fences in tsan as an optional feature, chances are that we won't be able to enable this mode on our projects because they are large in all respects (to put it mildly).
Combining memory ordering constraints with memory access also leads to more readable code. Which is usually a better trade-off for everything except the most performance-critical code (e.g. TBB).
But, yes, it's incomplete modelling of C/C++ memory model. We just happened to get away that far (checking 100+MLOC) without stand-alone fences.
> If not that, I think we would need to better understand what exactly cannot be handled well by the tool, in order to recognize these patterns in the code and find out how to change it.
As far as I can tell that's only stand-alone acquire/release fences. seq_cst fences should work as long as they are not used also as acquire/release fences, e.g. in the test.cpp above we use seq_cst fence but also annotate memory accesses with acquire/release as necessary. " -- [1]
also, Peter Sewell pointed me to section 6.3 of https://www.cs.kent.ac.uk/people/staff/mjb211/docs/toc.pdf , which shows (among other things) that the C memory model can be simplified if you don't use fences.
i have two use-cases for fences:
Perhaps there are other ways to do each of these. I should ask Sewell and Batty.
---
OpenCL has a (large) subset of C11 memory ops:
https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/atomicFunctions.html
https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/memory_order.html
they leave out memory_order_consume, but don't change too much else imo
---
apparently "(po U rf) acyclic" is the common way to try and rule out OOTA. eg [2]. Why don't ppl like that?
https://escholarship.org/content/qt2vm546k1/qt2vm546k1.pdf suggests that there are two main ways to rule out OOTA, enforcing dependency ordering, and enforcing load-store ordering. I guess i see why ppl don't want to do either of those.
dependency ordering is hard to compile and i guess if load-store reordering is so important to the RISC-V guys then we don't want to rule it out.
" Cost of Forbidding Load-Store Reordering: Unlike the dependency-preserving approach, forbidding load-store reordering to avoid out-of-thin-air behaviors only affects relaxed atomics in C/C++11. Hence, for example, the load-store-order-preserving approach should impose no overhead on the SPEC CPU2006 benchmarks because they do not use any C/C++ relaxed atomics. We believe that relaxed atomics will primarily appear in concurrent data structure code, while most other program code would not be affected since they would likely use other primitives that provide stronger semantics, e.g., locks and atomics with memory_order_seq_cst. "
ok, from the same link, apparently an OOTA example is:
| Thread 1 | Thread 2 |
| r1 = x; y = r1; | r2 = y; x = r2; |
where everything is memory_order_relaxed. In C/C++ apparently r1 = r2 = 42 is allowed.
Actually i don't see anything wrong here. My take is that it's the programmer's fault that a cycle is created. This program should be declared illegal.
i guess in a situation like OVMhigh or Java, where you don't want to forge pointers, you might worry about it though. but i think the 'must read from a value that was placed there after EVENT' restriction solves that.
---
if load-store reordering is so important to the RISC-V guys, and it only affects relaxed atomics in C/C++11, then i guess we do want relaxed atomics?
no, that doesn't follow; load-store reordering can be used for non-atomic accesses in any case.
---
http://altair.cs.oswego.edu/pipermail/memory-model-design/2018-July/000089.html is skeptical of cost/benefit ratio for quantum atomics:
" My intuition is also that quantum atomics are too weak as a general-purpose memory_order_relaxed replacement. We commonly write code that does an initial relaxed load of an atomic value (e.g. a Java lock-word), parses the result, and then proceeds accordingly. I don't think we really want to reason about what would happen if we read a random value, and about all the nonsensical control paths that might trigger. "
---
"Sequential consistency is known for its simplicity [26], and indeed, any C11 or OpenCL program using exclusively SC atomics would enjoy a simple interleaving semantics. However, when combined with the more relaxed memory orders that C11 and OpenCL also provide, the semantics of SC atomics becomes highly complex, and it is this complexity that we tackle in this paper."
https://arxiv.org/pdf/1503.07073.pdf
"Overhauling SC Atomics in C11 and OpenCL", Mark Batty (University of Kent, UK, m.j.batty@kent.ac.uk), Alastair F. Donaldson (Imperial College London, UK, alastair.donaldson@imperial.ac.uk), John Wickerson
---
https://arxiv.org/pdf/1503.07073.pdf provides some suggestions to simplify the formalization of the C/C++ memory consistency model
---
this paper mentions my question of how to deal with mallocing and deallocating atomics:
their issue is a little different and more insurmountable than mine, but they don't have a great answer for theirs.
---
i'm not sure anymore that it's a good idea to leave out RCpc:
C++ now has a function std::reduce, which requires that the reduction operator is both associative and commutative. Since this was a key motivation for my use of relaxed atomics, let's see which memory_order it uses.
as of a few months ago, it wasn't implemented yet in either GCC or Clang [3] , but it is in:
https://github.com/intel/parallelstl
in that, it seems to be implemented here:
the key part is probably near the top, starting at line 42:
    //! Task type used to combine the partial results of parallel_reduce.
    /** @ingroup algorithms */
    template<typename Body>
    class finish_reduce: public flag_task {
        //! Pointer to body, or NULL if the left child has not yet finished.
        bool has_right_zombie;
        const reduction_context my_context;
        Body* my_body;
        aligned_space<Body> zombie_space;
        finish_reduce( reduction_context context_ ) :
            has_right_zombie(false), // TODO: substitute by flag_task::child_stolen?
            my_context(context_),
            my_body(NULL)
        {
        }
        ~finish_reduce() {
            if( has_right_zombie )
                zombie_space.begin()->~Body();
        }
        task* execute() __TBB_override {
            if( has_right_zombie ) {
                // Right child was stolen.
                Body* s = zombie_space.begin();
                my_body->join( *s );
                // Body::join() won't be called if canceled. Defer destruction to destructor
            }
            if( my_context==left_child )
                itt_store_word_with_release( static_cast<finish_reduce*>(parent())->my_body, my_body );
            return NULL;
        }
        template<typename Range,typename Body_, typename Partitioner>
        friend class start_reduce;
    };

you can see one thing that looks like an atomic here, 'itt_store_word_with_release', which looks like it atomically stores the result of the reduction. Elsewhere in the file there is also 'itt_load_word_with_acquire', and there is also some sort of allocation stuff and task stuff which may be concurrency library calls. So let's find the implementation of 'itt_store_word_with_release' and see what memory_order it is. It appears to be from:
tbb/src/tbbmalloc/Customize.h https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/src/tbbmalloc/Customize.h
template <typename T>
inline void itt_store_word_with_release(T& dst, T src) {

so it's really FencedStore. FencedStore appears to be from
tbb/src/tbbmalloc/Synchronize.h https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/src/tbbmalloc/Synchronize.h
inline void FencedStore