Bayle Shanks's website: proj-oot-ootAssemblyNotes26

here's some discussion on RCsc vs RCpc for RISC-V:

https://lists.riscv.org/g/tech-memory-model/message/1123

---

someone else noted https://lists.riscv.org/g/tech-memory-model/message/1164 that in general, you can have (and ARMv8 does) SC-annotated loads and stores that are not acquires and releases.

---

OK i finished skimming through the tech-memory-model RISC-V list. It's a shame that the archives aren't publicly visible, it's a goldmine that i'm sure would be useful to memory model students and other professional memory model people of the future, but without public visibility they won't get Internet Archived and they'll probably get lost sometime.

Those guys are super-experts who can compose partial orders and relate that to operational microarchitectural details in their sleep. And yet a number of issues came up at every step of the way that they had to work out. Which makes me, if anything, even more hesitant to muck around with this stuff than i was.

anyway here's my current takeaway w/r/t what kind of model we need for Boot:

i don't completely understand the discussion on multi-copy-atomicity. A lot of people seemed to be strongly against it early on and then when the proposal for scopes, which i don't fully understand yet, came up everyone was then like, 'oh okay in that case it's fine'. The scopes proposal said that every memory ordering command would have a scope, but i don't see that in the final spec, so are those people still happy? They seem to be. Also, apparently load-store reordering takes care of a lot of the demands for further weakness, because it allows you to transmit memory operations from processors to memory storage over a fabric with variable latency. Anyhow, the upshot for me is that my worry that the RVWMO model is not weak enough for whatever i want to do, b/c of multi-copy atomicity may not be true. Two theory teams argued very forcefully that RVWMO, with its load-store reordering, was already too weak and that the memory model could be a lot simpler if they'd only give that up. Then again, what i want to do (truly massive parallelism) is rather exotic so i might need more weakness than others.
the idea that RVWMO can't be locally compiled to weak C++ (as i think i showed (modulo the possibility of fences in the translation, which CppMem? can't test; but i doubt they will change the outcome)) with weak memory orders means that i probably don't think RVWMO should be our memory model, at least not the sole one.
So, combining the trouble compiling RVWMO to weak C++, and the fear that RVWMO isn't weak enough, therefore i guess i'm deciding that it shouldn't be the sole memory model. So the next contender is C++20.
one thing i learned from the discussion is that some think that RVWMO is 'a lot like ARMv8'. I don't understand the details enough to make my own judgement but presumably RVWMO was created slightly later and with at least a little input from some of the people who worked on ARMv8 (although otoh a lot of RVWMO had been decided before the people at RISC-V were informed about ARMv8, so maybe not), and at least with less concern about backwards-compatibility.
so the remaining big question is, should we add a RVWMO memory_order to C++20's memory_orders?
as before, the pro is that RVWMO is well-thought out, recent, fills a useful niche, and qualitatively different from C++20's other memory_orders. The con is (a) the simple complexity of just having more 'stuff' in our language, and (b) now we have to define how C++20's memory consistency model and RISC-V's memory consistency model interoperate.
my sense of the role of RVWMO is the same as before, but i have even more evidence for this now. RVWMO is balancing formal simplicity with what chip-builders need, for moderately parallel systems. It eschews stuff (like non-multi-copy-atomicity and non-dependency-ordering) that is found not to be practically useful, in order to make the formal model simpler. One note however is that there are things that are useful to low-level models, like ordering dependencies, which are not so great for HLL models; so the idea that RVWMO might be good for a portable VM just because it's good for hardware needs to be examined critically (and in fact just now Peter Sewell replied to my list emails with a suggestion that it may not be useful, at least for the goal of "transforming C++ code in order to simplify the parts C/C++ concurrency model that you have deal with").
- And i think it succeeds; i feel like i sort of understand much of the RISC-V memory consistency ordering spec already, and that i could understand the rest with a few days of study if i needed to (surely this is an illusion of competence, but compared to the C++ memory consistency model spec there's a big difference; i've skimmed both specs, and right now i feel like i could understand the C++ memory consistency model eventually but it would take a long, long time; when i have an unanswered question with RISC-V's memory consistency model spec i feel like i can dig in right now and at least attempt to answer it myself, but when i have an unanswered question with C++s's i just think it's hopeless unless i can budget a few days to learn more about the C++ model first).
so, if the RVWMO model is so good, it's a shame it doesn't compile to C++ in a reasonable way, but otoh, the fact that it's not easily reducible to C++'s memory orders also means that it's different enough to be useful.
being the way i am, i'm leaning towards adding in RVWMO too. But then i'd have to think about how it relates to C++. Also, is it okay for me to just muck with it? for example, i could just delete RVWMO rules 9-11 (syntactic dependencies). My guess is that that wouldn't have any bad effects on global properties of the memory model, but of course any change like that means the tooling (e.g. RMEM) doesn't quite match up. Similarly, mixing the RVWMO and C++ memory models at all will prevent users from completely relying on either RMEM or CppMem?.
i should also list some of the 'features'/design choices of the RVWMO that people seemed to think were important:
- multi-copy-atomicity
- load-store reordering
- dependency ordering
- RCsc release/acquire

---

btw the FENCE instruction may be troublesome, not only does CppMem? not support it, but ThreadSanitizer? doesn't and doesn't plan to:

" > Using ifdef's is in principle possible, though improving the tool to better handle standalone fences has its own value I believe.

Not handling standalone acquire/release fences is somewhat intentional. The problem is that they have global effect on thread synchronization. Namely, a release fence effectively turns all subsequent relaxed stores into release stores, and synchronization operations are costly to model as compared to relaxed atomic operations. An acquire fence is even worse because it turns all preceding relaxed loads into acquire loads, which means that when we handle all relaxed loads as not-yet-materialized acquire loads just in case the thread will ever execute an acquire fence later.

Even if we implement handling of acquire/release fences in tsan as an optional feature, chances are that we won't be able to enable this mode on our projects because they are large in all respects (to put it mildly).

Combining memory ordering constraints with memory access also leads to more readable code. Which is usually a better trade-off for everything except the most performance-critical code (e.g. TBB).

But, yes, it's incomplete modelling of C/C++ memory model. We just happened to get away that far (checking 100+MLOC) without stand-alone fences.

> If not that, I think we would need to better understand what exactly cannot be handled well by the tool, in order to recognize these patterns in the code and find out how to change it.

As far as I can tell that's only stand-alone acquire/release fences. seq_cst fences should work as long as they are not used also as acquire/release fences, e.g. in the test.cpp above we use seq_cst fence but also annotate memory accesses with acquire/release as necessary. " -- [1]

also, Peter Sewell pointed me to section 6.3 of https://www.cs.kent.ac.uk/people/staff/mjb211/docs/toc.pdf , which shows (among other things) that the C memory model can be simplified if you don't use fences.

i have two use-cases for fences:

a mechanism that a compiler can use to blindly make code SC by adding fences in between memory accesses
before you free a block of allocated memory, you need to guarantee that there are no delayed threads which will access it

Perhaps there are other ways to do each of these. I should ask Sewell and Batty.

---

OpenCL? has a (large) subset of C11 memory ops:

https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/atomicFunctions.html

https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/memory_order.html

they leave out memory_order_consume, but don't change too much else imo

---

apparently "(po U rf) acyclic" is the common way to try and rule out OOTA. eg [2]. Why don't ppl like that?

https://escholarship.org/content/qt2vm546k1/qt2vm546k1.pdf suggests that there are two main ways to rule out OOTA, enforcing dependency ordering, and enforcing load-store ordering. I guess i see why ppl don't want to do either of those.

dependency ordering is hard to compile and i guess if load-store reordering is so important to the RISC-V guys then we don't want to rule it out.

" Cost of Forbidding Load-Store ReorderingUnlike? the dependency-preserving approach, forbidding load-store reordering to avoid out-of-thin-air behaviors only affects relaxed atomics in C/C++11. Hence, for example, theload-store-order-preserving approach should impose no overhead on the SPEC CPU2006benchmarks because they do not use any C/C++ relaxed atomics. We believe that re-laxed atomics will primarily appear in concurrent data structure code, while most otherprogram code would not be affected since they would likely use other primitives that pro-vide stronger semantics, e.g., locks and atomics withmemoryorderseqcst. "

ok, from the same link, apparently an OOTA example is:

Thread 1

r1 = x; y = r1;

Thread 2
r2 = y; x = r2;

where everything is memory_order_relaxed. In C/C++ apparently r1=r2=42 is allowed.

Actually i don't see anything wrong here. My take is that it's the programmer's fault that a cycle is created. This proram should be declared illegal.

i guess in a situation like OVMhigh or Java, where you don't want to forge pointers, you might worry about it though. but i think the 'must read from a value that was placed there after EVENT' restriction solves that.

---

if load-store reordering is so important to the RISC-V guys, and it only affects relaxed atomics in C/C++11, then i guess we do want relaxed atomics?

no, that doesn't follow; load-store reordering can be used for non-atomic accesses in any case.

---

http://altair.cs.oswego.edu/pipermail/memory-model-design/2018-July/000089.html is skeptical of cost/benefit ratio for quantum atomics:

" My intuition is also that quantum atomics are too weak for as a general purpose memory_order_relaxed replacement. We commonly write code that does an initial relaxed load of an atomic value (e.g a Java lock-word) parses the results, and then proceeds accordingly. I don't think we really want to reason about what would happen if we read a random value, and about all the nonsensical control paths that might trigger. "

---

"Sequential consistency is known for its simplicity [26], and in-deed, any C11 or OpenCL? program usingexclusivelySC atomicswould enjoy a simple interleaving semantics. However, when com-bined with the more relaxed memory orders that C11 and OpenCLalso? provide, the semantics of SC atomics becomes highly complex,and it is this complexity that we tackle in this paper

https://arxiv.org/pdf/1503.07073.pdf

Overhauling SC Atomics in C11 and OpenCLMark? BattyUniversity? of Kent, UKm.j.batty@kent.ac.ukAlastair F. DonaldsonImperial? College London, UKalastair.donaldson@imperial.ac.ukJohn Wickerson

---

https://arxiv.org/pdf/1503.07073.pdf provides some suggestions to simplify the formalization of the C/C++ memory consistency order model

---

this paper mentions my question of how to deal with malloc and demallocing atomics:

https://www.cs.kent.ac.uk/people/staff/mjb211/docs/the_problem_of_programming_language_concurrency_semantics.pdf

their issue is a little different and more insurmountable than mine, but they don't have a great answer for theirs.

---

i'm not sure anymore that it's a good idea to leave out RCpc:

C++ now has a function std::reduce, which requires that the reduction operator is both associative and commutative. Since this was a key motivation for my use of relaxed atomics, let's see which memory_order it uses.

as of a few months ago, it wasn't implemented yet in either GCC or Clang [3] , but it is in:

https://github.com/intel/parallelstl

in that, it seems to be implemented here:

https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/include/tbb/parallel_reduce.h

the key part is probably near the top, starting at line 42:

    //! Task type used to combine the partial results of parallel_reduce.
    /** @ingroup algorithms */
    template<typename Body>
    class finish_reduce: public flag_task {
        //! Pointer to body, or NULL if the left child has not yet finished.
        bool has_right_zombie;
        const reduction_context my_context;
        Body* my_body;
        aligned_space<Body> zombie_space;
        finish_reduce( reduction_context context_ ) :
            has_right_zombie(false), // TODO: substitute by flag_task::child_stolen?
            my_context(context_),
            my_body(NULL)
        {
        }
        ~finish_reduce() {
            if( has_right_zombie )
                zombie_space.begin()->~Body();
        }
        task* execute() __TBB_override {
            if( has_right_zombie ) {
                // Right child was stolen.
                Body* s = zombie_space.begin();
                my_body->join( *s );
                // Body::join() won't be called if canceled. Defer destruction to destructor
            }
            if( my_context==left_child )
                itt_store_word_with_release( static_cast<finish_reduce*>(parent())->my_body, my_body );
            return NULL;
        }
        template<typename Range,typename Body_, typename Partitioner>
        friend class start_reduce;
    };

you can see one thing that looks like an atomic here, 'itt_store_word_with_release', which looks like it atomically stores the result of the reduction. Elsewhere in the file there is also 'itt_load_word_with_acquire', and there is also some sort of allocation stuff and task stuff which may be concurrency library calls. So let's find the implementation of 'itt_store_word_with_release' and see what memory_order it is. It appears to be from:

 tbb/src/tbbmalloc/Customize.h https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/src/tbbmalloc/Customize.h

        template <typename T>
        inline void itt_store_word_with_release(T& dst, T src) {

if TBB_USE_THREADING_TOOLS call_itt_notify(releasing, &dst);
endif TBB_USE_THREADING_TOOLS FencedStore?(*(intptr_t*)&dst, src); }

so it's really FencedStore?. FencedStore? appears to be from

 tbb/src/tbbmalloc/Synchronize.h https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/src/tbbmalloc/Synchronize.h

inline void FencedStore?( volatile intptr_t &location, intptr_t value ) { __TBB_store_with_release(location, value); }

so it's really __TBB_store_with_release. __TBB_store_with_release appears to be from

 tbb/include/tbb/tbb_machine.h https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/include/tbb/tbb_machine.h

template<typename T, typename V> inline void __TBB_store_with_release(volatile T& location, V value) { machine_load_store<T,sizeof(T)>::store_with_release( location, T(value) ); }

so it's really machine_load_store::store_with_release. This appears to be defined in the same file, but also in other files. In this file, we have a few defns:

if __TBB_USE_GENERIC_HALF_FENCED_LOAD_STORE / Fenced operations use volatile qualifier to prevent compiler from optimizing them out, and on architectures with weak memory ordering to induce compiler to generate code with appropriate acquire/release semantics. On architectures like IA32, Intel64 (and likely Sparc TSO) volatile has no effect on code gen, and consistency helpers serve as a compiler fence (the latter being true for IA64/gcc as well to fix a bug in some gcc versions). This code assumes that the generated instructions will operate atomically, which typically requires a type that can be moved in a single instruction, cooperation from the compiler for effective use of such an instruction, and appropriate alignment of the data. / template <typename T, size_t S> struct machine_load_store { static T load_with_acquire ( const volatile T& location ) { T to_return = location; __TBB_acquire_consistency_helper(); return to_return; } static void store_with_release ( volatile T &location, T value ) { __TBB_release_consistency_helper(); location = value; } };

in general, plain load and store of 32bit compiler is not atomic for 64bit types

if __TBB_WORDSIZE==4 && __TBB_64BIT_ATOMICS template <typename T> struct machine_load_store<T,8> { static T load_with_acquire ( const volatile T& location ) { return (T)__TBB_machine_load8( (const volatile void*)&location ); } static void store_with_release ( volatile T& location, T value ) { __TBB_machine_store8( (volatile void*)&location, (int64_t)value ); } };
endif /* __TBB_WORDSIZE==4 && __TBB_64BIT_ATOMICS */
endif /* __TBB_USE_GENERIC_HALF_FENCED_LOAD_STORE */

what is __TBB_release_consistency_helper? But first, one of the other files that defines this is

 tbb/include/tbb/machine/gcc_generic.h https://github.com/intel/tbb/blob/cc2c04e2f5363fb8b34c10718ce406814810d1e6/include/tbb/machine/gcc_generic.h

template <typename T, size_t S> struct machine_load_store { static T load_with_acquire ( const volatile T& location ) { return __TBB_machine_atomic_load<T, __ATOMIC_ACQUIRE>(location); } static void store_with_release ( volatile T &location, T value ) { __TBB_machine_atomic_store<T, __ATOMIC_RELEASE>(location, value); } };

aha! this appears to be from GCC's atomics, which are modeled after C/C++11, and this appears to be the memory_order_release: http://www.rowleydownload.co.uk/arm/documentation/gnu/gcc/_005f_005fatomic-Builtins.html

ok, back to __TBB_release_consistency_helper. This is defined in many files, including tbb/include/tbb/machine/gcc_generic.h . There's an #if on the version of GCC. We'll look at the later version:

else __TBB_GCC_VERSION >= 40700; use __atomic_* builtins available since gcc 4.7
define __TBB_compiler_fence() __asm__ __volatile__("": : :"memory") Acquire and release fence intrinsics in GCC might miss compiler fence. Adding it at both sides of an intrinsic, as we do not know what reordering can be made.
define __TBB_acquire_consistency_helper() __TBB_compiler_fence(); __atomic_thread_fence(__ATOMIC_ACQUIRE); __TBB_compiler_fence()
define __TBB_release_consistency_helper() __TBB_compiler_fence(); __atomic_thread_fence(__ATOMIC_RELEASE); __TBB_compiler_fence()

So, yeah, again it's memory_order_release.

So, the use-case i most care about, parallel_reduce, appears to use memory_order_release and memory_order_acquire. I believe that's RCpc (not RCsc, not relaxed). The one memory order that i was hoping to leave out.

And it makes sense, because Relaxed doesn't sync at all, and SC is global; RCpc is local and yet provides some synchronization, so i can see how it could be useful.

---

so can i leave out relaxed? lets look at http://rsim.cs.illinois.edu/Pubs/msinclair-thesis.pdf again, section 3.3 3.3 Relaxed Atomic Use Cases and DRFRlx Model.

let's see.. Unpaired Atomics Use Case :: Work Queue sounds useful. Here one thread is checking if another thread has put anything in its mailbox. Until there is something in the mailbox, there is no need to sync.

The given Commutative Atomics use case is an event counter, which i'm not sure if i care about, although i still feel like it could be useful for parallel reduce. The Non-Ordering Atomics use-case is in some ways similar to the previous, although i guess it's different, and i guess i care about it.

Speculative Atomics uses sequence numbers and hence is either not usable in all situations or is not 100% deterministically safe (what if the sequence number wraps around...). So i don't really care for this case.

The quantum case may be a bit too far, as pointed out in http://altair.cs.oswego.edu/pipermail/memory-model-design/2018-July/000089.html and quoted above. Also, it requires significantly complicating the formal model.

So it looks like, if i don't need speculative or quantum, i can stick with DRFrlx Formal Definition (Version 2) on page 61 after the section on non-ordering.

So, looks like, unsurprisingly, Relaxed may be useful.

---

So i don't think i want to give up much of C/C++'s memory_consistency_model. So i have:

all memory_orders except for memory_consume
except in relaxed memory_order, acquires on loads, releases on stores, both on RMWs.
FENCE memory_domain is seq_cst

lw: non-atomic load lw.a: atomic load (SC). C atomic_load(memory_order_seq_cst) (RCsc) lw.a.rc: atomic load (release consistency). C atomic_load(memory_order_acquire) (RCpc) lw.a.rlx: atomic load (relaxed). C atomic_load(memory_order_relaxed) ... and the matching stores ... cas: C atomic_compare_exchange_weak(memory_order_seq_cst) cas.rc: C atomic_compare_exchange_weak(memory_order_acq_rel) cas.rlx: C atomic_compare_exchange_weak(memory_order_relaxed) and, not sure if i need this next one, but possibly: fence(memory_domain): C atomic_thread_fence(memory_order_seq_cst) (strengthened as per http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0668r5.html , section 'Strengthen SC fences') with an extra memory_domain argument indicating which shared memory domain to sync

(that's 12 instructions in total)

---

actually https://en.wikipedia.org/wiki/Release_consistency claims that RCpc is sufficient for DRF? http://pages.cs.wisc.edu/~markhill/theses/sarita_adve.pdf seems to me to disagree but i'm not sure i'm understanding correctly; he says "two flavors of release consistency(RCsc and RCpc)" and "A variant of release consistency where special accesses are sequentially consistent is also proposed [GLL90]. This modelis abbreviated as RCsc, while the model with processor consistent special accesses is abbreviated as RCpc. Gharachorloo et al. identify a set of software constraints for which a system that obeys RCsc appears sequentially consistent. Programs that obey these constraints are called properly labeled (PL) programs."

---

this thesis deals with giving a DRF-like result for RCpc (see the part about 'PLpc2'):

http://pages.cs.wisc.edu/~markhill/theses/sarita_adve.pdf

it'd good but imo complicated. i guess DRFrlx probably shares the same problem.

in both cases, maybe what they are saying is that, if you analyze your program carefully, even when synchronizing instructions are not always SC, you can often prove that your program is still SC. And they provide some theorems to prove this in the common cases. But the theorems' premises are sufficiently complicated that i think unless there is compiler/linter/automated reasoning support, or unless there was a serious demand for correctness in their application, most programmers would not bother.

so these are more of an analysis tool for later than something we really need right now, although they are useful to inform our design choices.

---

i'm kind of running out of time for this issue.

i think we'd better stick to a subset of the C/C++ memory consistency model rather than try to roll our own.

i think the C/C++ memory model formalization is way too complicated but i don't have time to learn how to work on it myself, and so i can't tell if there are any design choices i could make here that would help simplify. And my project isn't big enough to be worth the experts' attention. If we get traction later then maybe we can get someone to come in and simplify everything.

as to which memory_orders, i think we need a SC. And i think we don't want consume (or dependencies in general). And i think we do need some sort of local synchronizaton. So, in C/C++, that forces us to have seq_cst, acquire, release. And i don't see how Relaxed is that much more complicated; we already have to support non-atomic accesses, this is just like those but atomic.

And although the tooling and theories seem to think that fences are not worth the complication they bring, i don't really see how we can deal with free'ing a block of memory upon which ordinary (e.g. relaxed) accesses were performed in the past unless we use a fence.

Perhaps a fence is just like an acquire (SC) to a dummy location followed immediately by a release (SC). C/C++ SC fences don't seem to be that simple, because 'Strengthen SC fences' in [4] suggests that they are defined independently of that? We should put this in the documentation in case it's useful.

Wait, [5] says that their fix is to 'Model SC-fences as acquire/release atomic updates of a distinguished fence location.'. But these are the same people whose paper was the basis of 'Strengthen SC fences' in [6]. The RC11 paper says "Among these, Lahavet al.[17]proposed strengthening the semantics of SC fences in a dif-ferent way from the way we do here, by treating them asread-modify-writes to a distinguished location. That strength-ening, however, was considered in the restricted setting ofonly release/acquire accesses, and does not directly scaleto the full set of C11 access modes. In fact, for the frag-ment containing only SC fences and release/acquire accesses,RC11-consistency is equivalent to RA-consistency that treatsSC fences as RMWs to a distinguished location". So they are saying it's not as simple as that?

as for OOTA and UB, i think that we just say:

a single thread’s execution is consistent with the single thread semantics of the language, given the value returned by a shared memory read
a read returns the value of some write to the same address after ((some event TBD, such as a fence or an acquire or release; this part is important so that you don't have a read to memory which in the past was free'd and reallocated memory containing values from its previous incarnation))
i think in addition to the formal section saying "use C/C++20", i think i'd better have an 'informal section' where i informally describe my own understanding of all of these things! such as:
- FENCE as acquire_SC/release_SC to a dummy memory location
- acquire as preventing any instruction from moving from after it to before it, plus introducing happens-before from anything before any release that it reads-from, to anything after it. Release as the opposite.
- roach motel movement
- OOTA is undefined? or, must read something written to that memory location since EVENT?
- RCpc acquire..release to different variables can overlap, RCsc not. RCsc also being SC relative to everything else SC in the whole program
- everything SC in the whole program standing in a total order with each other, and also in program order within each thread (with each other)
- FENCE ordering everything before it before everything after it, and being SC
- FENCE before memory frees
- relaxed as like non-atomics but atomic
- coherence totally ordering all writes to each memory location
- a golang-like read rule? or RISC-V like?
- defn data race
- SC for DRF
- defn CAS weak
- no dependency ordering
- IRIW being allowed without SC
- how to say: no reorderings allowed unless they would be correct in a single-threaded program? each thread sees its own reads and writes correctly?
- PC-ness of 'RCpc's

---

" In the case of C/C++11, Batty et al.[2011]established the correctness of a mapping to x86-TSO, while Batty et al.[2012] addressed the mapping to POWER and ARMv7. However, the correctness claims of Batty et al.[2012] weresubsequently found to be incorrect [Lahav et al.2017; Manerkar et al.2016], as they mishandledthe combination of sequentially consistent accesses with weaker accesses. Lahav et al.[2017]developed RC11, a repaired version of C/C++11, and established (by pen-and-paper proof) thecorrectness of the suggested compilation schemes to x86-TSO, POWER and ARMv7." -- http://plv.mpi-sws.org/imm/paper.pdf

---

this seems to be a method for computing if a program with release/acquire satisifes a DRF theorem:

[7]

---

a comment near the bottom of https://lists.riscv.org/g/tech-memory-model/message/1123 appears to be saying that if you have RCpc lw.aq and sw.rl, and if you have a fence 'which orders all earlier release stores with all later acquire loads' then you can make those RCpc instructions into RCsc ones by adding that fence.

(note: "roach motel reordering" seems to be the term for when stuff outside of an acquire..release critical section moves into it. This can happen even with RCsc i thin. The other thing that can happen with RCpc that i think is prevented with RCsc is acquire..release..acquire..release and the second acquire moves before the first release)

hmm.. that only costs one opcode (for the fence) instead of 2 (for lw.acsc and sw.rlsc). But more generally, is it better to have a fence or instructions for this? The fence can't help but reorder ALL releases before it from ALL acquires afterwards, whereas RCsc is more specific? But what happens if you have "sw.rl; lw.aqsc" anyways? Can the sw.rl move past the lw.aqsc because it is not SC, or does the 'sc' in the lw.aqsc block it? I dunno, but at least with the fence idea, i know i wouldn't even be asking.

another question i have... if with an acquire 'All writes in other threads that release the same atomic variable are visible in the current thread', then i assume that applies to the same thread too, and that acquire..release..acquire..release can't overlap even with RCpc if both of the critical section (that is, the 'acquire..release's) are acquiring/releasing the same memory location (and indeed, by coherence (i think), all writes each single memory location are totally ordered anyways). So they're only overlapping if those are different variables. So, again, it seems to be that the difference between RCpc and RCsc is that RCsc requires global ordering, global over both threads, and over memory locations.

---

btw https://lore.kernel.org/lkml/20180705183836.GA3175@andrea/ suggests that the term RCsc is not even well-defined!

also, multiple times i've seen someone say in the RISC-V WG that the performance difference between RCpc and RCsc is small. That goes against my intuition that RCsc is a global sync operation and RCpc (assuming NON-cumulativity) is local between the threads concerned. But maybe my intuition is wrong? Or maybe the 'small performance difference' is in the context of RISC-V where multi-copy atomicity (a global notion of ordering) is already assumed?

also, remember that everyone but me is thinking of up to like 50 CPUs, where i'm thinking of a million.

---

old text in boot_reference that i am about to remove and rewrite:

Formal memory consistency model

put this somewhere: Perhaps a fence is just like an acquire (SC) to a dummy location followed immediately by a release (SC). C/C++ SC fences don't seem to be that simple, because 'Strengthen SC fences' in [8] suggests that they are defined independently of that? We should put this in the documentation in case it's useful.

put this somewhere: http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/index.html

TODO: delete all this and just use RVWMO. later: no, use C/C++

Note: this model is not entirely formal yet but it is meant to become so -- someday in the far, far distant future, only if Boot gets used a lot. Right now, it's probably horribly inconsistent.

todo

We have two goals here. (1) to specify the intra-thread-happens-before and inter-thread-happens-before partial orders, which represents the ordering guarantees given by the Boot platform, and then (2) to specify the set of values that can be returned by loads of memory.

It is a mathematical theorem that all partial orders are acyclic.

Specifying intra-thread-happens-before

By 'program order' we mean the dynamic order, not the static order. When we speak of instructions 'before' and 'after' each other in the same thread, we are implicitly speaking of this dynamic 'program order'. For example, if some instruction I is found within a loop, then there might be multiple instances of I in program order, one for each loop iteration, and we will say that the instances from earlier iterations come before the instances from any later iterations.

In order to specify intra-thread-happens-before, we first introduce another partial order, directly-intra-thread-happens-before. directly-intra-thread-happens-before represents synctactic and control dependencies between instructions in the same thread.

If an instruction A writes a value to a register or to the smallstack, and then if another instruction B is guaranteed by the semantics of single-threaded Boot programs to read that value from that register or smallstack location, then the: instruction A directly-intra-thread-happens-before B.

If instruction A is before a control instruction B in the same thread, and A is directly-intra-thread-happens-before B, then: A directly-intra-thread-happens-before (every instruction after B in the same thread).

intra-thread-happens-before is the transitive closure of the directly-intra-thread-happens-before partial order.

TODO: do we really want a separate intra-thread-happens-before ? the motivation of this is just to prevent dependencies from being necessarily globally observed, but maybe that's not worth the complexity. i think most other contemporary memory consistency models don't bother with this; but i think C++ does (see example under Relaxed Ordering in https://en.cppreference.com/w/cpp/atomic/memory_order )... but imo C++ looks more complicated than RVWMO... i'm beginning to see the light regarding the stronger semantics of RVWMO, if they make this stuff simpler....

Specifying inter-thread-happens-before

In order to specify intra-thread-happens-before, we first introduce another partial order, directly-intra-thread-happens-before. directly-inter-thread-happens-before represents the effects of ordering instructions.

If an ordering instruction A is before an ordering instruction B in the same thread, then: A directly-happens-before B.TODO: i think this is RCsc but we need RCpc?

(Every instruction before a FENCE instruction in the same thread) happens-before (that FENCE instruction), and also (that FENCE instruction) directly-happens-before (every instruction in the same thread after that FENCE instruction).

If an ordering instruction A reads from memory, then: A directly-happens-before (every instruction after A in the same thread). If an ordering instruction A writes to memory, then: (every instruction before A in the same thread) directly-happens-before A.

If A intra-thread-happens-before B, and B directly-happens-before C, then: A directly-happens-before C.

happens-before is the transitive closure of directly-happens-before.

Specifying allowed reads

TODO sc ordering instructions (not totally ordered in happens-before, but GLOBALLY totally ordered in each execution; RCpc are only locally totally ordered in each execution)

Boot instructions in the same thread must observe each other in the intra-thread-happens-before partial order.

TODO release syncing with previous acquires;

TODO acyclic "

---

an interesting example of transitivity/cumulativity using only C/C++ release/acquire memory_orders [9]. I think this would behave the same way in my own 'semiformal model', although mb for different reasons (my model doesn't speak of inter-thread-happens-before).

---

removed the following text from boot_reference:

" This is a work in progress. I am going to give the model in two ways. First, i'll give the mapping between memory operations and equivalent C/C++20 instructions, and specify that our memory consistency model is the same as the fragment of C/C++20's that covers these. Second, I'll give my own formal understanding of the model. Surely my own formal understanding is wrong, and so these will conflict, and in that case I'm not sure which is to be considered the right one for Boot. If Boot ever gets enough traction that it matters, perhaps we can recruit someone with enough expertise in memory consistency models to fix it.

Semi-formal memory consistency model, first try: C/C++20

Here I give the Boot memory operations, and for each one, the C/C++20 equivalent operation and memory_order that defines its semantics (note: actually you'd have to use the *_explicit form of many of these C/C++ functions in order to specify a non-default memory_order; for conciseness, we omit that here). Our memory consistency model is that of C/C++20, restricted to the memory operations provided below.

lw: non-atomic load
lw.a: atomic_load(memory_order_seq_cst)
lw.a.rc: atomic_load(memory_order_acquire)
lw.a.rlx: atomic_load(memory_order_relaxed)
sw: non-atomic store
sw.a: atomic_store(memory_order_seq_cst)
sw.a.rc: atomic_store(memory_order_release)
sw.a.rlx: atomic_store(memory_order_relaxed)
casrmw: atomic_compare_exchange_weak(memory_order_seq_cst)
casrmw.rc atomic_compare_exchange_weak(memory_order_acq_rel)
casrmw.rlx: atomic_compare_exchange_weak(memory_order_relaxed)
fence(memory_domain): atomic_thread_fence(memory_order_seq_cst), with an extra memory_domain argument indicating which shared memory domain to sync

This list is a list of equivalencies (within a single memory domain); that is, if a Boot program were to be compiled to C/C++, then each Boot operation on the left should be replaced by the indicated C/C++ operation on the right, and also if C/C++ were compiled to Boot, then each C/C++ operation on the right should be replaced by the indicated Boot operation on the left. A valid Boot program with multiple memory domains can be safely compiled to C/C++ by pretending that all of the memory_domains are the same. (so i guess in this model we don't really define the additional reorderings that could be observed due to multiple memory_domains).

Any other operations that involve reads or writes to memory are assumed to involve non-atomic loads and stores.

You can explore allowed and disallowed executions in C/C++ with the CppMem? tool: http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/index.html

An OOTA (out-of-thin-air) cycle is a read instruction r that takes its value from a write instruction w whose value was ultimately computed from r. Programs with possible executions with OOTA cycles are invalid (an implementation may refuse to compile them, or may halt while executing the program with a runtime error) but not undefined (if an implementation does not halt with an error, it must execute the program according to the specified semantics). "

---

stackoverflow question about IRIW in the cppreference example for seq_cst (the answer: yes, in C seq_cst is needed to prevent IRIW): https://stackoverflow.com/questions/50462948/acquire-release-vs-sequential-consistency-in-c11

---

stackoverflow-ish question about whether acquire/release are transitive (in a certain sense; if you have a write to a variable X before a release on a variable Y, and then in another thread an acquire on Y and then a release on a new variable Z, and then in a third thread an acquire on Z, now is that third thread guaranteed to see the write to X?): the answer is, yes they are: https://softwareengineering.stackexchange.com/questions/253537/in-c-are-acquire-release-memory-order-semantics-transitive

and btw as of 190726 my semi-formal semantics would also guarantee that the write to X would be seen

---

still thinking about making Boot 16-bit, esp. since now i'm planning just to implement BootX? directly. Could just say that it's 'at least 16-bit, but platform dependent', which allows it to be 32-bit on 64-bit systems. We already have abstracted the size of an int's representation in bytes/storage locations.

If the thread IDs are 16-bits, that's not enough for more than 64k threads, which may be too little -- however, explicit operations on threads aren't introduced until at least BootX?, anyhow.

---

my posts to the RISC-V groups about RVWMO:

https://groups.google.com/a/groups.riscv.org/forum/#!topic/isa-dev/SoJMEZSK700 https://lists.riscv.org/g/tech-memory-model/topic/mapping_lw_sw_to_c_c/32530032?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,32530032

---

mb only coherence property for atomic ops?:

"Note that as described above, the read-others’-write-early re- striction does not require any global serialization point for all oper- ations to all locations (e.g., a bus). It simply requires a serialization point for reads and writes to the same location, which is usually provided by the cache coherence protocol (the need for which is widely accepted). Further, this per-location serialization is techni- cally only required for atomic operations. Unfortunately, the use of fences for memory ordering obfuscates which memory accesses are the atomic ones; therefore, without a mechanism to distinguish in- dividual writes as atomic, our system must preserve write atomicity for all writes.

nah.. most other single-computer-hardware things i've seen have coherence for everything. in fact [10] may even have a stronger defn of coherence than we are using (i think we are only ordering the writes? but we have restrictions on which write a read may read elsewhere, so not sure if we are actually stronger).

---

i think the program-order relation (called sequenced-before in C) is a partial order in many languages

---

i think that what made RVWMO simple is also what made it too strong for me (multi-copy-atomicity). This prohibits the IRIW litmus test, which means that it must compile everything to the seq_cst memory_order when compiling to C/C++

---

" Another point to be noted is that consistent (or matching) axiomatic and operational definitions for a non-atomic memory model is yet to be provided. For example, the axiomatic definition in [2] is weaker than the operational definition in [8]. Whereas, in the case of atomic memories, (like TSO), people have proven the equivalence between axiomatic and operational models [9]. Last but not the least concern is that no two non-atomic memory models have any containment relation- ship. For example, the set of behaviors allowed by the POWER operational model [8] is neither a subset nor a superset of the set of behaviors allowed by the ARM operational model [4]." from ml primer

---

if in the memory model, we do dependencies, remember that loads don't carry a dependency (at least in the RISC-V memory model, they don't)

Atomicity needed for our relaxed atomic

---

arm's axiomatic memory consistency model initial paper is without arm version 8.3's weaker version of load acquire. But a later version has it in a different paper.

---

" Swift reserves a callee-preserved register for a method's self argument (pointer) to make repeated calls faster. Cool? "

---

dragontamer 3 days ago [-]

...

I think the acquire/release model of consistency will become more important in the coming years. PCIe 4.0 is showing signs of supporting acquire/release... ARM and POWER have added acquire/release model instructions, and even CUDA has acquire/release semantics being built.

As high-performance code demands faster-and-faster systems, the more-and-more relaxed our systems will become. Acquire/release is quickly becoming the standard model.

jabl 2 days ago [-]

> ARM and POWER have added acquire/release model instructions

They have implemented the acquire-release consistency model since day one (or, the day they started supporting multi-processors). Yes, there are some subtleties there that have in some cases been tightened later on, e.g. multi-copy atomicity.

dragontamer 2 days ago [-]

IIRC, there was a big stink because ARM and POWER historically implemented consume/release semantics, which is very slightly more relaxed than the now de-facto standard acquire/release semantics.

ARM and POWERPC CPU devs worked very hard to get consume/release into C++11, but no compiler writer actually implemented that part of the standard. As such, consume/release can be safely forgotten into the annals of computer history (much like DEC Alpha's fully relaxed semantics)

Then in ARM8, ARM simply added LDAR (Load-acquire) and STLR (Store-release) instructions. https://developer.arm.com/docs/100941/0100/barriers . So the ARM CPU how fully supports the acquire/release model. Apparently IBM's POWER instruction set was similarly strengthened to acquire/release (either POWER8 or POWER9).

ARM / POWER "normal" loads and stores are still consume/release semantics. But compilers can simply emit LDAR (load-acquire) for the stronger guarantee.

I remember at least one talk that showed that consume/release is ideal for things like Linux's RCU or something like that (that acquire/release is actually "too strong" for RCU, and therefore suboptimal). But because compiler-writers found consume/release too hard to reason about in practice, we're stuck with acquire/release.

It seems like the C++ standard continues to evolve to push for memory_order_consume (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p075...), but all the details are still up for discussion.

jabl 2 days ago [-]

> ARM and POWERPC CPU devs worked very hard to get consume/release into C++11

AFAIR Paul McKenney? was the primus motor, and the motivation was largely RCU. Then again, McKenney? also worked for IBM at the time and certainly had an interest in pushing a model that mapped well to POWER.

But it turned out to be both somewhat mis-specified and hard to implement cleanly, so most compilers just implemented it as an acquire.

As you mention, there is ongoing work to fix it.

As for ARM, it seems the big thing they've done since the initial release of ARMv8 is to banish non-multicopy atomicity. See https://www.cl.cam.ac.uk/~pes20/armv8-mca/armv8-mca-draft.pd...

---

One bitInterrupt like messages. 16 types. Receiving process sets one of four modes for each type:

Interrupt Overwrite And (All incoming messages must be one and the current state must be one for the state to be one otherwise it becomes zero) Or

So now we have a 32 bit interrupt mask to set these modes

could Generalize by having four bit messages rather than one bit -- Now the state is 64 bits. Would also like to have a counter Mode though (which triggers an interrupt upon overflow of the counter), As well as a repeater mode that merely forwards the message to one or more other processes. And ignore mode that drops the message

would Also like synapse mode which I guess is similar to the counter mode. You need to be able to set how many impulses come in before overflow however

Maybe we could have 16 modes in which case the Interrupt mask and the state Vector are both 64 bits . Maybe 4 supermodes with 2 bit argument. Maybe in addition to the 4 bit per type state, There are also a few 16 bit registers

How to do stuff like https://www.izhikevich.org/publications/spikes.htm

---

wouldn't it be great to have a hierarchy of parallel processors, with addressable numbers of processors, amount of memory per processor, and and bitwidth, going from small to large, and instruction set architecture going from simple to complex?

e.g.:

bitwidth 4: 16 x processors each with 16 nibbles of local memory bitwidth 8: 256 x (bitwidth 4 clusters) each with 256 bytes of local memory bitwidth 16: 64k x (bitwidth 8 clusters) each with 64k of local memory bitwidth 32: up to 4g c (bitwidth 16 clusters) each with up to 4g local memory

---

yknow, mb Boot should really just be a register machine (sigh). It's simpler. Mb BootX? can have stacks.

we could have a 32-bit machine with 2 banks of 8 regs (32-bit ints, and ptrs) (or mb 16 regs?). 16 bit instructions with no instruction length encoding(!); 4 opcode bits and 3 operands of 4 bits each. 15 3-operand instructions, 15 2-operand instructions, 15 1-operand instructions, and 16 0-operand instructions (could reduce operand size if we need more opcodes).

just 32-bit integer arithmetic, and also loads/stores of 8-bit and 16-bit subwords. no stacks, no floating point(?), no magic regs except 0. Facilities to deal with different instruction sizes and word sizes in memory, and interop calls (and my indirect control flow restrictions; these assist with interop). Dead simple compatibility layer. Similar to 32-bit integer subset of WASM/Cranelift (aside: woah, as of 19/11 cranelift has gotten a lot more complicated; use the old version https://web.archive.org/web/20180907213959/https://cranelift.readthedocs.io/en/latest/ir.html ), IRE, RCPU, HSVM, RISC-V, Transputer, etc. or mb we can even stuff in floating point and 64-bit ints at the expense of some of those atomicity instructions that we have right now (i bet we can). Really just go to town on the concept of 'the simple ISA inside Cranelift and RISC-V (etc; be sure and consult my Concordance, including the appendix with my analysis of the RISC-V geneology paper) that is trying to get out, adapted for interop'. The previous work on Boot isn't wasted; it allowed me to see what sort of simple instructions and numbers of registers can easily fit into the more compressed encodings that i ultimately want.

then BootX? has the smallstacks, the stack registers, the variable length encoding with 8/16/32 bits, floating point. Not compatible with Boot at all b/c Boot has no encoding format bits (however, shared instructions do share semantics). Similar to WASM/Cranelift, RISC-V. Or, if we fit enough floating point and 64-bit arith into the new Boot, then we can dispense with BootX? and put everything else (including the compressed 16-bit encoding) into OVM.

then Ovm has addressing modes etc. Similar to JVM, CLR, etc.

---

ok so let's explore the idea of making Boot more conventional (no stack, 32-bit, floating point), getting rid of BootX?, and stuffing all the interesting stuff that we are removing from Boot, as well as BootX?, into OVM.

What we need to make is the new instruction set for Boot. the current instruction set, from constants.py:

jpc # bit pattern must be all 0s so that the 'bad0' alias refers to something illegal jk

add and ann ap appi # add (constant * INT_SZ) to pointer apwi # add (constant * PTR_SZ) to pointer casrmw # cat integer atomic, sequential consistency (AMO/RMW) casrmwrc # cas integer atomic, release consistency (both acquire and release) (AMO/RMW) casrmwrlx # cas integer atomic, relaxed (AMO/RMW) casprmw # cat pointer atomic, sequential consistency (AMO/RMW) casprmwrc # cas pointer atomic, release consistency (both acquire and release) (AMO/RMW) casprmwrlx # cas pointer atomic, relaxed (AMO/RMW)

malloc_shared mul or sll srs sru sub xor

addrmw # add atomic, sequential consistency (AMO/RMW) # todo mb remove? aprmw # ap atomic, sequential consistency (AMO/RMW) # todo mb remove? andrmw # and atomic, sequential consistency (AMO/RMW) ## todo mb remove? orrmw # or atomic, sequential consistency (AMO/RMW) ## todo mb remove? xorrmw # xor atomic, sequential consistency (AMO/RMW) ## todo mb remove?

put the branches together so that the decoder can check for a simple
opcode range to decide if it needs to apply the special branch offset
interpretation to the op2 immediate beq beqp bne bnep blt

sysinfo in out inp outp devop

lk mfree lpc jt jy fence

break

MOVs/copies between registers cp cpp
MOVs/copies between registers and stack crsi # copy to register from stack, integer crsp # copy to register from stack, integer csri # copy to stack from register, integer csrp # copy to stack from register, pointer
loads lb lbu lh lhu lp lpa # load pointer, atomic, sequential consistency. Must be aligned. Like a C atomic_load_explicit with memory_order_seq_cst. lprc # load pointer, atomic, release consistency RCpc (acquire). Must be aligned. Like a C atomic_load_explicit with memory_order_acquire. lprlx # load pointer, atomic, unordered. Must be aligned. Like a C atomic_load_explicit with memory_order_relaxed. lw lwa # load word, atomic, sequential consistency. Must be aligned. Like a C atomic_load_explicit with memory_order_seq_cst. lwrc # load word, atomic, release consistency RCpc (acquire). Must be aligned. Like a C atomic_load_explicit with memory_order_acquire. lwrlx # load word, atomic, unordered. Must be aligned. Like a C atomic_load_explicit with memory_order_relaxed.

malloc_local

stores sb sh sp spa # store pointer, atomic, sequential consistency. Must be aligned. Like a C atomic_store_explicit with memory_order_seq_cst. sprc # store pointer, atomic, release consistency RCpc (release). Must be aligned. Like a C atomic_store_explicit with memory_order_release. sprlx # store pointer, atomic, unordered. Must be aligned. Like a C atomic_store_explicit with memory_order_relaxed. sw
swaps swapi # swap two registers, integer (note: if both register are $1, this swaps the top two stack locs) swapp # swap two registers, pointer (note: if both register are &1, this swaps the top two stack locs)
more stores swa # store word, atomic, sequential consistency. Must be aligned. Like a C atomic_store_explicit with memory_order_seq_cst. swrc # store word, atomic, release consistency RCpc (release). Must be aligned. Like a C atomic_store_explicit with memory_order_release. swrlx # store word, atomic, unordered. Must be aligned. Like a C atomic_store_explicit with memory_order_relaxed.

xrets0 xretsi xretsp xretr5 xretr12 xcalls0 xcallsi xcallsp xcallsii xcallspp xcallsip xcallspi xcallr5 xcallr12 xcallrv

xentrys0 xentrysi xentrysp xentrysii xentryspp xentrysip xentryspi xentryr5 xentryr12 xentryrv xpostcalls0 xpostcallsi xpostcallsp xpostcallr5 xpostcallr12 halt

blt bne beq beqp

bad0 bad1

new instruction set draft is in the next section.

---

ok currently we have about 86 instructions here. That's not too close to 60, and we'd like to add the floating point. So i don't think we'll be able to fit the opcode into 4 bits... unless we don't really need that many 3-operand instructions... actually i think we can do it if we're willing to be brutal about making things 2-operand instructions whenever possible, and about eliminating not-strictly-necessary instructions like beq

3-operand instructions: jpc # bit pattern must be all 0s so that the 'bad0' alias refers to something illegal jk li add ap casrmw # cas integer atomic, sequential consistency (AMO/RMW) casrmwrc # cas integer atomic, release consistency (both acquire and release) (AMO/RMW) casrmwrlx # cas integer atomic, relaxed (AMO/RMW) casprmw # cas pointer atomic, sequential consistency (AMO/RMW) casprmwrc # cas pointer atomic, release consistency (both acquire and release) (AMO/RMW) casprmwrlx # cas pointer atomic, relaxed (AMO/RMW) bne bnep blt more1 more2

2-operand instructions (TODO 43 needs to be 31+more): and ann appi # add (constant * INT_SZ) to pointer apwi # add (constant * PTR_SZ) to pointer malloc_shared mul or sll srs sru sub xor

sysinfo in out inp outp devop

cp cpp lb lbu lh lhu lp lpa # load pointer, atomic, sequential consistency. Must be aligned. Like a C atomic_load_explicit with memory_order_seq_cst. lprc # load pointer, atomic, release consistency RCpc (acquire). Must be aligned. Like a C atomic_load_explicit with memory_order_acquire. lprlx # load pointer, atomic, unordered. Must be aligned. Like a C atomic_load_explicit with memory_order_relaxed. lw lwa # load word, atomic, sequential consistency. Must be aligned. Like a C atomic_load_explicit with memory_order_seq_cst. lwrc # load word, atomic, release consistency RCpc (acquire). Must be aligned. Like a C atomic_load_explicit with memory_order_acquire. lwrlx # load word, atomic, unordered. Must be aligned. Like a C atomic_load_explicit with memory_order_relaxed. malloc_local sb sh sp spa # store pointer, atomic, sequential consistency. Must be aligned. Like a C atomic_store_explicit with memory_order_seq_cst. sprc # store pointer, atomic, release consistency RCpc (release). Must be aligned. Like a C atomic_store_explicit with memory_order_release. sprlx # store pointer, atomic, unordered. Must be aligned. Like a C atomic_store_explicit with memory_order_relaxed. sw swa # store word, atomic, sequential consistency. Must be aligned. Like a C atomic_store_explicit with memory_order_seq_cst. swrc # store word, atomic, release consistency RCpc (release). Must be aligned. Like a C atomic_store_explicit with memory_order_release. swrlx # store word, atomic, unordered. Must be aligned. Like a C atomic_store_explicit with memory_order_relaxed.

1-operand instructions: lk mfree lpc jt jy fence

xret0 # should think about eliminating some of these xretr5 xretr12 xcall0 xcallr5 xcallr12 xcallrv

0-operand instructions: xentry0 xentryr5 xentryr12 xentryrv xpostcall0 xpostcallr5 xpostcallr12 halt

break

aliases: bad0 bad1

mb: sai addi # add constant to integer swapi # swap two registers, integer (note: if both register are $1, this swaps the top two stack locs) swapp # swap two registers, pointer (note: if both register are &1, this swaps the top two stack locs) beq beqp

hmmm.... looks to me like we're close, but we can't quite fit those 2-operand instructions into the number of opcodes available, not to mention the new floating point instructions that we want to add. The multiplicity of atomic consistency modes, and the duplication of integer/pointer instructions, is killing us.

We could:

consider not supporting so many atomic consistency modes, and say that if a program needs atomicity with the efficiency of relaxed or release consistency, they have to either drop down to platform-specific stuff, or use a platform-specific library, or go up to OVM.
switch from 16-bit instruction encoding to 32-bit. We have 32-bit arithmetic so 32-bit instructions is not too much to ask.
reduce operand size from 4 bits to 3. We don't really need (and some platforms won't support) 16 registers in each bank.

all options are reasonable, but actually i think we should probably go to a 32-bit encoding. The reason is so that we can support longer pc-relative jumps and bigger jump constant tables. Otherwise it's just making things hard for the compiler; in previous proposals we were sort of relying on 'well, in that case you can always jump up to the longer instructions in the variable-length encoding', but here we don't have a variable-length encoding to fall back on anymore. As a bonus, with 32-bits, the instruction is 4 fields of 1 byte each, which may be slightly simpler for some people to implement. 32-bits is kind of dumb because now we have 8 bit operands but we don't have 255 registers, but we're trying to be dead simple here, not optimal.

well... actually wait a minute though.. jk and jt actually use a register operand to index into the jump table anyways, so the instruction encoding size doesn't matter for them. It matters for JPC, but the compiler can easily replace any given JPC with a constant load followed by a jk, so limiting JPC to 9 or 12 bits is not terribly burdensome; in fact i guess you could use all JKs and never use JPC; seen this way, JPC is just an optimization.

an argument for 16 bit, with a format bit, is that then OVM can just use this format as-is for its 16-bit representation. But that's getting into dangerous territory; because then we give up on all the optimizations that we could have had if we had used the more optimized 16-bit compressed format.

also, if we use 16 registers, we want to forbid use of registers 0-4, so as to reserve them for BootX?-style stack registers.

an argument for just using 32-bits is that it does seem simpler.

---

yeah, so i guess we should go for the simpler thing. Eschew any instruction encoding compatibility with OVM? Or use a bunch of format bits? If no format bits, then 32-bit representation with 4 8-bit fields: opcode and 3 operands. Should we just have 256 virtual registers? Maybe. If so, definitely have an annotation for each scope of code saying how many registers are in use (similar to what Lua does).

so here's a first guess at an instruction list:

meta and misc: ann sysinfo break nop

unconditional control flow: jpc # bit pattern must be all 0s so that the 'bad0' alias refers to something illegal jk jt jy lpc halt

conditional control flow: beq beqp bne bnep blt ble bgt bge bltp

constants: li sai lki lkp

data movement: cp cpp

arithmetic: add addi # add signed immediate constant to integer ap and appi # add (signed immediate constant * INT_SZ) to pointer apwi # add (signed immediate constant * PTR_SZ) to pointer mul or sll srs sru sub xor

allocation: malloc_shared malloc_local mfree

io: in out inp outp devop

memory access: lb lbu lh lhu lp lpa # load pointer, atomic, sequential consistency. Must be aligned. Like a C atomic_load_explicit with memory_order_seq_cst. lprc # load pointer, atomic, release consistency RCpc (acquire). Must be aligned. Like a C atomic_load_explicit with memory_order_acquire. lprlx # load pointer, atomic, unordered. Must be aligned. Like a C atomic_load_explicit with memory_order_relaxed. lw lwa # load word, atomic, sequential consistency. Must be aligned. Like a C atomic_load_explicit with memory_order_seq_cst. lwrc # load word, atomic, release consistency RCpc (acquire). Must be aligned. Like a C atomic_load_explicit with memory_order_acquire. lwrlx # load word, atomic, unordered. Must be aligned. Like a C atomic_load_explicit with memory_order_relaxed. sb sh sp spa # store pointer, atomic, sequential consistency. Must be aligned. Like a C atomic_store_explicit with memory_order_seq_cst. sprc # store pointer, atomic, release consistency RCpc (release). Must be aligned. Like a C atomic_store_explicit with memory_order_release. sprlx # store pointer, atomic, unordered. Must be aligned. Like a C atomic_store_explicit with memory_order_relaxed. sw swa # store word, atomic, sequential consistency. Must be aligned. Like a C atomic_store_explicit with memory_order_seq_cst. swrc # store word, atomic, release consistency RCpc (release). Must be aligned. Like a C atomic_store_explicit with memory_order_release. swrlx # store word, atomic, unordered. Must be aligned. Like a C atomic_store_explicit with memory_order_relaxed. casrmw # cas integer atomic, sequential consistency (AMO/RMW) casrmwrc # cas integer atomic, release consistency (both acquire and release) (AMO/RMW) casrmwrlx # cas integer atomic, relaxed (AMO/RMW) casprmw # cas pointer atomic, sequential consistency (AMO/RMW) casprmwrc # cas pointer atomic, release consistency (both acquire and release) (AMO/RMW) casprmwrlx # cas pointer atomic, relaxed (AMO/RMW) fence

interop: xret0 # should think about eliminating some of these xretr5 xretr12 xcall0 xcallr5 xcallr12 xcallrv

0-operand instructions: xentry0 xentryr5 xentryr12 xentryrv xpostcall0 xpostcallr5 xpostcallr12

floating point (optional extension):

load constant: lkf

convert signed integer to floating point: i2f

convert floating point to signed integer: ceil floor trunc nearest

arithmetic: addf subf mulf divf # we dont trap on divide by zero

floating-point signs and classification: copysign bnonfinite bnan binf

comparison: beqf # equivalence relation bltf # equivalence relation fcmp # typical floating point comparison, eg NaN? != NaN?; see ootArithmeticThoughts.txt

unsigned integer arithmetic or conversions?

mb but probably not: div rem call ret remf sqrt minf maxf powf fcmp

aliases: bad0 bad1

Should we say that the bitwidth of floating point is implementation-dependent?!? Note that e.g. Javascript only offers 64-bit, whereas ARM Cortex M4 only offers 32-bit.

I guess we could even say that about integers; we could say in both cases, the bitwidth is at least 32 bits but may be higher.

notes on registers:

there are three register banks; 32-bit integers, opaque pointers, and 64-bit floating point

in all register banks,

register 0 is constant 0 (otherwise we'd want instructions like bnz). For floating point instructions, register 0 is constant +0 (so e.g. copysign can be used for abs, and fneg can be accomplished by 0 - x)
registers 1 thru 3 are reserved
registers 16 and above are reserved

should we preserve the idea of the preallocated 'zero segment' at &0, or should &0 be a null pointer?

a few notes on encoding:

although i said we don't care about compatibility with OVM variable-length encoding, still, it would be great if we could manage that. One conundrum is that we have about 100 instructions already, but we only use 16 registers, so most of the bits we have free are in operands -- but to make things dead simple, if we are going to prohibit anything, it would be best to restrict opcodes rather than operands, so that we can use signed arithmetic and people don't have to think hard about the sign bit in constants (if we tried to make use of the high-order bit), yet also don't have to do bit operations to 'decode' operands (if we tried to make use of the low-order bits).
i think it would be simpler if we assume that the platform that the porter is using gives them access to the program as a bytestring, and assume that they are consuming instructions as a series of bytes, rather than as a 32-bit integer. In this case, to support OVM variable length encoding, it's simplest if the format encoding bits appear somewhere in the first byte of the instruction.
if we want opcode space to add further interop instructions, however, i think we really need at least 128, and maybe 256, opcodes. Which means that if we further want to reserve bits, we are forced to do so in operands. Which means that we should put the opcode byte at the end, not the beginning, of the instruction. Which means that we should restrict at least one immediate constant to small numbers. To make it easy to implement, we should do the same thing for all 3 operands. Which means that we should either restrict immediate constants to be positive, or we should use some weird encoding. Perhaps the way to do this is to use only positive immediate constants except for branches, which use an offset encoding. -- one thing i don't like about this, though, is that if we reserve, say, 3 bits off of each operand (so that we get 3 format bits at the beginning), now we only have 5*3 = 15 operand bits in total, which is pretty limiting in those cases when we concatenate the operands to make one big operand (e.g. to index into a constant or jump table). This might be important enough to make a special case of the first operand ('first' being whichever one we decide to put into the first byte, which is our choice).

so i think where i'm going with the encoding is: each instruction is thought of as 4 bytes (rather than one 32-bit integer, although we should consider that case as well). The first of these 4 bytes is limited to having a maximum value (as an u8) of 31. This represents one of the operands (probably the 'dest' operand).

this reduces the bitwidth of the concatenated operands from 24 (16 MiB?) to 21 (2 MiB?).
now the first of the 4 bytes can't be negative with a signed twos-complement representation. When/if negativity is ever desired an offset (bias) encoding can be used.
Registers don't go above 31 anyway, so this only matters when the operand is being used as an immediate constant. I should arrange the instruction encoding so that immediate constants don't use that operand anyways, except when all 3 operands are being concatenated.
this allows this fixed-length instruction set to be embedded as one choice within a variable-length instruction set

note: it should be noted with bltp that pointers are required to be partially ordered, but not totally ordered; this means that there may be some platforms with some pairs of pointers p1, p2, such that the pointers are not equal (beqp p1 p2 does not branch), but also neither one is less than the other (neither bltp p1 p2, nor bltp p2 p1, branch). However, it is prohibited for a pointer to be both less than and greater than the other; that is to say, it is prohibited for both bltp p1 p2 and bltp p2 p1 to branch (this prohibition corresponds to the antisymmetry property of partial orders in math).

---

since we are thinking about giving Boot 64 registers instead of 16, we would need to respecify the calling convention. One idea is to use even/odd for caller/callee save.

---

should Boot be provided with an in-memory stack?

---

(right now Boot has 4 profiles)

i think all this stuff about profiles is too complicated for little old Boot. Some of this arises from the history of me trying to keep the instruction count down in order to bit in the 16-bit encoding, which we are not using anymore. I think i should just throw in everything up to standard, plus addi not beq beqp sai appi apwi break, and mb the lg floating point stuff (remf is pretty useless tho right?) and leave the large atomics to lovm.

alternately, combine Boot and LoVM?. Since loVM adds so little, that's actually sounding like a good idea. But doesn't OVM want programs to guarantee that, except by using call/return/continuations/exceptions/etc, it won't just use GOTO to jump between procedures? Hmm, and there's the whole local variable thing. So mb keep them separate. So what i would like is to have one profile with everything that the Oot toolchain demands, and one or more profiles with everything else. Possibly 'everything else' could include a filesys extension, an expanded atomics extension, and an (integer division and floating point) extension.

hmm actually that may not be possible. There will be various system calls that some platforms provide and not others (eg nonblocking io), and various optimized primitives that some provide and not others (e.g. specific cryptography operations; dictionaries). My original idea was to push all this up to OVM but now i think it's easier for porters if they can plug all this in at the Boot level without porting OVM. So that means there are a lot of optional instructions in Boot, each of which is either individually optional or at most can be aggregated into small sets of instructions (extensions). I think that is inescapable.

What we can do is minimize the number of profiles below the set of things that Oot requires.

---

we may want to postpone the variable-length encoding, and the 64-bit encoding with address modes, to the OVM level? It looks like the main parts of lovm, the control abstractions and the local variables, can be expressed by just adding a few control opcodes in the same 32-bit encoding format as Boot, and then reinterpreting the code to refer to local variables instead of registers (so, to be clear, lovm would not just be a few more instructions added to Boot, it's a completely different semantics for the same encoding, because the semantics of registers and local variables differ).

that would suggest that maybe OVM should also be a linear bytecode, and we shouldn't have an AST until we get up to Oot Core? hmm.

i guess we have five main choices here:

lovm 32-bit, ovm 64-bit varlen
lovm 32-bit, ovm AST
lovm 64-bit fixed length, ovm 64-bit varlen
lovm 64-bit varlen, ovm 64-bit varlen
lovm 64-bit varlen, ovm AST

i guess one way to decide would just be to say it would be most beautiful if each step had a new level of complexity in the encoding format, matched by a longer bitwidth. But we can't have that, because Boot can't fit in 16 bits while remaining a dead simple 3-address code, and Oot Core is already an AST, which is at the top of our encoding format hierarchy. So Boot has to be 32 bits, and we don't really need a 128 bit format, so all that's left in between is 64 bits.

we could introduce a new type of encoding format, for instance we could reintroduce that NLP-based format i had before (called LONG on ootAssemblyThoughts, and talked about in ootAssemblyNotes13.txt and ootAssemblyNotes8.txt).

hmm skimming through ootAssemblyThoughts, there is a bunch of interesting stuff there, i guess OVM may be more complicated than i thought. Restricting our attention to the encoding format though, LONG is in some ways just a variable-length linear format that, unlike our 32-bit and 64-bit formats, but like many other assembly-type languages, not only allow length to vary within a small set of 'encoding format types', but also on a per-instruction basis, for both types of instructions (like the way that many assembly-ish languages have some opcodes that take a data/payload argument which immediately follows in the instruction stream, while other instructions do not; meaning that you don't know the total instruction length until you see which opcode it is), and also for modifiers (usually prefix modifiers) that can be attached to many different instruction types (for example, the x86 LOCK prefix, or the various CIL prefixes). In contrast to those, LONG has some 'self-describing' features intended to make it easy for the decoder to compute the total length of the instruction after reading the first byte (i think), and without looking up an opcode (so, without understanding or parsing the whole instruction).

in some ways, LONG is simpler than ASTs (instructions are not infinitely recursive 'nodes'), but in other ways, it's more complex (syntactically distinguished modifiers and annotations).

so that's interesting, but i think my main take away is that there's plenty of room at the top of our encoding format complexity hierarchy, it doesn't have to just end with 'AST'. So i think we should say:

lovm 64-bit varlen, ovm (either LONG or AST or something else)

---

so i'm going to slightly simplify Boot so that there is only one 'profile' smaller than 'standard'. 'standard' will be what the Oot implementation requires. The small profile will have two roles:

embedded systems, where the 'systemsy' stuff that the Oot implementation wants (eg malloc) isn't there.
a whitelist of Boot instructions that are 'safe' in some sense for 'direct execution'. So, probably computational and memory instructions but no syscalls, or something like that.

---

i'm removing this text from an (unsaved) boot_reference.md draft:

" There are four Boot profiles:

Tiny: 40 instructions. These should be able to be easily ported anywhere (provided the platform has a multiply instruction).
Small: 69 instructions. This profile adds integer division, swaps, floating point, atomics, and pointer comparions. The smallest embedded systems may not support these, but many embedded systems woul.
Standard: 94 instructions. This profile adds 'systems' instructions for accessing constant tables, allocating memory, I/O, interoperation, querying metadata about platform capabilities, and logging.
Large: 134 instructions. This profile adds various instructions that could be emulated in software, including various computational instructions, weak atomics (which are not required, as sequentially consistent atomics could always be used in their place), and breakpoint (a debugger could use annotation instructions instead).

tiny profile (32 instructions):

==
annotation	ann
constants	li
loads and stores and moves	cp cpp lb lbu lh lhu lp lw sb sh sp sw
arithmetic of ints	add sub mul
bitwise arithmetic	and or xor sll srs sru
arithmetic of pointers	ap
comparision control flow	bne bnep blt
other control flow	jrel jt jy lpc
misc	halt
==

small profile adds (29 instructions, for 61 total):

==
moves	swapi swapp
integer division	div rem
floating point	lkf i2f ceil flor trunc nearest addf subf mulf divf copysign bnan binf beqf bltf fcmp
pointer comparision	pcmp
atomics (sequential consistency)	lpa lwa spa swa casrmwa crsparmw fence
==

TODO unsigned integer arithmetic or conversions?

standard profile adds (25 instructions, for 86 total):

==
constants and constant tables	lki lkp jk
memory allocation	malloc_shared malloc_local mfree
I/O	in out inp outp devop
filesys	read write open close seek flush poll
interop	xentry xcall xret xpostcall library
misc	sysinfo log
==

TODO mb move lg profile to lovm?

large profile adds (46 instructions, for 134 total):

==
constant and arithmetic	addi not beq beqp sai appi apwi
atomics (release consistency)	lprc lwrc sprc swrc casrmwrc crsprmwrc
atomics (relaxed consistency)	lprlx lwrlx sprlx swrlx casrmwrlx crsprmwrlx
atomic additional rmw ops (sequential consistency)	addrmwa aprmwa andrmwa orrmwa xorrmwa addrmwa
atomic additional rmw ops (release consistency)	addrmwrc aprmwrc andrmwrc orrmwrc xorrmwrc addrmwrc
atomic additional rmw ops (relaxed consistency)	addrmwrlx aprmwrlx andrmwrlx orrmwrlx xorrmwrlx addrmwrlx
additional floating point	remf sqrt minf maxf powf bnonfinite bgtf beqtotf blttotf ftotcmp
misc	break

Note that the large profile adds only instructions that may be synthesized in software from the small profile.

Note that an implementation may legally provide sequential consistency when the program requests only release or relaxed consistency; furthermore, the additional RMW ops may be implemented using the CAS primitive; therefore, all of the atomics in the full profile may legally be implemented using only the atomics in the small profile as primitives. "

---

what about if we call an external function that takes a lot of arguments and some of those are strings? Neither nargsi and nargsp are in units of bytes, dont we want to point to an external struct with all of the arguments and just send the length? yeah in theory, but i guess it's okay to just say that our restrictive calling convention would force one of the pointers being passed to be a pointer to such a block

TODO should we have jrelcall variants of things like xcallii for calling within the Boot program?

no, and in fact, maybe we shouldn't have xcallii either. All that these do is save us from having to do two 'cp's preceding the call to move the integer inputs from whichever register they are in into the registers prescribed for the first two arguments by our calling convention. Who even knows if Boot code will use our calling convention? It might just use the stack. but.. this might make interop faster right because then the compiler can plug in the arguments straightaway without even doing those MOVs?

hmm... yes in the case of interop, no in the case of Boot code. So maybe keep those, but for interop only, like we have them currently.