i'm still wondering where to put stuff like thread control, a scheduler, and IPC primitives like those seen in microkernels. Also, i'm thinking one layer should support only cooperative multitasking and only the next layer up from that should support preemptive multitasking, but which layers?
OVM should have preemptive multitasking provided, so if cooperative multitasking comes first, it has to be, at the highest, somewhere in the layer below OVM (even if just as a library).
BootX provides a bunch of optional platform primitives, plus instructions that can be simply implemented as a few macroinstructions, so if scheduling is going to be programmed de novo by our toolchain on some platforms, then it should be on a layer higher than BootX.
Which leaves LOVM as the only option. The issue there is that a big part of the point of LOVM is that it's too annoying to write e.g. a garbage collector directly in Boot (or Boot+BootX). And the same thing applies to a scheduler.
hmm, i guess though that if it's a library in LOVM then that objection doesn't apply -- the library can itself be written in LOVM.
so it's looking like this stuff should be in a library in LOVM.
---
regarding smallstack size:
" Most Microchip PIC 8-bit micros have a hardware stack with a depth of only 8! (the size will vary for different PIC devices). Because the stack depth on these micros is so small it is used only for function calls. Each function call will consume one level of the hardware stack. The rest of the variables are pushed into a software stack which is automatically handled by the compiler....Your microcontroller (PIC16F1709) has a 16-level hardware stack, which is a fairly good depth." -- [1]
---
yknow, actually, let's take the macros out of the LOVM assembly and put them in Lo only.
done.
---
can we do this thing called 'NaN tagging' that wren does? It apparently allows you to have a uniform 8-byte representation for 32-bit ints, 64-bit doubles, and x86-64 pointers, avoiding the need to box floats or to use either pointers or >8-byte representations:
" A compact value representation #
A core piece of a dynamic language implementation is the data structure used for variables. It needs to be able to store (or reference) a value of any type, while also being as compact as possible. Wren uses a technique called NaN? tagging for this.
All values are stored internally in Wren as small, eight-byte double-precision floats. Since that is also Wren’s number type, in order to do arithmetic, no conversion is needed before the “raw” number can be accessed: a value holding a number is a valid double. This keeps arithmetic fast.
To store values of other types, it turns out there’s a ton of unused bits in a NaN double. You can stuff a pointer for heap-allocated objects, with room left over for special values like true, false, and null. This means numbers, bools, and null are unboxed. It also means an entire value is only eight bytes, the native word size on 64-bit machines. Smaller = faster when you take into account CPU caching and the cost of passing values around. " -- https://wren.io/performance.html#a-compact-value-representation
that page links to http://wingolog.org/archives/2011/05/18/value-representation-in-javascript-implementations for further explanation, which explains that NaNs have 53 bits free, and x86-64 pointers have only 48 usable bits.
i don't want to hardwire it in, i just want to make it possible to do this.
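A minimal Python sketch of the NaN-tagging idea described above (the mask values follow the usual quiet-NaN bit layout; the function names are mine, not Wren's):

```python
import struct

QNAN = 0x7FFC_0000_0000_0000   # exponent all ones + quiet bit + one tag bit
SIGN = 0x8000_0000_0000_0000   # sign bit marks pointer values

def from_double(d):
    # a real double is stored as its own bit pattern, so arithmetic needs no unboxing
    return struct.unpack('<Q', struct.pack('<d', d))[0]

def to_double(bits):
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

def is_double(bits):
    # anything that doesn't have all the QNAN bits set is a plain double
    return (bits & QNAN) != QNAN

def from_pointer(addr):
    # x86-64 pointers use only 48 bits, so they fit inside the NaN payload
    assert addr < (1 << 48)
    return SIGN | QNAN | addr

def to_pointer(bits):
    return bits & ((1 << 48) - 1)
```

So every value is a uniform 8 bytes, and number arithmetic touches the raw double directly; only heap objects go behind a pointer.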
---
"What kind of CPU features required for operating system?
Privilege protections? Virtual address? Interrupt?
...
Based on this experience, I made the draft specifications of the interrupt and virtual address translation for our homebrew CPU. In order to keep it simple, we decided to omit hardware privilege mechanisms like Ring protection. ... I added interrupt simulation capability to our simulator which Wataru had made in the core part of CPU experiments, and also completed support for virtual address translation. This gave the simulator enough functionality to run the OS. ... When I ported Xv6 to MIPS, I had GDB, so it was rather OK, but our own simulator didn’t have any debug features, so it must have been very difficult to debug. Shohei couldn’t bear the difficulty of debugging, so he added a disassembler and a debug dump function to the simulator. After this, the simulator’s debugging features were rapidly upgraded by the OS team, and finally the simulator grew to look like the following picture. "
---
For call3, if there are 4 operands, then you have to pass them in eight registers (or maybe eight positions on the small stack), because for each operand you need to pass both the value and the address to allow for all the various addressing modes. Alternately, just the address of the operand is passed; if the caller provides an immediate value, then the implementation copies it into otherwise-inaccessible memory, and if the caller provides a register value, then the implementation copies it into otherwise-inaccessible memory and copies it back into the register at the end of the call (we don't have to worry about it being in the register in the middle of the call if something else is called, because instructions are "atomic").
so if you have two register banks and two small stacks, then you need four extra arguments to CALL to say how many things need to be saved, or, on the other hand, 4 extra arguments to ENTRY to say how many callee-saved things need to be saved. Or you could just say that SMALLSTACK is caller-saved and not provide facilities for callee-saving it, which means you'd only have two things to specify, and only upon ENTRY.
alternately, you could specify how many SMALLSTACK locations in each bank will be needed in ENTRY, and then a primitive could be provided to free that many locs in SMALLSTACK one way or another: maybe by popping stuff from the top of the stack, or maybe by spilling stuff from the bottom of the stack. the convention would be that you can't make any assumptions that you can access the caller's stack from the callee (negating the opportunity to use the stack to pass arguments through many levels of calls).
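A toy Python model of the spill-from-the-bottom reservation idea (the names `reserve` and `spilled` are illustrative, not part of any spec; a real implementation would spill to memory, not a Python list):

```python
class SmallStack:
    """Toy fixed-size SMALLSTACK whose ENTRY-style reservation frees
    slots by spilling from the bottom, keeping the top register-resident."""
    def __init__(self, capacity=32):
        self.capacity = capacity
        self.slots = []        # top of stack is the end of the list
        self.spilled = []      # memory backing store for spilled entries

    def push(self, v):
        if len(self.slots) == self.capacity:
            # stack full: spill the oldest (bottom) entry to memory
            self.spilled.append(self.slots.pop(0))
        self.slots.append(v)

    def reserve(self, n):
        """Guarantee n free slots, as an ENTRY with a declared
        SMALLSTACK requirement might, by spilling from the bottom."""
        while len(self.slots) > self.capacity - n:
            self.spilled.append(self.slots.pop(0))
```

Under this model the callee can never see the caller's entries once they've been spilled, which matches the no-assumptions convention above.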
---
some other ideas for SMALLSTACK:
---
Regarding the advantage of multiple stacks, the stack computers book just says "In the case where the parameter stack is separate from the return address stack, software may pass a set of parameters through several layers of subroutines with no overhead for recopying the data into new parameter lists.
An important advantage of having multiple stacks is one of speed. Multiple stacks allow access to multiple values within a clock cycle. As an example, a machine that has simultaneous access to both a data stack and a return address stack can perform subroutine calls and returns in parallel with data operations." [2]. Neither of those are too important to us (the first is provided by argument registers, and the second is lower-level than we care about).
---
reflecting on the previous section, i think that for us, the only important advantages of SMALLSTACK are:
---
more notes on how many registers we need/ how large smallstack should be
I keep coming back to register and stack sizes of 16. If we had two register banks and two stack banks, then if each of those is size 16, we have 64 locations total, which is exactly 1/4 of 256, allowing easy addressing of registers by the implementation if everything is 32 bits. If each register bank is size 16, that leaves 8 for caller-save registers and 8 for callee-save.
the RISC-V C extension has shortcuts for the eight most popular registers.
The RISC-V base has 32 int registers, and the E profile has 16.
---
"As a practical matter, a stack size of 32 will eliminate stack buffer overflows for almost all programs." -- [3]
---
"With how many local variables must a C compiler be able to deal? "
"The C standard (C99 and C11) states, in the Translation Limits section, that a compiler implementation must be able to handle at least “511 identifiers with block scope declared in one block.” (See C99 and C11, section 5.2.4.1. Previously, in C89/90, the limit was 127 identifiers, stated in section 5.2.4 Environmental Limits.) "
" Cliff Click , Wrote my first compiler at age 15, been writing them for 40 years Answered January 5, 2018 · Author has 100 answers and 99.4K answer views
Interesting answers below, but my experiences vary significantly. The standard limit is useless, real code that people really wrote and really want to compile routinely exceeds several hundred variables. No judgement from the compiler on the quality or maintainability of such code, we just did our best effort to make it work.
For machine generated code, the sky’s the limit. On at least one compiler I worked on, we raised the limit from 32K (16-bit signed short) to 64K (removed the sign extension issues) and finally switch to a 4-byte index to survive really abusive machine generated code.
Cliff "
" Ed Bell , C is my language of choice. Answered December 30, 2017 · Author has 4K answers and 1.3M answer views
I know of no limit intrinsic to the standard, but the other answers about the stack are spot on.
Years ago, when I was writing code for MS-DOS using Turbo C, my heavily recursive algorithm required I bump the STKLEN up to 2048 to eliminate stack overflows.
So, if there is a limit, there should be a workaround. "
---
so those make me think that maybe 8 callee-save registers in each bank just isn't enough. I think we really want at least 64 callee-save locations in total, so if SMALLSTACK isn't going to have any, that implies that each bank should have at least 32 callee-save locations, so 64 total locations per bank. So we'd be looking at 2 banks of 64 registers and 2 SMALLSTACKS of 64 each.
But now that sounds like overkill.
So maybe 2 banks of 32 registers each (that's no worse than RISC-V, which has 32 base regs total; here, each register type (int, ptr) would have 32, so in fact it's already more regs than RISC-V), and 2 SMALLSTACKS of 32 locs each. For a total of 128 locations. Now we have 16 callee-save locs per bank; and it takes only one byte in ENTRY to specify how many callee-save locs we need to spill across both banks.
This also has the advantage that it uses up the greatest number of registers possible while still leaving a power-of-two number of codes free for the implementation (assuming we only have one byte to store register specifiers). It also has the advantage that each bank of callee-save things is only size 16, so the number of callee-saves used in two such banks can be specified in one byte (via 2 4-bit fields).
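The sizing arithmetic above, written out as a checkable sketch (the packing function is illustrative; note that 4 bits only encode counts 0-15, so reserving all 16 callee-saves in a bank would need a convention such as "15 means all"):

```python
# the bank/stack sizing from the text, as checkable arithmetic
INT_REGS = PTR_REGS = 32          # 2 register banks of 32 each
INT_STACK = PTR_STACK = 32        # 2 SMALLSTACKs of 32 locs each
total = INT_REGS + PTR_REGS + INT_STACK + PTR_STACK
assert total == 128               # exactly half of the 256 one-byte codes

CALLEE_SAVE_PER_BANK = 16         # half of each register bank

def pack_callee_saves(n_int, n_ptr):
    """Pack the two callee-save counts into one ENTRY byte
    as two 4-bit fields (illustrative encoding)."""
    assert 0 <= n_int <= 15 and 0 <= n_ptr <= 15
    return (n_int << 4) | n_ptr
```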
---
And we could still decide to make half of each SMALLSTACK callee-save. That would give each subroutine a total of 32 callee-save locations of each type, handily beating RISC-V (which presumably has only ~16 callee-save total). What would that look like? The convention could be:
you could consider using two more bytes to specify the max number of caller-saves used, so that some implementations know how many registers really need to be allocated here; not sure that's worth it though.
for the caller-save portions, does the caller really need to clear the stack, or can they rely on the callee to do that only if they really need it? the latter only makes a lot of sense if there is a quick 'wipe stack' instruction.
however one issue with this is that you might not want the callee-saves on each SMALLSTACK to be beneath the temporaries. Maybe you'd want 4 SMALLSTACKS (2 for each reg bank). This actually isn't too crazy on the addr modes; in SMALLSTACK addr mode you only need an offset of 16, so with one-byte operands you have 4 bits free to choose among 4 SMALLSTACKS (and you only really need 2, since you have different instructions for each bank; mb we can have 2 MEMSTACKS also (actually for the MEMSTACKS, since they mix both ints and ptrs, you probably really want at least 2 4-bit fields so that you can indicate up to 16 ints and up to 16 ptrs to skip over; which means you do want a separate instruction for those)).
---
"While 16-bit instructions may seem wastefully large, the selection of a fixed length instruction simplifies hardware for decoding, and allows a subroutine call to be encoded in the same length word as other instructions. A simple strategy for encoding a subroutine call is to simply set the highest bit to 0 for a subroutine call (giving a 15 bit address field) or 1 for an opcode (giving a 15 bit unencoded instruction field" https://users.ece.cmu.edu/~koopman/stack_computers/sec3_2.html
---
it's kinda cool to have 2 SMALLSTACKS of each type, but otoh it might make sense to have just 1 SMALLSTACK of each type and have a convention of having half of it free upon calling. One of the points of a stack is that you don't have to save and restore registers because 'your' registers are numbered from the TOS at the time you are passed control.
note that this suggests that even when accessing the deep stack like registers (e.g. read and overwrite instead of POP and PUSH), the items should still be indexed according to their distance from TOS.
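A tiny model of that convention: deep accesses are always indexed by distance from TOS, so a callee's index 0 is simply whatever was on top when it got control, with no renumbering needed (names here are illustrative):

```python
class TosIndexedStack:
    """Stack with register-like deep access, always indexed from TOS."""
    def __init__(self):
        self.items = []

    def push(self, v):
        self.items.append(v)

    def pop(self):
        return self.items.pop()

    def read(self, depth):
        # depth 0 is top-of-stack, depth 1 the slot below it, etc.
        return self.items[-1 - depth]

    def overwrite(self, depth, v):
        self.items[-1 - depth] = v
```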
---
oo, just realized! fitting all these things in one byte allows you to access both registers and SMALLSTACKs in ANY addressing mode, rather than devoting extra addressing modes to them. You can even fit both kinds of SMALLSTACK access! e.g.:
(and we still have 63 spaces left; maybe these could be left to the implementation, and/or maybe some of these could be positions on MEMSTACK!)
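One possible carve-up of the operand byte along these lines (purely an assumed layout for illustration; this particular split leaves 64 codes free rather than the 63 mentioned above, so it is not exactly the layout intended):

```python
# illustrative one-byte operand specifier layout (an assumption, not a spec):
#   0x00-0x3F : registers                          (2 banks x 32)
#   0x40-0x7F : SMALLSTACK read/overwrite at depth (2 stacks x 32 depths)
#   0x80-0xBF : SMALLSTACK pop/push-style access   (2 stacks x 32 depths)
#   0xC0-0xFF : free for the implementation and/or MEMSTACK positions

def decode(byte):
    assert 0 <= byte <= 255
    kind = ('reg', 'stack_rw', 'stack_pushpop', 'free')[byte >> 6]
    bank = (byte >> 5) & 1      # which of the 2 banks / 2 stacks
    index = byte & 0x1F         # register number or depth from TOS
    return kind, bank, index
```

The point is that one byte covers every register and both kinds of SMALLSTACK access in any addressing mode, with no dedicated addressing modes spent on them.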
---
so it's looking like ENTRY has 3 operands:
maybe the 4th could be int/ptr MEMSTACK space needed..
and btw i realized: to access a position on MEMSTACK, you don't need to sum the int spaces to skip over and the ptr spaces to skip over, if the implementation can keep track of that (and probably the implementation will keep all the ints together and all the ptrs together); then you only need to specify int vs ptr (1 bit), plus an index as to which int/ptr.
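A sketch of that one-bit-plus-index scheme. The frame layout (all ints first, then all ptrs), the 32-bit int and 64-bit pointer sizes, and the 4-bit index width are all assumptions for illustration:

```python
def memstack_operand(is_ptr, index):
    """Encode a MEMSTACK slot as a type bit plus an index;
    the implementation maps this to a real frame offset."""
    assert 0 <= index <= 15
    return (int(is_ptr) << 4) | index

def memstack_offset(spec, n_ints, word=4, ptr_size=8):
    # assumed frame layout: all ints first, then all ptrs
    is_ptr, index = spec >> 4, spec & 0xF
    if is_ptr:
        return n_ints * word + index * ptr_size
    return index * word
```

The caller never computes byte offsets itself; only the implementation needs to know how many ints precede the ptrs.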
---
mb we don't need to provide a stack depth primitive as long as we provide:
this would allow some implementations to eg use a circular buffer without tracking stack depth (when they spill they might have to assume that the entire circular buffer is in use tho).
if a non-circular buffer, the stack ptr and hence stack depth could be discovered from inspecting the in-memory copy of the stack (which might have a defined representation?); but if a circular buffer, it would always appear to be at the maximal depth
or, to help in not over-allocating memory when copying stack to memory (assuming the user pre-allocates and passes a pointer to the empty space), we could have a stack depth primitive, but make it optional; it's okay for the implementation to always return the max (which is 32 items)
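A toy circular-buffer implementation of the permissive variant: it never tracks true depth, and its optional depth primitive just reports the conservative maximum, as allowed above (illustrative only):

```python
class CircularSmallStack:
    """Circular-buffer SMALLSTACK that never tracks true depth;
    its optional depth() conservatively reports the maximum."""
    MAX = 32

    def __init__(self):
        self.buf = [0] * self.MAX
        self.top = 0            # next write position; depth is NOT tracked

    def push(self, v):
        self.buf[self.top] = v
        self.top = (self.top + 1) % self.MAX  # silently wraps; oldest entry lost

    def pop(self):
        self.top = (self.top - 1) % self.MAX
        return self.buf[self.top]

    def depth(self):
        return self.MAX         # always the conservative maximum
```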
---
https://keleshev.com/ldm-my-favorite-arm-instruction/
ldm is 'load multiple'. There is also stm, 'store multiple'. push and pop also have 'multiple' forms. These can all do any subset of the 16 registers.
arm64 only allows dealing with pairs of registers.
---
"
Paged Segments
The memory model introduced by the 386 is wonderfully Byzantine. It supports not just paging, but segmentation. If you've read an operating systems textbook, you'll know that operating systems prefer paging, because it makes fragmentation of physical memory much rarer. Programmers, however, prefer segments.
For example, consider the mprotect() system call, which alters the permissions on a range of memory. When you call it on a system using paging, the permissions must be set on a page boundary, typically 4KB, but sometimes much larger—64MB on some mainframes. If you're protecting a small structure, the side effect is that you also change a lot of unrelated memory. This situation is particularly bad in a language like C, where malloc() may service two requests from the same page, preventing you from setting different permissions on them.
In contrast, segments are variable-sized. You can set permissions on any arbitrary block of memory with x86 by using the segmentation mechanism. The aim was for programmers to use segments and not care about pages, while operating systems would use pages but not care about segments. The segments map from a segment address to a linear address, which is then mapped to a physical address by the paging mechanism.
With an object-oriented language, you might use a new segment for each object, using the segment ID instead of object pointers, and then have automatic bounds checking on every instance variable access and every array access.
Even better, you could implement a copying garbage collector trivially, by marking the segment as no-access during the copy and then waiting on a lock in the segmentation-violation signal handler. Because all accesses would be segment-relative, you would never have to worry about inner pointers—every access would be a segment ID and an offset within that segment.
Unfortunately, this plan didn't work out. C didn't allow segments, so only systems like iRMX, written in PL/M, could use them. More seriously, you were limited to 8192 segments per process, with another 8192 global segments. This number is too low for even a relatively simple object-oriented system. Rather than increasing this limit with x86-64, AMD simply removed the segmentation system.
The most frustrating thing about the segmentation limit is that Intel produced another chip, the iAPX 432, which introduced the memory model found on modern IA32 chips. Released in 1981, it was only three years after the 8086 and four years before the 80386. It had a similar segmentation system to the previous chip, but provided 24-bit segment identifiers, giving almost 17 million segments per program. More than enough for a lot of programs—certainly more than enough in the 1980s, although that number might be in need of some expansion by now. " -- https://www.informit.com/articles/article.aspx?p=1676714&seqNum=2
---
toread: https://www.informit.com/articles/article.aspx?p=1077906
---
" RISC-V SIMD, as opposed to classic SIMD, is really something to be excited about. " -- comment on https://www.anandtech.com/Show/Index/15036?cPage=5&all=False&sort=0&page=1&slug=sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip
" High-performance computing has been largely taken over by GPUs, which are in essence super-wide SIMD machines, using predicate vectors for much of its flow control. (Predicates being only late additions to SSE and Neon) The scalable vector proposal for RISC-V is by some considered so promising that there have been even been talks about building GPUs based around the RISC-V SIMD ISA -- optimised for SIMD first and general-compute second. " -- comment on https://www.anandtech.com/Show/Index/15036?cPage=5&all=False&sort=0&page=1&slug=sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip
---
A negative comment on RISC-V, from https://www.anandtech.com/Show/Index/15036?cPage=5&all=False&sort=0&page=1&slug=sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip:
Wilco1 - Wednesday, October 30, 2019 - link ... My point is it's the same people repeating the exact same mistakes. It has the same issues as MIPS like no register offset addressing or base with update. Some things are worse, for example branch ranges and immediate ranges are smaller than MIPS. That's what you get when you're stuck in the 80's dogma of making decode as simple as possible...
A comment on what ARMv8 has changed that is good, from https://www.anandtech.com/Show/Index/15036?cPage=5&all=False&sort=0&page=1&slug=sifive-announces-first-riscv-ooo-cpu-core-the-u8series-processor-ip: ... There is a HUGE amount of learning that informed ARMv8, from the dropping of predication and shifting everywhere, to the way constants are encoded, to high-impact ideas like load/store pair and their particular version of conditional selection, to the codification of the memory ordering rules. Look at SVE as the newest version of something very different from what they were doing earlier.
---
so one of those comments complains about RISC-V's branch offset ranges and immediate ranges, compared to MIPS. How large are those exactly?
According to https://en.wikipedia.org/wiki/MIPS_architecture ,