Table of Contents for Programming Languages: a survey


" Many of the designs for these stack computers have their roots in the Forth programming language. This is because Forth forms both a high level and assembly language for a stack machine that has two hardware stacks: one for expression evaluation/parameter passing, and one for return addresses. In a sense, the Forth language actually defines a stack based computer architecture which is emulated by the host processor while executing Forth programs. The similarities between this language and the hardware designs is not an accident. Members of the current generation of stack machines have without exception been designed and promoted by people with Forth programming backgrounds. " -- (1989)

" Stack machines of all kinds may be classified by a taxonomy based upon the number of stacks, the size of the dedicated stack buffer memory, and the number of operands in the instruction format. The stack machines featured in this book are those with multiple stacks and 0-operand addressing. The size of the stack buffer memory is a design tradeoff between system cost and operating speed. For the bulk of this volume, "stack machines" refers to these particular machines. " --

" Historically, computer designs that promise a great deal of support for high level language processing have offered the most hardware stack support. This support ranges from a stack pointer register to multiple hardware stack memories within the central processing unit. Two recent classes of processors have provided renewed interest in hardware stack support: RISC processors, which frequently feature a large register file arranged as a stack, and stack oriented real time control processors, which use stack instructions to reduce program size and processor complexity. " -- (1989)

" A disadvantage of a single stack is that parameter and return address information are forced to become mutually well nested. This imposes an overhead if modular software design techniques force elements of a parameter list to be propagated through multiple layers of software interfaces, repeatedly being copied into new activation records.

Multiple Stack computers have two or more stacks supported by the instruction set. One stack is usually intended to store return addresses, the other stack is for expression evaluation and subroutine parameter passing. Multiple stacks allow separating control flow information from data operands.

In the case where the parameter stack is separate from the return address stack, software may pass a set of parameters through several layers of subroutines with no overhead for recopying the data into new parameter lists.

An important advantage of having multiple stacks is one of speed. Multiple stacks allow access to multiple values within a clock cycle. As an example, a machine that has simultaneous access to both a data stack and a return address stack can perform subroutine calls and returns in parallel with data operations. " --

" To be competitive in speed, a stack machine must have at least one or two stack elements buffered inside the processor. To see the reason for this, consider an addition operation on a machine with unbuffered stacks. A single instruction fetch for the addition would generate three more memory cycles to fetch both operands and store the result. With two elements in a stack buffer, only one additional memory cycle is generated by an addition. This memory cycle is used to fetch the new second-from-top stack element, filling the hole created by the addition's consumption of a stack argument. " --

" A small stack buffer with primary stacks residing in program memory allows quick switching between stacks for different tasks since the stack elements are predominately memory resident at all times.

The fact that a small dedicated stack buffer is simple to implement and easy to manage makes it very popular. In particular, the fact that most stack elements reside in main memory makes managing pointers, strings, and other data structures quite easy. The disadvantage of this approach is that significant main memory bandwidth is consumed to read and write stack elements. " --

" If an architecture has a large enough stack buffer that main memory bandwidth is usually not consumed to access stack elements, then the architecture has a Large Stack Buffer. This large buffer may take one of several forms. It may be a large set of registers accessed using a register window scheme such as that used by the RISC I processor (Sequin & Patterson 1982), a separate memory unit that is isolated from program memory, or a dedicated stack memory cache in the processor (Ditzel & McLellan? 1982). In any event, the stack buffer is considered "large" if several levels of subroutines (say, 5 or more) may be processed without exhausting the capacity of the stack memory. In the case of a stack that is only used as an expression evaluation stack, "large" may only be approximately 16 elements, since single expressions seldom nest very deeply (Haley 1962). In Chapter 6, we shall examine some program execution statistics that will give more insight into how large is large enough.

An advantage of a large stack buffer is that program memory cycles are not consumed while accessing data elements and subroutine return addresses. This can lead to significant speedups, particularly in subroutine intensive environments.

A disadvantage of a separate stack memory unit is that it may not be large enough for all applications. In this case a spilling of data into program memory to make room for new stack entries may be required. Also, saving the entire stack buffer when switching between tasks in a multitasking environment may impose an unacceptably large context switching overhead, although it should be noted that this can be solved by dividing the stack memory into separate areas for separate tasks. At a lower level, separate data buses for off-chip stack memories and program memory will add pins and expense to a microprocessor. " --

" There are several advantages to the simplicity of 0-operand instructions. One is that only the top one or two stack locations can be referenced by an instruction. This can simplify construction of the stack memory by allowing the use of a single ported memory with one or two top-of-stack registers. A speed advantage may also be gained by loading the operand registers in parallel with instruction decoding, since the operands for each instruction are known in advance to be the top stack elements. This can completely eliminate the need for pipelining to fetch and store operands.

Another advantage is that individual instructions can be extremely compact, with an 8 bit instruction format sufficing for 256 different opcodes. Furthermore, instruction decoding is simplified, since no operand addressing modes need be interpreted by the decoding hardware.

A disadvantage to the 0-operand addressing mode is that complex addressing modes for data structure accessing may take several instructions to synthesize. Also, data elements that are deeply buried on the stack can be difficult to access if provisions are not made for copying the Nth-deep data stack element to the top of the stack. ... 2-operand machines offer a maximum of flexibility, but require more complicated hardware to perform efficiently. Since no operands are known before an instruction is decoded, a data pipeline and dual ported register file must be used to supply operands to the execution unit. " --

" In order to convey the category for a particular architecture, we shall use a three character shorthand notation based on the three axes of classification. The first letter of the abbreviation specifies the number of stacks (Single or Multiple). The second letter of the abbreviation specifies the size of dedicated stack memory (Small or Large). The third letter of the abbreviation specifies the number of operands in the instruction format (0, 1, or 2). Thus, the abbreviation SS0 would signify an architecture with a single stack, small dedicated stack memory, and 0-operand addressing and the abbreviation ML2 would signify an architecture with multiple stacks, large dedicated stack memory, and 2-operand addressing. ... 2.3 INTERESTING POINTS IN THE TAXONOMY.

Perhaps the most surprising feature of the taxonomy space is that all twelve categories are populated by designs. This indicates that a significant amount of research on diverse stack architectures has been performed. Another observation is that different machine types tend to group along the operand axis as the major design parameter, with the number and size of stack buffers creating distinctions within each grouping.

The taxonomy categories with 0-operand addressing are "pure" stack machines. Unsurprisingly, these categories have the most academic and conceptual design entries, since they include the canonical stack machine forms. Because of its inherent simplicity, the SS0 machine category is populated by designs constrained by scarce hardware resources, limited design budget, or both. Designs in the SS0 category may have efficiency problems if return addresses and data elements are intermingled on the stack and an efficient deep stack element copy operation is not provided.

The SL0 category seems to only be applicable to special purpose machines used for combinator graph reduction (a technique used to execute functional programming languages (Jones 1987)). Graph reduction requires doing a tree traversal to evaluate the program, using a stack to store node addresses while performing the traversal. No expression evaluation stack is required, since the results are stored in the tree memory itself.

The MS0 and ML0 categories have similarly designed machines, distinguished mainly by the amount of chip or board area spent for buffering stack elements. All the Forth language machines and many of the other high level language machines fall into these categories. Machines in these categories are finding increasing use as real time embedded control processors because of their simplicity, high processing speed, and small program sizes (Danile & Malinowski 1987, Fraeman et al. 1986). Many MS0 and ML0 designs allow for very fast or even zero-cycle subroutine calls and returns.

The entries with 1-operand addressing are designs that attempt to break bottlenecks that may arise in 0-operand designs by altering the pure stack model into a stack/accumulator design. The SS1 designs can use an address to access local variables and frame pointers more easily than SS0 designs. In general, the perceived advantage of a 1-operand design is that a push operation can be combined with an arithmetic operation, saving instructions in some circumstances. Additionally, the Pascal and Modula-2 machines use 1-operand addressing because of the nature of P-code and M-code.

The entries with 2-operand addressing tend to be more mainstream designs. Conventional microprocessors fall into the SS2 category. The RISC designs fall into the SL2 category because of their register windows, and no other designs fall into this category. The MS2 categorization for the 680x0 family reflects the flexibility of that machine, which can use any one of its eight address registers as a stack pointer. The ML2 entry for the PSP machine reflects an attempt in a conceptual design to carry the register window to an extreme for speeding up subroutine calls. The SF1 machine also uses multiple stacks, but dedicates a hardware stack to each active process in a real time control environment.

From the preceding discussion, we see that designs can be found that fall into all twelve categories of the taxonomy. Designs within each group of the taxonomy display strong similarities, while designs in different groups can be shown to have differences that affect implementation and system operation. Thus, the taxonomy is a useful tool for gaining insight into the nature of stack oriented computers.

In the next chapter, we shall focus on a certain sector of the stack machine design space: MS0 and ML0 machines. In all future references, we shall mean either an MS0 or ML0 machine when we use the term "stack machine." " -- ,


Multiple stack, 0-operand machines have two inherent advantages over other machines: 0-operand addressing leads to a small instruction size, and multiple stacks allow concurrent subroutine calls and data manipulations. " --

" The amount of stack memory needed by most programs is typically rather small. Furthermore, it can be guaranteed by design to be small in short, time-critical processes. So, even a modest stack buffer of 128 elements can be divided up among four processes with 32 elements each. If more than four processes are needed by the multi-tasking system, one of the buffers can be designated the low priority scratch buffer, which is to be shared using copy-in and copy-out among all the low priority tasks. " --

" as processors get faster, the question is: how much stack memory needs to be placed on-chip to obtain good performance? The answer to this question is crucial, since it affects the cost and performance of high-end stack processors that place stack memory on-chip. ...The first question, the one of the size of the on-chip stack buffer, is best resolved by a simulation of various programs with different size stack buffers... As a practical matter, a stack size of 32 will eliminate stack buffer overflows for almost all programs. " --

" The stack machine design philosophy falls somewhere in between the philosophies of high level language machine design and RISC design. Stack machines provide very simple primitives which may be executed in a single memory cycle, in the spirit of the RISC philosophy. However, efficient programs on stack machines make extensive use of application specific code that is accessed via cheap subroutine calls. This collection of subroutines may be thought of as a virtual instruction set that is tailored to the needs of high level language compilers, without requiring complicated hardware support. " --

" The problem in generating code for stack machines from C is that there are several assumptions about the operating environment deeply entrenched in the language. The most profound of these is that there must be a single program-memory-resident stack that contains both data and subroutine return information. This assumption can not be violated without "breaking" many C programs. As an example, consider the case of a pointer that references a local variable. That local variable must reside in the program memory space, or it can not be properly referenced by the program.

To make matters worse, C programs typically push large volumes of data, including strings and data structures, onto the C-stack. Then, C programs make arbitrary accesses within the area of the current stack frame (the portion of the stack containing variables belonging to the current procedure). These restrictions make it unfeasible to attempt to keep the C stack on a stack machine's Data Stack.

How then can a stack machine be made efficient at running C programs? The answer is that the stack machine must efficiently support frame-pointer-plus-offset addressing into program memory.


The other notion often found in C, and other high level languages, that does not map well onto stack machines is that of a "register variable." Since stack machines do not have a set of registers, this implies that compiler optimization opportunities may be missed by stack machines. This is only partially true. While it is true that stack machines are not well suited for juggling a large number of temporary values on the stacks, a small number of frequently accessed values can be kept on the stack for quick reference. For example, these values might include a loop counter kept on the return stack and two addresses for a string compare kept on the data stack. In this manner, most of the efficiency of the hardware can be captured for the majority of C programs. " --

" There is evidence that programming languages used to implement rule-based systems, such as those written in Prolog, LISP, and OPS-5 are very well suited to stack machines. One very exciting possibility is the marriage of real time control applications with rule-based system knowledge bases. Preliminary research into this area has been encouraging. ... This speedup of nearly three orders of magnitude is astonishing to some, but merely reflects the suitability of using a stack machine, which is good at tree traversal, for solving problems that use decision trees.

The speedup observed for the rule-based system is actually based on a principle that applies to a wide variety of problem areas. Stack machines can treat a data structure as an executable program. Consider for a moment an example of a tree data structure, with pointers at the internal nodes and program action tokens at the leaves. The nodes of the trees that are pointers can just be the addresses of the children, which equates to subroutine calls in many stack processors. The leaves of the trees can be executable instructions, or subroutine calls to procedures that accomplish some task. A conventional processor would have to use an interpreter to traverse through the tree in search of the leaves. A stack processor can just directly execute the tree instead. Since stack machines execute subroutine calls very quickly, the results can be extremely efficient. The technique of directly executing tree-formatted data structures is responsible for the tremendous speed of the RTX 32P example cited in the previous paragraph. " --

" Stack machines are well suited to LISP programming as well as to expert systems. This is because LISP and Forth are very similar languages in many respects. Both treat programs as lists of function calls to other lists. Both are extensible languages. Both use Polish notation for arithmetic operations. The major difference is that LISP involves dynamic storage allocation for its cells, while Forth uses a static storage allocation. Since there is no reason that a stack machine should be any worse at garbage collection than other machines, LISP should run efficiently on a stack machine.

Many of the same arguments about stack machines' suitability for LISP apply to Prolog. In a Prolog implementation for the RTX 32P, this author made an additional discovery about how to efficiently map Prolog onto stack machines. Prolog uses typed data that can be either an actual data element or a pointer to other data. A possible encoding for Prolog data elements is one that uses the highest 9 bits of a 32-bit word for a data type tag. The lowest 23 bits are then used as either a pointer to another node, a pointer to a 32 bit literal value, or a short literal value. Using this data format, data items are actually executable as instructions. Instructions for the RTX 32P can be constructed that allow traversing an arbitrarily long series of pointer dereferences at the rate of one dereference per memory cycle, simply by executing the data structure as a program. Nil pointer checking can be accomplished by defining the nil pointer value to be a subroutine call to an error trapping routine. These kinds of data handling efficiencies are simply not possible with other types of processors.

Functional programming languages offer the promise of a new way of solving problems using a different model of computation than that used by conventional computers (Backus 1978). A particular method of executing functional programs is the use of graph reduction. The same techniques of direct execution of the program graphs that were discussed for rule-based systems above are equally applicable to graph reduction. Thus, stack machines should be good at executing functional programming languages. Belinfante (1987) has published a Forth-based implementation of graph reduction. Koopman & Lee (1989) describe a threaded interpretive graph reduction engine.

From a theoretical point of view, efficient graph reduction machines such as the G-machine and Norma fall into

the SL0 category of the taxonomy in Chapter 2. ML0 machines have a superset of the capabilities of SL0 machines, and should therefore be efficient at graph reduction as well. Initial investigations into this area by this author show that the RTX 32P, which is a very simple stack machine, can compete quite effectively with even very complex graph reduction machines such as Norma.

One of the side effects of using a functional programming language is that a high degree of parallelism is available during program execution. This raises the idea of a massively parallel computer made of stack processors that is programmed with a functional programming language. " --


A key conceptual feature of stack machines is their uniformity of interface between high level code and machine instructions. Both procedure calls and opcodes use the stack as a means of passing data. This consistent interface has several positive impacts on software development.

The source code for a program does not have to reflect in any manner which instructions are directly supported by the machine and which functions are implemented as procedures at the Forth language level. This capability suggests the use of a low-level stack language, similar to Forth, that is the target of compilation for all languages. Given an assembler from this target language to the actual machine, the user is freed from worry about how particular functions are implemented. This means that various implementations of the same architecture can be made very compatible at the stack-based source code level without actually having to provide all instructions on low-cost implementations. If the same interface is used for conventional languages as well as Forth, then combinations of C code, Forth code, and code from other languages can be intermingled without problems.

In microcoded machines, such as the RTX 32P, this interface can be exploited one step further. Application specific microcode can be used to replace critical sequences of instructions in the application program, and common compiler generated code sequences transparently to the user. In fact, a common method of application code development on a microcoded stack machine is to first write the entire application in high level code, then go back to microcode the critical loops. The rewriting of subroutines from high level language to microcode is invisible to the rest of the program, except for the speed increase. This speed increase is a speedup factor of approximately two for many applications. " --

" Conventional languages can be implemented very easily on stack machines. The only problem is that pure stack machines probably can not perform quite as well as register machines when running conventional programs written in the normal programming style. This problem is mostly one of a mismatch between stack machine capabilities and the requirements of conventional languages. Conventional programs tend to use few procedure calls and large numbers of local variables. Stack machines tend to be good at running programs with many small procedures and few local variables. In part, the difference is due to the programming styles encouraged by common practice and the structure of conventional programming languages. To some extent, the difference is that register machines are well suited to general purpose data processing, whereas stack machines perform best in a real time control environment. " --

" 9.1.1 Stack frames

The issue, then, is to identify high level language structures that require additional hardware support. Strangely, the high level language run-time stack is the only major area in which pure stack machines fail to support conventional high level languages. This is because high level languages have the notion of "activation records," which are "frames" of elements pushed onto a software-managed stack in program memory upon every subroutine call. In the usual implementation, each stack frame is allocated only once as a large chunk of memory in the preamble to the subroutine being called. This frame contains the input parameters (which were actually allocated by the calling routine), the user declared local variables, and any compiler generated intermediate variables that might be necessary. During the course of the subroutine, arbitrary accesses are made within the stack frame without actually performing any pushes or pops.

The stack frame is used as a temporary memory allocation device for subroutine calling, not as a traditional pushdown stack for storing intermediate results between successive calculations. This means that it is incompatible with hardware stacks built into stack machines such as those we have been studying. An obvious approach to modify stack machines to meet this requirement is to build the primary stack so as to be allocated in large chunks with random access within a frame. This is precisely how the RISC machines with register windows (described as SL2 machines in Chapter 2) solve the problem. The reason why this does not make sense for stack machines is that all accesses to data pay the penalties associated with having operands in the instruction format.

An alternative approach is to build a secondary hardware stack for accessing local variables at a "slow" speed, with primary data manipulations done on a LIFO hardware data stack. This lets us have the best of both worlds, but is not without cost. That secondary register frame stack will in general have to be 5 to 10 times as big as the data stack for good operating characteristics.

9.1.2 Aliasing of registers and memory

If this were the end of the tradeoff discussion, we might still be tempted to build a chip with on-chip frames. But, there is a deciding factor which tilts the balance in favor of placing the stack frames in program memory. That additional factor is that the semantics of conventional languages allow access to these local variables by memory address. The C language is notorious for this problem, which affects register machines and stack machines alike.


It is indeed tempting to write complex compilers in an attempt to keep most local variables on the hardware stack while executing conventional languages. An experiment by this author with some stack machines has shown, however, that the difference between a C compiler that keeps all local variables on the hardware data stack and one that uses program memory with a frame pointer is small. In fact, if the machine has a hardware frame pointer, keeping the frames in program memory is actually somewhat faster.

Normally, one would think that keeping local variables on the hardware data stack would be faster than placing them in memory. The reason this is not so is because stack machines are relatively inefficient at accessing deeply buried elements, especially if they may have been spilled out of the stack buffer and into program memory. Much of the time in the all-on-hardware-stack approach is spent thrashing elements about on the stack. While access to memory-resident frame elements is somewhat slower than access to data on the hardware stack, in the end the savings on stack manipulations make up the difference.

9.1.3 A strategy for handling stack frames

A good compromise approach for handling frames is a mixture of the program memory resident and hardware data stack resident methods. The details of the approach are as follows. All procedure calls place the parameters to be passed on the hardware stack. Since these parameters must be computed or moved from one stack frame location to another using the hardware data stack anyway, no time is lost doing this. The procedure call is then made. The called procedure then copies all but one or two of the parameters from the hardware stack into its newly allocated frame as local variables. The compiler must be sure that the parameters left on the data stack are not referenced by address. This can be accomplished with register variable declarations by the programmer, compiler analysis, or just playing it safe and copying all parameters to memory. Having only one or two parameters on the data stack minimizes stack spills and still provides good efficiency gains. When the procedure terminates, the returned value is placed on the hardware stack.

The compromise approach gives good efficiency by saving a large number of references to memory. It also provides an excellent interface between different languages, such as Forth and C. It is relatively easy to write compilers to handle this approach as well. Also, many stack machines now in existence have a hardware frame pointer, so can easily support this scheme.

" --

" 9.1.4 Conventional language execution efficiency

From this discussion, we can see that stack machines can be reasonably efficient at supporting conventional programming languages with their stack frame approach. Of course, stack machines that use a hardware frame pointer into program memory cannot be expected to be as efficient as RISC machines, since they have massive on-chip register windows for direct frame support or exceedingly clever optimizing compilers that do sophisticated global register allocation.

To the extent that programs in conventional languages conform to the models used by high performance register machines, stack machines will seem to perform poorly. Aspects of conventional programs that cause this effect include: large segments of straight-line code, near-100% cache hit ratios, large numbers of local variables used across long procedures, and shallowly nested procedures that can be compiled as in-line code.

To the extent that programs use structures which run efficiently on stack machines, stack machines can approach or even exceed the performance of register-based machines. Code which is run well by stack machines contains: highly modular procedures with many levels of nesting and perhaps recursion; a relatively small number of frequently used subroutines and inner loops that may be placed in fast memory, and perhaps provided with microcode support; small numbers of local variables passed down through many layers of interface routines; and deeply nested subroutine calls. Also, programs that operate in environments with frequent interrupts and context switches can benefit from using stack machines.


One should also keep in mind that there are reasons to select a computer other than raw processing speed for single programs. The reasons that might tilt the balance in favor of stack machines include: interrupt processing speed, task switching speed, low overall system complexity, and the need for application specific microcode and/or hardware support. In the final analysis, stack machines will probably not run many conventional language programs quite as quickly as register-based machines. But, other considerations will largely cancel out the drawbacks, especially for real time control applications, making stack machines an excellent alternative. "



An often proposed design alternative for stack machines is the use of a third hardware stack. The purposes given for adding a third hardware stack are usually for storage of loop counters and local variables.

Loop counters on the current stack machines are generally kept as the top element of the return address stack. This is because subroutines and loops are mutually well nested, and it is considered bad programming style for a subroutine to attempt to access the loop index of its parent procedure. So, while there is some conceptual merit to having loop indices in their own stack to avoid cluttering the return stack with non-address data, the performance and program effectiveness gains are not sufficient to justify the hardware expense.

Local variable storage is another issue. Even when using the Forth language, programmers have found that the concept of compiler-managed local variables can make some programs easier to create and maintain. In order to do this efficiently, the hardware needs access to a stack that is allocated in frames, with random access to locations with in the frame. This is a requirement that is very like that for supporting conventional languages. So, in fact, the best solution is probably not to have a third hardware stack at all. Rather, stack machines should support a frame pointer into a software-managed program memory stack that is used both for conventional language support and for support for local variables in Forth. " --



There are two interesting stack machine design details that are not in common usage, but which may prove useful in future designs.

9.5.1 Conditional subroutine returns

One of the design details is an observation by Doran (1972) that stack machines do not need conditional branches, they only need conditional subroutine return instructions. Consider an IF statement in a high level language. If we ignore the optional ELSE clause, an IF statement appears to be a piece of code with one entry point and two exit points. The entry point is the beginning of the statement, where the branch condition is evaluate. The first exit point is if the condition is false, in which case none of the rest of the statement is executed. The second exit point is the end of the statement, when all actions have been completed.

The usual way of implementing an IF statement is by using a conditional branch that is taken if the condition tested by the IF is false. Instead, consider a subroutine that contains the code for the entire IF statement. The entry point to the IF statement is a subroutine call into this special subroutine. The first exit point can be a conditional subroutine return instruction, that only returns if the condition clause of the IF statement is false. The second exit point can be an unconditional return.

What this scheme accomplishes is elimination of conditional branches with embedded addresses. All that is required is a conditional subroutine return statement. This technique is well suited to stack machines, because of the low cost of the initial subroutine call into the IF statement's subroutine and the low cost of the subroutine exits. It may lead to more efficient machines than those currently in use.

9.5.2 Use of the stack for holding code

Another interesting proposal for stack machine program execution was put forth by Tsukamoto (1977). He examined the conflicting virtues and pitfalls of self-modifying code. While self-modifying code can be very efficient, it is almost universally shunned by software professionals as being too risky. Self-modifying code corrupts the contents of a program, so that the programmer cannot count on an instruction generated by the compiler or assembler being correct during the full course of a program run.

Tsukamoto's idea allows the use of self-modifying code without the pitfalls. He simply suggests using the run-time stack to store modified program segments for execution. Code can be generated by the application program and executed at run-time, yet does not corrupt the program memory. When the code has been executed, it can be thrown away by simply popping the stack.

Neither of these techniques is in common use today, but either one or both of them may eventually find an important application. " --

" Second-Generation Stack Computer Architecture

Charles Eric LaForest?

Submitted for the degree of Bachelor of Independent Studies April 2007


It is commonly held in current computer architecture litera ture that stack-based computers were entirely superseded by the combination of pipelined, integ rated microprocessors and improved compilers. While correct, the literature omits a second, ne w generation of stack computers that emerged at the same time. In this thesis, I develop histo rical, qualitative, and quantitative distinctions between the first and second generations of sta ck computers. I present a rebuttal of the main arguments against stack computers and show that t hey are not applicable to those of the second generation. I also present an example of a small , modern stack computer and compare it to the MIPS architecture. The results show that se cond-generation stack computers have much better performance for deeply nested or recursive code, but are correspondingly worse for iterative code. The results also show that even tho ugh the stack computer’s zero- operand instruction format only moderately increases the c ode density, it significantly reduces instruction memory bandwidth " --

" The main problem in reasoning about stack computers is that t hose usually mentioned in the computer architecture literature, the first generation, ha ve been superseded. They were popular due to their efficient use of hardware, their natural applica tion towards algebraic computation, and the ease with which they could be programmed. Although so phisticated, they were eventu- ally flatly outperformed by the new VLSI microprocessors tha t came out of the Berkeley RISC [PS81] and Stanford MIPS [HJP + 82] projects. The first generation of stack computers can be defined by its support for High-Level Language, mainly ALGOL . This required in-memory stacks primarily used to allocate storage for nested proced ures, and encouraged a large instruc- tion set which attempted to match the semantics of ALGOL clos ely. The goal was to make programming easier in the absence of good compilers.

The second generation of stack computers arose just as the fir st faded away. These comput- ers had simple hardware, high code density, fast subroutine linkage and interrupt response, but garnered little attention since they were aimed at embedded systems instead of general-purpose computing. This separated the second generation from the ma instream of processor design and caused it to become confused with the first generation, furth er discouraging work. The second generation of stack computers can be defined by its support fo r the Forth programming lan- guage. It defines two stacks, Data and Return, which are separ ate from main memory and not randomly addressable. The Data Stack is used to evaluate exp ressions and pass values between procedures in a functional manner. The Return Stack holds th e return addresses of subroutines and can also act as temporary storage. The small instruction set is mostly composed of Forth primitives, which are simple stack manipulations, loads an d stores, and basic arithmetic and logical functions similar to those found in conventional re gister-based computers. The purpose of this thesis is to argue for a distinction of sta ck computers into first and second generations. ... Stacks in a second-generatio n computer resemble registers used for calculations and parameter-passing, while the sta cks of a first-generation machine are effectively call stacks holding procedure activation reco rds. ... Given that past arguments have been found lacking, the compa rison between second-generation stack computers and current register-based computers need s to be revisited. Chapter 6 de- scribes in detail the design of a small, modern stack compute r architecture, named ’Gullwing’, along with a simple optimization to its instruction fetch me chanism which makes practical the use of a single memory bus for both instructions and data. Chapter 7 compares Gullwing to the DLX and MIPS processors us ed as demonstrators by Hennessy & Patterson. ... Chapter 8 addresses Gullwing’s inefficient usage of memory f or holding compiled code by adding a third stack to temporarily hold instructions durin g subroutine calls " --

" 2.3.4 The Need for Index Registers Stack computers execute iterative code poorly compared to general-purpose register comput- ers. Random-access registers can hold the loop indices and intermediate results for immediate access. On the other hand, a stack computer must temporarily move values off the top of the stack to access any index or result that isn’t the most immediate. It is the source of enormous overhead whether or not this generates memory traffic and special-purpose index registers have to be used to reduce it. All first-generation stack computers included some form of index reg- ister " --

" 3.1 Charles H. Moore and the Second Generation The latest wave of second-generation stack computers was almost entirely initiated by Charles H. Moore and documented by Philip J. Koopman, Jr., with some additional unpublished mate- rial made available online by Jeff Fox [Fox04].

... The Forth Programming Language Basis of Second-Generation Stack Comput- ers Much as stack computers from the first generation were based on ALGOL, those from the second generation are derived from the Forth programming language.


A Forth system is divided into two interpreters. The outer interpreter receives source input and looks up each word in a dictionary. If found, it calls the inner interpreter to process the word’s definition. In the most basic case a word definition is a series of addresses of other words, themselves defined in the same manner, ending with primitives which are written in machine code 2 . The inner interpreter is a small virtual machine which walks through these definitions and executes the primitives it encounters. The inner interpreter keeps track of the nesting of these definitions on a stack in memory, commonly referred to as the Return Stack. The Forth primitives do their operations on another such stack, the Data Stack, where they take their arguments and place their results. The primitives are thus simple function applica- tions which can be composed by executing them in sequence. Higher-level words are functional compositions of these primitives. These new words interact with the stack and compose in the same manner as the primitives. A second-generation stack computer is basically a physical realization of the inner inter- preter, the Forth primitives, and the stacks. The primitives become the instruction set which operates on a hardware Data Stack. The inner interpreter reduces to simple call and return instructions which use a Return Stack to store the return addresses of subroutines. " --

" 3.4 Strengths and Weaknesses of the Second Generation The second generation of stack computers still has some of the drawbacks of the first: a need for index registers, stack manipulation overhead, and it additionally supports ALGOL-like lan- guages poorly. However, the second generation also has some distinct advantages: reduced memory bandwidth and system complexity, and faster subroutine linkage and interrupt re- sponse.

3.4.1 The Need for Index Registers As in computers from the first generation (Section 2.3.4), the access of loop indices or interme- diate results which are not immediately on top of the stack requires significant stack manipula- tion overhead. All second-generation stack computers, except the very smallest, mitigate this problem with index registers


3.4.2 Stack Manipulation Overhead Stack computers from the first generation suffered from excessive memory traffic (Section 2.3.3) since their stacks, except for a few working registers, were entirely in main memory. However, this had the advantage of allowing random access to the entire stack. Computers from the second generation keep their stacks separate from main memory and usually on-chip. This has the advantage of causing no memory traffic, but typically limits access to the topmost two to four elements since the stack is no longer addressable. This limitation requires that the topmost elements be permuted to bring a value to the top of the stack (Section 7.3.3). The original overhead of memory traffic is thus transformed into the overhead of extra operations to manipulate the stack. This problem can be mitigated by using more deeply- accessible stacks or more complex stack manipulation operations. This random access comes at the price of increased system complexity, culminating in a conventional multiparty register file.

3.4.3 Poor Support of ALGOL-like Languages Languages which are derived from ALGOL, such as C, use the stack as a means of allocating space for, amongst others, the local variables of a procedure. This entails pushing entire struc- tures onto the stack and then accessing them randomly, often through a pointer. Thus a local variable must be present in main memory since the stack in a second-generation computer is not addressable. Solutions to this problem include more sophisticated compilers [Koo94] [ME97] [ME98], the addition of some form of frame pointer register which can support indexed addressing, or making the return stack addressable such as in the PSC1000/IGNITE I [Sha02] [Sha99].

3.4.4 Reduced Instruction Memory Bandwidth and System Complexity The compact encoding of stack instructions stems from the implicit nature of their operands: It is always the top of the stack. Thus, contrary to a register machine, several such operations can fit into a single memory word. This correspondingly reduces the frequency of instruction fetches. The memory access cycles between instruction fetches can then be used for loads, stores, and fetches, eliminating the need for separate instruction and data memory buses (Sec- tion 7.2.4). This greatly reduces system complexity and allows for a higher internal operating frequency than the memory would normally allows.


3.4.5 Fast Subroutine Linkage and Interrupt Response A stack computer does not need to save the contents of registers upon entering a subroutine. Its parameters are already on top of the data stack, which become its working values throughout the computation, and ultimately remain as one or more return values upon exiting the subrou- tine. The call and return instructions automatically use the return stack for the return address. Subroutine linkage thus requires no additional memory traffic and takes only a cycle or two (Section 7.3.6). An interrupt service routine (ISR) is effectively a hardware-invoked subroutine, and so benefits from the fast linkage. The ISR can simply do its work on top of the stacks so long as it leaves them unaltered upon exit.


" Distinguishing the First and Second Generations

The first generation is based on Bauer’s “stack principle” for subroutine storage allocation (Section 2.1.3) and is exemplified by the Burroughs B5000 series (Section 2.2.3). The second generation originates in Hamblin’s method for evaluating and composing expres- sions (Section 2.1.4), first seen in the English Electric KDF9 (Section 2.2.2), and later inde- pendently rediscovered as the Forth programming language (Section


The first distinguishing feature is the location of the stacks. First-generation stack computers (Figure 4.1) kept their stacks as randomly accessible data structures in main memory, located by an automatically managed stack pointer register....

Later machines used circular buffers between the reg- isters and the memory to accelerate access to items recently put on the stack. Unfortunately, first-generation stack computers were overtaken by register-based machines before this feature became widespread.

General-purpose register computers (Figure 4.2) use the same kinds of stacks, also kept in memory. However, the stack pointer is now an ordinary register, selected by convention, and is manually controlled by software. Software manually loads and stores values from the stack into an arbitrarily accessible register file. Register machines usually place a cache buffer between the registers and memory.

Contrary to both first-generation and general-purpose register computers, second-generation machines (Figure 4.3) keep their stacks in the processor. The stacks are not randomly accessi- ble data structures and only make visible the top two to four elements. The stacks are separate from main memory and only interact with it via load, store, and flow control instructions. The number and size of the stacks are fixed, although they may spill to memory via a pointer, de- pending on the size of the system, and thus behave as their own caches. There are typically only two stacks, shared by all processes, which can internally exchange data amongst themselves.


Use of Stacks: Procedure Nesting vs. Expression Evalu- ation

The second distinguishing feature is the use of the stacks.

First-generation stack computers (Figure 4.1) used stacks as structured temporary storage for program procedures. Each procedure invocation would automatically cause the allocation of an amount of space on the stack to contain (typically) the parameters of the procedure, its return value, its local variables, and the return address of its caller. This area is referred-to as a procedure activation record. Values from the record were loaded and stored into the small internal stack as needed for expression evaluation. The internal stack only held the intermediate results of computations within a given procedure. All linkage between procedures was done solely on the stack in memory.

General-purpose register computers (Figure 4.2) use the same kind of stacks, in the same manner, except that the procedure activation records are manually managed by software and procedure linkage can occur through the registers if the parameters, locals, and return values fit within them.

Second-generation stack computers (Figure 4.3) use separate stacks to control procedure nesting and to perform expression evaluation. The return addresses of procedure calls are stored on a stack dedicated to that purpose (Return Stack). Storing them separately helps to eliminate the division of the other stack (Data Stack) into procedure activation records. Thus, a called procedure finds its parameters (P) on top of the Data Stack, manipulates them directly as locals (L) during expression evaluation, and leaves on top either its return value (R) upon exit or the parameters for calling another procedure. The Data Stack is used for an effectively single, large, and complex expression evaluation whose parts are tracked by the contents of the Return Stack.

Operations with Stacks: High-Level Language Support vs. Primitive Operations

The third distinguishing feature is the operations performed upon the stacks.

First-generation stack computers (Figure 4.1) had built-in hardware and microcode support for high-level language (typically ALGOL) and operating system features. The procedure call and return instructions would automatically allocate and deallocate activation records on the stack. The arithmetic instructions would determine the data type of their operands from special tags. Indirect reference words and data descriptors would allow for lexical scoping across ac- tivation records and resolve call-by-name references. Multiple stacks could be maintained and cross-referenced to enable multitasking and inter-process communication. Descriptors could point to data on disk to support code and data paging-on-demand. The end result was powerful, but extremely complex. These features are well-described in Organick’s book [Org73].

General-purpose register computers (Figure 4.2) have none of these language-specific fea- tures, although they were designed with the efficient support of ALGOL-like languages in mind. A large register file supports fast, simple procedure linkage when possible, and can hold multiple intermediate values during expression evaluation. The instruction set is simple but can use any registers as source and target. High-level language and system features are managed in software by the compiler and the operating system.

Second-generation stack computers (Figure 4.3) have more in common with general-purpose register machines than with first-generation stack computers. The call and return instructions only save and restore the calling procedure’s return address, leaving more complex procedure linkage to software. Arithmetic instructions operate solely on the top few elements of an inter- nal stack and their operands must be loaded from and stored to memory explicitly. Only simple direct and indirect addressing modes are supported, although post-incrementing/decrementing versions are common. Other than the implicit use of stacks for basic procedure linkage and expression evaluation, all high-level language and operating system features are implemented in software in the same manner as in register-based machines. "

" Nevertheless, on a stack computer the hardware must evaluate the expression in only one order, since operands are hidden on the stack, and it may have to load an operand multiple times. [HP02, pg. 93]

true. Operands are hidden on the stack and even with the aforementioned compiler techniques this fact makes for poor performance on iterative code due to the stack manipulations required (Section 7.3.3). However, a register-based computer has the same kind of repeated memory accesses and register-to-register copying overhead when entering or exiting a subroutine (Section 7.3.6). This overhead is virtually nil in stack comput- ers . Finally, there are hints that a redesigned stack computer instruction set could combine stack manipulations with arithmetic operations ’for free’, without adding any new datapath or control lines (Sections 9.2.2 and 9.2.3). "


" The arguments in this section suggest an interesting way of thinking qualitatively about stacks: that they are the ’reciprocal’ (or the ’inverse’) of registers. For example, reading a register does not destroy its contents, but writing does. Conversely, reading from a stack pops information from it, but writing to it simply pushes the existing information down. Up to its maximum capacity, no information is lost. This is a tantalizing symmetry. " --

" Past experiments have shown that an on-chip stack buffer that is 16 to 32 elements deep eliminates virtually all in-memory stack accesses for both expression evaluation and subroutine nesting and that the number of accesses to memory decreases exponentially for a linear increase in hardware stack depth [Koo89, 6.4] [Bai96, 4.2.1] (Appendices B.2.6 and B.2.7). Current Intel processors use exactly such a stack, 16 elements deep, to cache return addresses on-chip [Int06,,], as does the Alpha AXP 21064 processor [McL?93]. Hennessy & Patterson themselves show data supporting this feature [HP96, pg. 277] [HP02, pg. 214]. The SPARC architecture accomplishes a similar results with its register windows. " --

" The Gullwing instruction set is identical to that of the F21 (Section 3.2.5), with some minor implementation simplifications which are beneficial to pipelining (Section 7.4): ...

Address Register (A), the Program Counter (PC), or the Return Register (R).


The PC holds the address of the next memory word to be read for fetching instructions or literals. Upon a subroutine call, it is stored into R, which is itself saved onto the Return Stack (RS). The reverse process occurs when a subroutine returns. The A register is used under program control to hold the addresses of loads and stores, with optional post-incrementing. The R register can be used in the same manner.

... the Arithmetic and Logic Unit (ALU) always takes its inputs from the Top Register (TOP) and the Data Stack (DS) and returns its output to TOP, whose original contents are optionally pushed onto the DS. The TOP register is the central point of gullwing. It is the source and target for memory operations and can exchange data with the A and R registers.


Flow Control

PC@ Fetch next group of instructions into ISR JMP Unconditional Jump JMP0 Jump if top of data stack is zero. Pop stack. JMP+ Jump if top of data stack is positive. Pop stack. CALL Call subroutine RET Return from subroutine

Table 6.2: Gullwing Load and Store Instructions

LIT Push in-line literal onto data stack @A+ Push MEM[A++] onto data stack @R+ Push MEM[R++] onto data stack @A Push MEM[A] onto data stack !A+ Pop data stack into MEM[A++] !R+ Pop data stack into MEM[R++] !A Pop data stack into MEM[A]

Table 6.3: Gullwing ALU Instructions

NOT Bitwise complement of top of data stack AND Bitwise AND of top two elements of data stack XOR Bitwise XOR of top two element of data stack + Sum of top two elements of data stack 2* Logical shift left of top of data stack 2/ Arithmetic shift right of top of data stack +* Multiply step

Table 6.4: Gullwing Stack Manipulation Instructionss

A> ("A From") Push A onto data stack >A ("To A") Pop data stack into A DUP Duplicate top of data stack DROP Pop data stack OVER Push copy of second element onto data stack >R Pop data stack and push onto return stack R> Pop return stack and push onto data stack

Table 6.5: Gullwing No-Op and Undefined Instruction

NOP Do Nothing UND[0-3] Undefined Do Nothing "

his idea for a third stack (i don't understand this; it seems to me that what he describes improves execution speed, not code density, so i must be misunderstanding; perhaps it's because his instructions are only 5 bits long and are packed into 32-bit words? so this is a way to use the rest of the word after a CALL?):

" Improving High-Level Code Density by Adding an In- struction Stack

The key to improving the density of high-level code is the observation that if the instructions to be executed after a subroutine call are placed in the same memory word as the call, they will be fetched along with the call and they should not need to be fetched again when the subroutine returns. The instructions simply need to be temporarily stored in the processor while the subroutine executes. The number of words that need to be stored is identical to the current nesting depth of the program. This suggests extending the Instruction Shift Register (ISR) with a stack that operates in synchrony with the Return Stack (RS). Figure 8.2 illustrates the process. For clarity, only four slots are depicted and the Return Register (R) is omitted (see Figure 6.1). " --

" 7.5 Summary and Performance Comparison In Section 7.2, the comparison of benchmark statistics revealed these facts about Gullwing, relative to DLX/MIPS: • A greater number of literal fetches, subroutine calls, and stack permutations are executed. • An average CPI of 1.31, which is poor compared to the average of 1.14 for a DLX. However, Gullwing is actually architecturally equivalent to a DLX without load delay slots or delayed branches, whose average CPI is 1.38. • An average number of memory accesses per cycle of 0.667, compared to 1.421 for DLX/MIPS (Section 7.2.3). • An average code density of only 1.2 instructions per memory word, out of a potential maximum of three, because most of the instruction slots remain unused. In Section 7.3, a detailed inspection and analysis of equivalent programs which express fun- damental code features uncovered these differences between Gullwing and a generic MIPS-32 computer: • The random-access registers and explicit operands of the MIPS design are a definite advantage when multiple values must be maintained at once in an algorithm. Gullwing must instead execute additional instructions to manipulate the stacks to get to the value it needs. • The MIPS processor must simulate a stack in software for subroutines calls and recur- sive procedures. The extra instructions to implement this stack consume more memory bandwidth and processors cycles than the equivalent Gullwing code. • The Gullwing processor requires less memory bandwidth, often half that of MIPS, re- gardless of the number of cycles required for an algorithm. " --

" 9.2.1 Reducing the DLX/MIPS Subroutine Call Overhead by Adding Stacks

The MIPS architecture has very inefficient subroutine calls compared to Gullwing, stemming from the need to save and restore registers to a stack in memory (Section 7.3.6). Conversely, Gullwing has poor performance for iterative code because it lacks random-access registers. A combination of both might yield the best of both worlds. Figure 9.1 shows a conceptual modification to the MIPS register file which would create the possibility of running stack-like code for efficient subroutine calls, while otherwise leaving the instruction set, architecture, and pipeline unchanged. A Data Stack and a Return Stack are added ’underneath’ some registers. The Return Stack goes under register $31 since it is where the current return address is normally stored by the jal (Jump And Link) instruction. The Data Stack could be placed under any other registers, but is placed here under registers $2 and $1 for illustrative purposes. Two registers are used so that two operands can be taken from the stack when needed. Register $0 is is of course a source of zeroes on reads and a sink on writes. If the stacks are disabled, the register file behaves as usual, and existing code runs un- changed. When the stacks are enabled, a read from a register connected to a stack ($31, $2, and $1) pops the content of the stack into the register, making the read operation destructive, and a write to a register pushes the previous contents of the register onto the stack. The excep- tions to this behaviour are when both $2 and $1 are read together, only the contents of $2 are overwritten by the stack, and if $1 is both read and written in that same instruction, it is simply overwritten as if it had been popped, then pushed.


The stack manipulation instructions, such as DROP and OVER, should never be required. A compiler would use other registers as usual for the storage of counters and common subex- pressions, thus avoiding the stack manipulation overhead of iterative code seen in Section 7.3.3, and avoiding the RAW (Read After Write) dependency of the stack. Conversely, the stack could be used to hold local variables and arguments to subroutines, which would reduce or outright eliminate the loads and stores required for nested subroutine calls on MIPS (Section 7.3.6). " --

Table B.8: Executed Instruction Counts

(i'll only list the entries for the top instructions, in terms of C/I, the fraction out of total instructions for at least one of the two tasks listed, Ext. and VM):

      Ext       VMLIT   .127 .158 + .103 .077 >A    .126 .125 2/ 0 .126 @A    .064 .082 CALL  .064 .050 RET   .064 .075 @A+ .064 .008 XOR   .064 .021 A> .060 .009 FOLDS .042 .074

(FOLDS is something about "instructions...executed concurrently (’FOLDS’) with a PC@.")


B.2.4 Instruction Types

instruction class, instructions: Ext frequency, VM frequency

Stack Manipulation: DUP, DROP, OVER, >R, R>, >A, A> 0.292, 0.233 Arithmetic & Logic: XOR, AND, NOT, 2*, 2/, +, +* 0.169 , 0.277 Fetches: PC@, LIT 0.168, 0.180 Load/Store: @A, @A+, !A, !A+, @R+, !R+ 0.156 , 0.138 Subroutine: CALL, RET, JMP 0.150, 0.152 Conditionals: JMP+, JMP0 0.061, 0.011 Conditionals (Taken): JMP+ TAKEN, JMP0 TAKEN 0.005, 0.010 NOP/UND: NOP, UND[0-3] 0, 0


Table B.12: Data Stack Depth

the average was 2.6 and 3.9 (for the ext and VM benchmarks), the max was 10 and 14 (for the ext and VM benchmarks)

Table B.13: Return Stack Depth

the average was 2.1 and 4.0 (for the ext and VM benchmarks), the max was 7 and 11 (for the ext and VM benchmarks)

At one point, references Chapman's 'stack quarks', defined here: , which appear to be a basis set of operations for dealing with a stack when the top two elements of the stack are held in registers. The basis set is:

DRIP: a b c ... -> a a c ... DIP: a b c ... -> b b c ... NUP: a b c ... -> a b b c ... NIP: a b c ... -> a c ... TUCK: a b c ... -> a b a c ... TAKE: a b c ... -> c b ...


why is accessing variables in the stack faster? is much of it just that it's a simple/efficient 'bump' allocation strategy?


summary of some of the benefits of stack machines:

are those right? what else?


note: what are the benefits of separating data stack and call stack?

one benefit is that the data stack need not be empty upon calls or returns. But how is this different from ordinary parameter passing and return arguments? Only if the contents of the data stack is of variable length, i guess.