proj-plbook-plChStackMachines

Table of Contents for Programming Languages: a survey

---

" Many of the designs for these stack computers have their roots in the Forth programming language. This is because Forth forms both a high level and assembly language for a stack machine that has two hardware stacks: one for expression evaluation/parameter passing, and one for return addresses. In a sense, the Forth language actually defines a stack based computer architecture which is emulated by the host processor while executing Forth programs. The similarities between this language and the hardware designs is not an accident. Members of the current generation of stack machines have without exception been designed and promoted by people with Forth programming backgrounds. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec1_5.html (1989)

" Stack machines of all kinds may be classified by a taxonomy based upon the number of stacks, the size of the dedicated stack buffer memory, and the number of operands in the instruction format. The stack machines featured in this book are those with multiple stacks and 0-operand addressing. The size of the stack buffer memory is a design tradeoff between system cost and operating speed. For the bulk of this volume, "stack machines" refers to these particular machines. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec1_6.html

" Historically, computer designs that promise a great deal of support for high level language processing have offered the most hardware stack support. This support ranges from a stack pointer register to multiple hardware stack memories within the central processing unit. Two recent classes of processors have provided renewed interest in hardware stack support: RISC processors, which frequently feature a large register file arranged as a stack, and stack oriented real time control processors, which use stack instructions to reduce program size and processor complexity. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec1_6.html (1989)

" A disadvantage of a single stack is that parameter and return address information are forced to become mutually well nested. This imposes an overhead if modular software design techniques force elements of a parameter list to be propagated through multiple layers of software interfaces, repeatedly being copied into new activation records.

Multiple Stack computers have two or more stacks supported by the instruction set. One stack is usually intended to store return addresses, the other stack is for expression evaluation and subroutine parameter passing. Multiple stacks allow separating control flow information from data operands.

In the case where the parameter stack is separate from the return address stack, software may pass a set of parameters through several layers of subroutines with no overhead for recopying the data into new parameter lists.

An important advantage of having multiple stacks is one of speed. Multiple stacks allow access to multiple values within a clock cycle. As an example, a machine that has simultaneous access to both a data stack and a return address stack can perform subroutine calls and returns in parallel with data operations. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec2_1.html

" To be competitive in speed, a stack machine must have at least one or two stack elements buffered inside the processor. To see the reason for this, consider an addition operation on a machine with unbuffered stacks. A single instruction fetch for the addition would generate three more memory cycles to fetch both operands and store the result. With two elements in a stack buffer, only one additional memory cycle is generated by an addition. This memory cycle is used to fetch the new second-from-top stack element, filling the hole created by the addition's consumption of a stack argument. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec2_1.html

" A small stack buffer with primary stacks residing in program memory allows quick switching between stacks for different tasks since the stack elements are predominately memory resident at all times.

The fact that a small dedicated stack buffer is simple to implement and easy to manage makes it very popular. In particular, the fact that most stack elements reside in main memory makes managing pointers, strings, and other data structures quite easy. The disadvantage of this approach is that significant main memory bandwidth is consumed to read and write stack elements. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec2_1.html

" If an architecture has a large enough stack buffer that main memory bandwidth is usually not consumed to access stack elements, then the architecture has a Large Stack Buffer. This large buffer may take one of several forms. It may be a large set of registers accessed using a register window scheme such as that used by the RISC I processor (Sequin & Patterson 1982), a separate memory unit that is isolated from program memory, or a dedicated stack memory cache in the processor (Ditzel & McLellan? 1982). In any event, the stack buffer is considered "large" if several levels of subroutines (say, 5 or more) may be processed without exhausting the capacity of the stack memory. In the case of a stack that is only used as an expression evaluation stack, "large" may only be approximately 16 elements, since single expressions seldom nest very deeply (Haley 1962). In Chapter 6, we shall examine some program execution statistics that will give more insight into how large is large enough.

An advantage of a large stack buffer is that program memory cycles are not consumed while accessing data elements and subroutine return addresses. This can lead to significant speedups, particularly in subroutine intensive environments.

A disadvantage of a separate stack memory unit is that it may not be large enough for all applications. In this case a spilling of data into program memory to make room for new stack entries may be required. Also, saving the entire stack buffer when switching between tasks in a multitasking environment may impose an unacceptably large context switching overhead, although it should be noted that this can be solved by dividing the stack memory into separate areas for separate tasks. At a lower level, separate data buses for off-chip stack memories and program memory will add pins and expense to a microprocessor. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec2_1.html
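
The spill problem mentioned above can be sketched as simple bookkeeping: a fixed-size on-chip buffer backed by a memory-resident stack, demoting its oldest elements to memory when it fills and pulling them back when it drains. A rough Python model (class and field names invented):

    from collections import deque

    class BufferedStack:
        """Fixed-size stack buffer that spills its oldest elements to a
        memory-resident stack when full and refills from it when empty."""
        def __init__(self, buffer_size):
            self.buffer = deque()        # on-chip elements, newest on the right
            self.memory = []             # memory-resident portion of the stack
            self.size = buffer_size
            self.memory_cycles = 0       # spill/fill traffic to program memory

        def push(self, value):
            if len(self.buffer) == self.size:          # full: spill the oldest
                self.memory.append(self.buffer.popleft())
                self.memory_cycles += 1
            self.buffer.append(value)

        def pop(self):
            if not self.buffer:                        # empty: refill from memory
                self.buffer.append(self.memory.pop())
                self.memory_cycles += 1
            return self.buffer.pop()

With a large enough buffer_size, memory_cycles stays near zero for most programs, which is the point of the "large stack buffer" category; a context switch then has to save or partition that buffer, as the quote notes.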

" There are several advantages to the simplicity of 0-operand instructions. One is that only the top one or two stack locations can be referenced by an instruction. This can simplify construction of the stack memory by allowing the use of a single ported memory with one or two top-of-stack registers. A speed advantage may also be gained by loading the operand registers in parallel with instruction decoding, since the operands for each instruction are known in advance to be the top stack elements. This can completely eliminate the need for pipelining to fetch and store operands.

Another advantage is that individual instructions can be extremely compact, with an 8 bit instruction format sufficing for 256 different opcodes. Furthermore, instruction decoding is simplified, since no operand addressing modes need be interpreted by the decoding hardware.

A disadvantage to the 0-operand addressing mode is that complex addressing modes for data structure accessing may take several instructions to synthesize. Also, data elements that are deeply buried on the stack can be difficult to access if provisions are not made for copying the Nth-deep data stack element to the top of the stack. ... 2-operand machines offer a maximum of flexibility, but require more complicated hardware to perform efficiently. Since no operands are known before an instruction is decoded, a data pipeline and dual ported register file must be used to supply operands to the execution unit. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec2_1.html
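
The instruction-size argument is easy to see with a toy encoding (both formats below are invented for illustration): a 0-operand instruction is nothing but an opcode, while a 2-operand register instruction must also name its registers.

    # A 0-operand (pure stack) instruction is just an opcode, so one byte
    # covers 256 operations; a 2-operand register instruction must also
    # encode register numbers.  Both formats here are invented.
    STACK_OPCODES = {"ADD": 0x01, "SUB": 0x02, "DUP": 0x03, "SWAP": 0x04}

    def encode_stack(op):
        return bytes([STACK_OPCODES[op]])             # 1 byte, no operand fields

    def encode_two_operand(opcode, dst, src):
        # e.g. 1 opcode byte plus 1 byte packing two 4-bit register numbers
        return bytes([opcode, (dst << 4) | src])

    print(len(encode_stack("ADD")))                   # 1
    print(len(encode_two_operand(0x01, 3, 5)))        # 2, before any immediates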

" In order to convey the category for a particular architecture, we shall use a three character shorthand notation based on the three axes of classification. The first letter of the abbreviation specifies the number of stacks (Single or Multiple). The second letter of the abbreviation specifies the size of dedicated stack memory (Small or Large). The third letter of the abbreviation specifies the number of operands in the instruction format (0, 1, or 2). Thus, the abbreviation SS0 would signify an architecture with a single stack, small dedicated stack memory, and 0-operand addressing and the abbreviation ML2 would signify an architecture with multiple stacks, large dedicated stack memory, and 2-operand addressing. ... 2.3 INTERESTING POINTS IN THE TAXONOMY.

Perhaps the most surprising feature of the taxonomy space is that all twelve categories are populated by designs. This indicates that a significant amount of research on diverse stack architectures has been performed. Another observation is that different machine types tend to group along the operand axis as the major design parameter, with the number and size of stack buffers creating distinctions within each grouping.

The taxonomy categories with 0-operand addressing are "pure" stack machines. Unsurprisingly, these categories have the most academic and conceptual design entries, since they include the canonical stack machine forms. Because of its inherent simplicity, the SS0 machine category is populated by designs constrained by scarce hardware resources, limited design budget, or both. Designs in the SS0 category may have efficiency problems if return addresses and data elements are intermingled on the stack and an efficient deep stack element copy operation is not provided.

The SL0 category seems to only be applicable to special purpose machines used for combinator graph reduction (a technique used to execute functional programming languages (Jones 1987)). Graph reduction requires doing a tree traversal to evaluate the program, using a stack to store node addresses while performing the traversal. No expression evaluation stack is required, since the results are stored in the tree memory itself.

The MS0 and ML0 categories have similarly designed machines, distinguished mainly by the amount of chip or board area spent for buffering stack elements. All the Forth language machines and many of the other high level language machines fall into these categories. Machines in these categories are finding increasing use as real time embedded control processors because of their simplicity, high processing speed, and small program sizes (Danile & Malinowski 1987, Fraeman et al. 1986). Many MS0 and ML0 designs allow for very fast or even zero-cycle subroutine calls and returns.

The entries with 1-operand addressing are designs that attempt to break bottlenecks that may arise in 0-operand designs by altering the pure stack model into a stack/accumulator design. The SS1 designs can use an address to access local variables and frame pointers more easily than SS0 designs. In general, the perceived advantage of a 1-operand design is that a push operation can be combined with an arithmetic operation, saving instructions in some circumstances. Additionally, the Pascal and Modula-2 machines use 1-operand addressing because of the nature of P-code and M-code.

The entries with 2-operand addressing tend to be more mainstream designs. Conventional microprocessors fall into the SS2 category. The RISC designs fall into the SL2 category because of their register windows, and no other designs fall into this category. The MS2 categorization for the 680x0 family reflects the flexibility of that machine, which can use any one of its eight address registers as a stack pointer. The ML2 entry for the PSP machine reflects an attempt in a conceptual design to carry the register window to an extreme for speeding up subroutine calls. The SF1 machine also uses multiple stacks, but dedicates a hardware stack to each active process in a real time control environment.

From the preceding discussion, we see that designs can be found that fall into all twelve categories of the taxonomy. Designs within each group of the taxonomy display strong similarities, while designs in different groups can be shown to have differences that affect implementation and system operation. Thus, the taxonomy is a useful tool for gaining insight into the nature of stack oriented computers.

In the next chapter, we shall focus on a certain sector of the stack machine design space: MS0 and ML0 machines. In all future references, we shall mean either an MS0 or ML0 machine when we use the term "stack machine." " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec2_2.html , http://users.ece.cmu.edu/~koopman/stack_computers/sec2_3.html
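
The three-character shorthand is mechanical enough to write down as code; a small Python helper (names invented) also enumerates the twelve categories that the text says are all populated by real designs.

    from itertools import product

    def classify(num_stacks, buffer_size, operands):
        """Koopman's shorthand: Single/Multiple stacks, Small/Large dedicated
        stack buffer, and 0/1/2 operands per instruction."""
        s = "S" if num_stacks == 1 else "M"
        b = "S" if buffer_size == "small" else "L"
        return s + b + str(operands)

    print(classify(1, "small", 0))    # SS0: canonical single-stack machine
    print(classify(2, "large", 0))    # ML0: the "stack machines" of the book
    print([a + b + str(n) for a, b, n in product("SM", "SL", range(3))])  # all 12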

"

Multiple stack, 0-operand machines have two inherent advantages over other machines: 0-operand addressing leads to a small instruction size, and multiple stacks allow concurrent subroutine calls and data manipulations. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec3_1.html

" The amount of stack memory needed by most programs is typically rather small. Furthermore, it can be guaranteed by design to be small in short, time-critical processes. So, even a modest stack buffer of 128 elements can be divided up among four processes with 32 elements each. If more than four processes are needed by the multi-tasking system, one of the buffers can be designated the low priority scratch buffer, which is to be shared using copy-in and copy-out among all the low priority tasks. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec6_5.html#6532

" as processors get faster, the question is: how much stack memory needs to be placed on-chip to obtain good performance? The answer to this question is crucial, since it affects the cost and performance of high-end stack processors that place stack memory on-chip. ...The first question, the one of the size of the on-chip stack buffer, is best resolved by a simulation of various programs with different size stack buffers... As a practical matter, a stack size of 32 will eliminate stack buffer overflows for almost all programs. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec6_4.html

" The stack machine design philosophy falls somewhere in between the philosophies of high level language machine design and RISC design. Stack machines provide very simple primitives which may be executed in a single memory cycle, in the spirit of the RISC philosophy. However, efficient programs on stack machines make extensive use of application specific code that is accessed via cheap subroutine calls. This collection of subroutines may be thought of as a virtual instruction set that is tailored to the needs of high level language compilers, without requiring complicated hardware support. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec6_1.html

" The problem in generating code for stack machines from C is that there are several assumptions about the operating environment deeply entrenched in the language. The most profound of these is that there must be a single program-memory-resident stack that contains both data and subroutine return information. This assumption can not be violated without "breaking" many C programs. As an example, consider the case of a pointer that references a local variable. That local variable must reside in the program memory space, or it can not be properly referenced by the program.

To make matters worse, C programs typically push large volumes of data, including strings and data structures, onto the C-stack. Then, C programs make arbitrary accesses within the area of the current stack frame (the portion of the stack containing variables belonging to the current procedure). These restrictions make it unfeasible to attempt to keep the C stack on a stack machine's Data Stack.

How then can a stack machine be made efficient at running C programs? The answer is that the stack machine must efficiently support frame-pointer-plus-offset addressing into program memory.

...

The other notion often found in C, and other high level languages, that does not map well onto stack machines is that of a "register variable." Since stack machines do not have a set of registers, this implies that compiler optimization opportunities may be missed by stack machines. This is only partially true. While it is true that stack machines are not well suited for juggling a large number of temporary values on the stacks, a small number of frequently accessed values can be kept on the stack for quick reference. For example, these values might include a loop counter kept on the return stack and two addresses for a string compare kept on the data stack. In this manner, most of the efficiency of the hardware can be captured for the majority of C programs. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec7_2.html
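
What "frame-pointer-plus-offset addressing into program memory" buys is easiest to see with a tiny sketch: because a C local lives at a real memory address (FP plus a compile-time offset), taking its address and dereferencing it through a pointer just works. Names and offsets below are invented.

    # C locals kept in a program-memory stack frame, addressed as FP + offset.
    memory = [0] * 1024
    FP = 512                       # frame pointer of the current C procedure

    # int x = 7;   (the compiler assigns x the offset 2 within the frame)
    memory[FP + 2] = 7

    # int *p = &x; *p += 1;  -- legal because x has a real memory address
    p = FP + 2
    memory[p] += 1
    print(memory[FP + 2])          # 8

Locals that never have their address taken are the ones a compiler may safely keep on the hardware data or return stack instead, as in the loop-counter and string-compare example above.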

" There is evidence that programming languages used to implement rule-based systems, such as those written in Prolog, LISP, and OPS-5 are very well suited to stack machines. One very exciting possibility is the marriage of real time control applications with rule-based system knowledge bases. Preliminary research into this area has been encouraging. ... This speedup of nearly three orders of magnitude is astonishing to some, but merely reflects the suitability of using a stack machine, which is good at tree traversal, for solving problems that use decision trees.

The speedup observed for the rule-based system is actually based on a principle that applies to a wide variety of problem areas. Stack machines can treat a data structure as an executable program. Consider for a moment an example of a tree data structure, with pointers at the internal nodes and program action tokens at the leaves. The nodes of the trees that are pointers can just be the addresses of the children, which equates to subroutine calls in many stack processors. The leaves of the trees can be executable instructions, or subroutine calls to procedures that accomplish some task. A conventional processor would have to use an interpreter to traverse through the tree in search of the leaves. A stack processor can just directly execute the tree instead. Since stack machines execute subroutine calls very quickly, the results can be extremely efficient. The technique of directly executing tree-formatted data structures is responsible for the tremendous speed of the RTX 32P example cited in the previous paragraph. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec7_2.html
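
The "execute the data structure" idea can be sketched in Python. On a machine like the RTX 32P the internal nodes would literally be subroutine-call instructions holding the addresses of their children, and the leaves would be ordinary code; here tuples and functions stand in for them.

    # Direct execution of a tree-shaped data structure (a sketch).
    results = []
    def emit(token):                       # a leaf "action routine"
        return lambda: results.append(token)

    tree = (
        (emit("A"), emit("B")),            # internal node: calls its children
        (emit("C"), (emit("D"), emit("E"))),
    )

    def execute(node):
        if callable(node):                 # leaf: run the action
            node()
        else:                              # internal node: "call" each child,
            for child in node:             # which on a stack machine is just
                execute(child)             # a hardware CALL/RET pair
    execute(tree)
    print(results)                         # ['A', 'B', 'C', 'D', 'E']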

" Stack machines are well suited to LISP programming as well as to expert systems. This is because LISP and Forth are very similar languages in many respects. Both treat programs as lists of function calls to other lists. Both are extensible languages. Both use Polish notation for arithmetic operations. The major difference is that LISP involves dynamic storage allocation for its cells, while Forth uses a static storage allocation. Since there is no reason that a stack machine should be any worse at garbage collection than other machines, LISP should run efficiently on a stack machine.

Many of the same arguments about stack machines' suitability for LISP apply to Prolog. In a Prolog implementation for the RTX 32P, this author made an additional discovery about how to efficiently map Prolog onto stack machines. Prolog uses typed data that can be either an actual data element or a pointer to other data. A possible encoding for Prolog data elements is one that uses the highest 9 bits of a 32-bit word for a data type tag. The lowest 23 bits are then used as either a pointer to another node, a pointer to a 32 bit literal value, or a short literal value. Using this data format, data items are actually executable as instructions. Instructions for the RTX 32P can be constructed that allow traversing an arbitrarily long series of pointer dereferences at the rate of one dereference per memory cycle, simply by executing the data structure as a program. Nil pointer checking can be accomplished by defining the nil pointer value to be a subroutine call to an error trapping routine. These kinds of data handling efficiencies are simply not possible with other types of processors.

Functional programming languages offer the promise of a new way of solving problems using a different model of computation than that used by conventional computers (Backus 1978). A particular method of executing functional programs is the use of graph reduction. The same techniques of direct execution of the program graphs that were discussed for rule-based systems above are equally applicable to graph reduction. Thus, stack machines should be good at executing functional programming languages. Belinfante (1987) has published a Forth-based implementation of graph reduction. Koopman & Lee (1989) describe a threaded interpretive graph reduction engine.

From a theoretical point of view, efficient graph reduction machines such as the G-machine and Norma fall into the SL0 category of the taxonomy in Chapter 2. ML0 machines have a superset of the capabilities of SL0 machines, and should therefore be efficient at graph reduction as well. Initial investigations into this area by this author show that the RTX 32P, which is a very simple stack machine, can compete quite effectively with even very complex graph reduction machines such as Norma.

One of the side effects of using a functional programming language is that a high degree of parallelism is available during program execution. This raises the idea of a massively parallel computer made of stack processors that is programmed with a functional programming language. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec7_2.html
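
The tagged-cell format described for the RTX 32P Prolog experiment (9-bit tag, 23-bit pointer or short literal in a 32-bit word) packs and unpacks like this; the particular tag values below are invented.

    # 32-bit Prolog cell: highest 9 bits = type tag, lowest 23 bits = pointer
    # or short literal, as described above.  Tag assignments are invented.
    TAG_BITS, VALUE_BITS = 9, 23
    VALUE_MASK = (1 << VALUE_BITS) - 1

    TAG_REF, TAG_INT, TAG_ATOM = 0, 1, 2

    def pack(tag, value):
        assert 0 <= value <= VALUE_MASK
        return (tag << VALUE_BITS) | value

    def unpack(cell):
        return cell >> VALUE_BITS, cell & VALUE_MASK

    cell = pack(TAG_REF, 0x1234)           # a reference to another node
    print(unpack(cell))                    # (0, 4660)

As the quote explains, the tag bits can be chosen so that such a cell is itself a valid instruction (for example a subroutine call to the pointed-to node), which is what makes one pointer dereference per memory cycle possible by simply executing the data.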

" 7.3 UNIFORMITY OF SOFTWARE INTERFACES

A key conceptual feature of stack machines is their uniformity of interface between high level code and machine instructions. Both procedure calls and opcodes use the stack as a means of passing data. This consistent interface has several positive impacts on software development.

The source code for a program does not have to reflect in any manner which instructions are directly supported by the machine and which functions are implemented as procedures at the Forth language level. This capability suggests the use of a low-level stack language, similar to Forth, that is the target of compilation for all languages. Given an assembler from this target language to the actual machine, the user is freed from worry about how particular functions are implemented. This means that various implementations of the same architecture can be made very compatible at the stack-based source code level without actually having to provide all instructions on low-cost implementations. If the same interface is used for conventional languages as well as Forth, then combinations of C code, Forth code, and code from other languages can be intermingled without problems.

In microcoded machines, such as the RTX 32P, this interface can be exploited one step further. Application specific microcode can be used to replace critical sequences of instructions in the application program, as well as common compiler generated code sequences, transparently to the user. In fact, a common method of application code development on a microcoded stack machine is to first write the entire application in high level code, then go back to microcode the critical loops. The rewriting of subroutines from high level language to microcode is invisible to the rest of the program, except for the speed increase. This speed increase is a speedup factor of approximately two for many applications. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec7_3.html

" Conventional languages can be implemented very easily on stack machines. The only problem is that pure stack machines probably can not perform quite as well as register machines when running conventional programs written in the normal programming style. This problem is mostly one of a mismatch between stack machine capabilities and the requirements of conventional languages. Conventional programs tend to use few procedure calls and large numbers of local variables. Stack machines tend to be good at running programs with many small procedures and few local variables. In part, the difference is due to the programming styles encouraged by common practice and the structure of conventional programming languages. To some extent, the difference is that register machines are well suited to general purpose data processing, whereas stack machines perform best in a real time control environment. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec9_1.html

" 9.1.1 Stack frames

The issue, then, is to identify high level language structures that require additional hardware support. Strangely, the high level language run-time stack is the only major area in which pure stack machines fail to support conventional high level languages. This is because high level languages have the notion of "activation records," which are "frames" of elements pushed onto a software-managed stack in program memory upon every subroutine call. In the usual implementation, each stack frame is allocated only once as a large chunk of memory in the preamble to the subroutine being called. This frame contains the input parameters (which were actually allocated by the calling routine), the user declared local variables, and any compiler generated intermediate variables that might be necessary. During the course of the subroutine, arbitrary accesses are made within the stack frame without actually performing any pushes or pops.

The stack frame is used as a temporary memory allocation device for subroutine calling, not as a traditional pushdown stack for storing intermediate results between successive calculations. This means that it is incompatible with hardware stacks built into stack machines such as those we have been studying. An obvious approach to modify stack machines to meet this requirement is to build the primary stack so as to be allocated in large chunks with random access within a frame. This is precisely how the RISC machines with register windows (described as SL2 machines in Chapter 2) solve the problem. The reason why this does not make sense for stack machines is that all accesses to data pay the penalties associated with having operands in the instruction format.

An alternative approach is to build a secondary hardware stack for accessing local variables at a "slow" speed, with primary data manipulations done on a LIFO hardware data stack. This lets us have the best of both worlds, but is not without cost. That secondary register frame stack will in general have to be 5 to 10 times as big as the data stack for good operating characteristics.

9.1.2 Aliasing of registers and memory

If this were the end of the tradeoff discussion, we might still be tempted to build a chip with on-chip frames. But, there is a deciding factor which tilts the balance in favor of placing the stack frames in program memory. That additional factor is that the semantics of conventional languages allow access to these local variables by memory address. The C language is notorious for this problem, which affects register machines and stack machines alike.

...

It is indeed tempting to write complex compilers in an attempt to keep most local variables on the hardware stack while executing conventional languages. An experiment by this author with some stack machines has shown, however, that the difference between a C compiler that keeps all local variables on the hardware data stack and one that uses program memory with a frame pointer is small. In fact, if the machine has a hardware frame pointer, keeping the frames in program memory is actually somewhat faster.

Normally, one would think that keeping local variables on the hardware data stack would be faster than placing them in memory. The reason this is not so is because stack machines are relatively inefficient at accessing deeply buried elements, especially if they may have been spilled out of the stack buffer and into program memory. Much of the time in the all-on-hardware-stack approach is spent thrashing elements about on the stack. While access to memory-resident frame elements is somewhat slower than access to data on the hardware stack, in the end the savings on stack manipulations make up the difference.

9.1.3 A strategy for handling stack frames

A good compromise approach for handling frames is a mixture of the program memory resident and hardware data stack resident methods. The details of the approach are as follows. All procedure calls place the parameters to be passed on the hardware stack. Since these parameters must be computed or moved from one stack frame location to another using the hardware data stack anyway, no time is lost doing this. The procedure call is then made. The called procedure then copies all but one or two of the parameters from the hardware stack into its newly allocated frame as local variables. The compiler must be sure that the parameters left on the data stack are not referenced by address. This can be accomplished with register variable declarations by the programmer, compiler analysis, or just playing it safe and copying all parameters to memory. Having only one or two parameters on the data stack minimizes stack spills and still provides good efficiency gains. When the procedure terminates, the returned value is placed on the hardware stack.

The compromise approach gives good efficiency by saving a large number of references to memory. It also provides an excellent interface between different languages, such as Forth and C. It is relatively easy to write compilers to handle this approach as well. Also, many stack machines now in existence have a hardware frame pointer, so can easily support this scheme.

" -- http://users.ece.cmu.edu/~koopman/stack_computers/sec9_1.html

" 9.1.4 Conventional language execution efficiency

From this discussion, we can see that stack machines can be reasonably efficient at supporting conventional programming languages with their stack frame approach. Of course, stack machines that use a hardware frame pointer into program memory cannot be expected to be as efficient as RISC machines, since RISC machines have massive on-chip register windows for direct frame support or exceedingly clever optimizing compilers that do sophisticated global register allocation.

To the extent that programs in conventional languages conform to the models used by high performance register machines, stack machines will seem to perform poorly. Aspects of conventional programs that cause this effect include: large segments of straight-line code, near-100% cache hit ratios, large numbers of local variables used across long procedures, and shallowly nested procedures that can be compiled as in-line code.

To the extent that programs use structures which run efficiently on stack machines, stack machines can approach or even exceed the performance of register-based machines. Code which is run well by stack machines contains: highly modular procedures with many levels of nesting and perhaps recursion; a relatively small number of frequently used subroutines and inner loops that may be placed in fast memory, and perhaps provided with microcode support; small numbers of local variables passed down through many layers of interface routines; and deeply nested subroutine calls. Also, programs that operate in environments with frequent interrupts and context switches can benefit from using stack machines.

...

One should also keep in mind that there are reasons to select a computer other than raw processing speed for single programs. The reasons that might tilt the balance in favor of stack machines include: interrupt processing speed, task switching speed, low overall system complexity, and the need for application specific microcode and/or hardware support. In the final analysis, stack machines will probably not run many conventional language programs quite as quickly as register-based machines. But, other considerations will largely cancel out the drawbacks, especially for real time control applications, making stack machines an excellent alternative. "

"

9.3 THE USE OF A THIRD STACK

An often proposed design alternative for stack machines is the use of a third hardware stack. The purposes given for adding a third hardware stack are usually for storage of loop counters and local variables.

Loop counters on the current stack machines are generally kept as the top element of the return address stack. This is because subroutines and loops are mutually well nested, and it is considered bad programming style for a subroutine to attempt to access the loop index of its parent procedure. So, while there is some conceptual merit to having loop indices in their own stack to avoid cluttering the return stack with non-address data, the performance and program effectiveness gains are not sufficient to justify the hardware expense.

Local variable storage is another issue. Even when using the Forth language, programmers have found that the concept of compiler-managed local variables can make some programs easier to create and maintain. In order to do this efficiently, the hardware needs access to a stack that is allocated in frames, with random access to locations within the frame. This is a requirement very much like the one for supporting conventional languages. So, in fact, the best solution is probably not to have a third hardware stack at all. Rather, stack machines should support a frame pointer into a software-managed program memory stack that is used both for conventional language support and for local variable support in Forth. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec9_3.html
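
For reference, the "loop counter on top of the return stack" convention mentioned above corresponds to Forth's >R / R@ / R> idiom; a toy rendering in Python (the word names follow Forth, everything else is invented):

    data, rets = [], []                       # data stack, return address stack

    def to_r():    rets.append(data.pop())    # >R : move the counter to the return stack
    def r_fetch(): data.append(rets[-1])      # R@ : copy it back for inspection
    def from_r():  data.append(rets.pop())    # R> : retrieve it when the loop is done

    data.append(3)                            # loop count
    to_r()                                    # park it on the return stack
    while rets[-1] > 0:
        r_fetch()
        print("iteration, index =", data.pop())
        rets[-1] -= 1                         # decrement the counter in place
    from_r(); data.pop()                      # discard the exhausted counter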

"

9.5 TWO IDEAS FOR STACK MACHINE DESIGN

There are two interesting stack machine design details that are not in common usage, but which may prove useful in future designs.

9.5.1 Conditional subroutine returns

One of the design details is an observation by Doran (1972) that stack machines do not need conditional branches; they only need conditional subroutine return instructions. Consider an IF statement in a high level language. If we ignore the optional ELSE clause, an IF statement appears to be a piece of code with one entry point and two exit points. The entry point is the beginning of the statement, where the branch condition is evaluated. The first exit point is if the condition is false, in which case none of the rest of the statement is executed. The second exit point is the end of the statement, when all actions have been completed.

The usual way of implementing an IF statement is by using a conditional branch that is taken if the condition tested by the IF is false. Instead, consider a subroutine that contains the code for the entire IF statement. The entry point to the IF statement is a subroutine call into this special subroutine. The first exit point can be a conditional subroutine return instruction, that only returns if the condition clause of the IF statement is false. The second exit point can be an unconditional return.

What this scheme accomplishes is elimination of conditional branches with embedded addresses. All that is required is a conditional subroutine return statement. This technique is well suited to stack machines, because of the low cost of the initial subroutine call into the IF statement's subroutine and the low cost of the subroutine exits. It may lead to more efficient machines than those currently in use.

9.5.2 Use of the stack for holding code

Another interesting proposal for stack machine program execution was put forth by Tsukamoto (1977). He examined the conflicting virtues and pitfalls of self-modifying code. While self-modifying code can be very efficient, it is almost universally shunned by software professionals as being too risky. Self-modifying code corrupts the contents of a program, so that the programmer cannot count on an instruction generated by the compiler or assembler being correct during the full course of a program run.

Tsukamoto's idea allows the use of self-modifying code without the pitfalls. He simply suggests using the run-time stack to store modified program segments for execution. Code can be generated by the application program and executed at run-time, yet does not corrupt the program memory. When the code has been executed, it can be thrown away by simply popping the stack.

Neither of these techniques is in common use today, but either one or both of them may eventually find an important application. " -- http://users.ece.cmu.edu/~koopman/stack_computers/sec9_5.html
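
Doran's idea is easy to sketch: compile the whole IF statement as a subroutine whose first instruction is "return if false", so the only control transfers are a call, a conditional return, and an unconditional return. A toy interpreter in Python (opcodes and program layout invented):

    def run(program, data, entry=0):
        rets, pc = [], entry
        while True:
            op = program[pc]; pc += 1
            if op == "LIT":
                data.append(program[pc]); pc += 1
            elif op == "CALL":
                rets.append(pc + 1); pc = program[pc]
            elif op == "RET_IF_FALSE":          # first exit point of the IF body
                if not data.pop():
                    pc = rets.pop()
            elif op == "PRINT":
                print(data.pop())
            elif op == "RET":                   # second (unconditional) exit point
                if not rets:
                    return
                pc = rets.pop()

    # main: push the condition, call the IF body at address 5, stop
    prog = ["LIT", True, "CALL", 5, "RET",
            "RET_IF_FALSE", "LIT", "condition was true", "PRINT", "RET"]
    run(prog, [])                               # prints only when the condition is true

No conditional branch with an embedded target address appears anywhere; changing True to False skips the body via the conditional return.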

" Second-Generation Stack Computer Architecture

Charles Eric LaForest