proj-plbook-plChMinLangs

Difference between revision 30 and current revision

No diff available.

Table of Contents for Programming Languages: a survey

and esoteric langs

LOOP

http://web.archive.org/web/20120311032544/http://loopgotowhile.eugenkiss.com/

GOTO

http://web.archive.org/web/20120311032544/http://loopgotowhile.eugenkiss.com/

WHILE

http://web.archive.org/web/20120311032544/http://loopgotowhile.eugenkiss.com/

BASIC

Because of its simplicity and historical popularity, BASIC gets its own chapter.

See chapter on BASIC.

Visual Basic

Tutorials:

Hypercard

Logo

some notes on this history of logo are in the middle of:

http://web.augsburg.edu/~crockett/210/210%20Lab%20Set%204/Reed_OOP_Epistemology.pdf

Jones' A Minimal CISC

http://david.carybros.com/html/minimal_instruction_set.html

8 instructions:

Jones' The Ultimate RISC

http://homepage.cs.uiowa.edu/~jones/arch/risc/

Minimal Forths

http://stackoverflow.com/questions/407987/what-are-the-primitive-forth-operators?rq=1

http://newsgroups.derkeiler.com/Archive/Comp/comp.lang.forth/2005-09/msg00337.html

http://code.google.com/p/subtle-stack/downloads/detail?name=jonesforth.s.txt

http://groups.google.com/group/comp.lang.forth/msg/10872cb68edcb526

" A long time ago, I had a book called "Threaded Interpretive Languages", published I think by Byte, that discussed how to implement a Forth-like language (I don't think they ever called it Forth) in Z80 assembly.

You may not have a Z80 handy, or want one, but the book might be instructive.

+1 for Loeliger's book. The book NEVER said the language was FORTH. Between that book and the Ritter & Walker paper in Byte around the same time, I learned enough to build a reentrant direct-threaded code interpreter for a test set project a few years later at General Dynamics. – John R. Strohm Dec 13 '09 at 4:41

+1 for that book which I love, it was the book which hooked me on the idea of inventing and studying languages – Andy Dent Apr 18 '10 at 10:16

I picked up that book used based on this recommendation. It's great! – Barry Brown

"

Var'aq

A 'Klingon' language. Stack-based.

4 data types:

Interesting stack operations:

Interesting control flow and variable operations:

Interesting list operations:

Interesting string operations:

Links:

---

BF

The real name is Brainf*, a curse word.

There are 8 commands (from http://web.archive.org/web/20060202010638/http://www.iwriteiam.nl/Ha_BF.html ):

Thue

http://esolangs.org/wiki/Thue

MicroMini VM

Stack machine.

https://github.com/jarcane/MicroMini/blob/master/docs/MicroMini.txt

Instructions:

---

related to the York Reduceron

F-lite

F-lite is a subset of Haskell.

It is lazy. It is untyped. Constructor expressions like 'Cons x xs' can be used, but Cons does not have to be defined first.

It supports:

Semicolons are required to separate equations. The function 'main', which returns an integer, is run.

Code examples:

  append Nil ys = ys;
  append (Cons x xs) ys = Cons x (append xs ys);


  zipWith f Nil ys = Nil;
  zipWith f (Cons x xs) Nil = Nil;
  zipWith f (Cons x xs) (Cons y ys) = Cons (f x y) (zipWith f xs ys);

  init (Cons x Nil) = Nil;
  init (Cons x (Cons y ys)) = Cons x (init (Cons y ys));

  pow Nil = Cons Nil Nil;
  pow (Cons x xs) = let { rest = pow xs } in
                      append rest (map (Cons x) rest);


  repeat x = let { xs = Cons x xs } in xs;

  negate n = (-) 0 n;

  fromTo n m = case (<=) n m of {
                 True -> Cons n (fromTo ((+) n 1) m);
                 False -> Nil;
               };


  sayHi k = emit 'h' (emit 'i' (emit '!' k))

Full F-lite program to display the 10th fibonacci number:

  fib n = case (<=) n 1 of {
            True  -> 1;
            False -> (+) (fib ((-) n 2)) (fib ((-) n 1));
          };

  emitStr Nil k = k;
  emitStr (Cons x xs) k = emit x (emitStr xs k);

  main = emitStr "fib(10) = " (emitInt (fib 10) (emit '\n' 0));

F-lite has been extended with generics to form G-lite.

Links:

The reduceron's "template code" (its ISA)

[2] section 2.7 page 8

Example from [3] page 33 PDF page 47:

Flite append function translated from Flite to Template code

Figure 3.3:

append xs ys = case (xs) of {
Cons x xs -> Cons x (append xs ys);
Nil       -> ys; }
main = append Nil Nil;

(0,[FUN 2 1,CON 0 1, PTR 0],[[CON 0 1]])
(2,[ARG 0, TAB 2, ARG 1],[])
(2,[ARG 1],[])
(4,[CON 2 0, ARG 0, PTR 0],[[FUN 2 1, ARG 1, ARG 3]])

" The template number zero (0,[FUN 2 1, CON 0 1, PTR 0],[[CON_0_1?]]) , is the main program, of arity zero. The atom FUN 2 1 indicates a function of arity two and index one, FUN 2 1 is applied to two arguments CON 0 1 which is the encoding for Nil . The atom PTR 0 is a reference to the off-spine application number zero, which contains the second argument of the function append.

Now, the template number one is the function append of arity two (2,[ARG 0, TAB 2, ARG 1],[]) . In the spine application ARG 0 is the case subject xs , the TAB 2 atom is a reference to the case alternatives, for example if the case subject ARG 0 evaluates to Nil then the second template will be evaluated. Otherwise, the third template will be evaluated ((i don't understand this, why shouln't 'CON 2 0' (cons) cause the first one to be evaluated, and 'CON 0 1' (nil) cause the second one to be evaluated?)). The atom ARG 1 is the free variable ((ys)) used in all case alternatives.

The template number three is the recursive case alternative (4,[CON 2 0, ARG 0, PTR 0], [[FUN_2_1,_ARG_1,_ARG_3?]]) . The arity of this template is four, the reason is the special encoding which includes the two arguments ARG 0 and ARG 1 representing x and xs , an argument for case table and one free variable. The first atom CON 2 0 is a constructor which expects two arguments and the index of this constructor is zero. The first argument is ARG 0 and the second is a pointer the application [FUN 2 1, ARG 1, ARG 3] . This application corresponds to the recursive application at the high-level (append xs ys) . "

Types of atoms:

data Atom
 = FUN Arity Int   // Function with arity a and address i
 | ARG Int         // Reference to the i-th the function argument
 | PTR Int         // Pointer to the i-th application
 | CON Arity Int   // Constructor of arity a and index i
 | INT Int         // Integer literal
 | PRI String      // Primitive function (named)
 | TAB Int         // Case table

[4] and [5]

"Each template can be viewed as a low-level encoding of a let expression. The let-bindings are encoded as off-spine applications and the body of the let as the spine-application . Each template ( arity;app;apps ) in Reduceron template code is a tuple of 'arity', spine application 'app' and off-spine applications 'apps'. Instead of names the results of the off-spine applications are referred to by position (eg. PTR 0 for the first off-spine application)." [6]

relating to the CON (constructor) atom, "A variant of Scott/Jansen encoding combines data construction and case selection. A constructor of datatype d translates to a function that in addition to components takes as arguments alternative continuations for each possible d-construction." [7].

The rules for how a computation proceeds in this template language are given in The Reduceron Reconfigured section 3 page 4, or The Reduceron Reconfigured talk slides at ICFP 2010, slides starting with 'Reduceron machine state', or Static Methods to Check Low-Level Code for a Graph Reduction Machine by Marco Polo Perez Cervantes (thesis) section 3.4 page 34 PDF page 48, or The Reduceron reconfigured and re-evaluated by Matthew Naylor and Colin Runciman section 3 page 11 (also at https://github.com/tommythorn/Reduceron/blob/666e78d1ec4a9fd42438a6f4f45bd861a5d2c1b0/semantics/Reduceron.lhs#L127).

To summarize, the machine has a program code memory (the templates), a heap (storing various expressions; PTRs are indices into the heap), a reduction stack (the current expression being reduced), and an update stack (stating where on the heap to write back the current expression after it has been reduced). When the TOS (top of stack) of the reduction stack is a PTR, the PTR is dereferenced and the corresponding expression is copied from the heap onto the TOS of the reduction stack, and an entry is pushed onto the update stack to remember to save the result back to that heap location after this expression has been reduced. At each step, we check to see if the TOS of the reduction stack has been reduced, and if so, we write it back to the heap as directed by the update stack. Otherwise, we evaluate primitives, apply functions, and turn case tables into functions to be applied.

Links:

---

Simpl/Com

" The syntax of Simpl is defined by the polymorphic data type ( 0 s, 0 p, 0 f ) com , where 0 s Definition 2.1 Syntax of Simpl I is the state space type, 0 p the type of procedure names and 0 f the type of faults. The constructors are listed in the following table, where c , c 1 and c 2 are Simpl commands of type ( 0 s, 0 p, 0 f ) com and the Boolean condition b and the guard g are modelled as state sets of type 0 s set .

Skip Do nothing Basic f Basic command, where f is a state-update: 0 s ⇒ 0 s Seq c 1 c 2 Sequential composition, also written as c 1 ; c 2 Cond b c 1 c 2 Conditional statement (if-then-else) While b c Loop Call p Procedure call, p is of type 0 p Guard f g c Guarded command, fault f is of type 0 f Throw Initiate abrupt termination Catch c 1 c 2 Handle abrupt termination Spec r Basic (nondeterministic) command defined by the relation r of type ( 0 s × 0 s ) set DynCom? c s Dynamic (state dependent) command, where c s has type 0 s ⇒ ( 0 s, 0 p, 0 f ) com

... Runtime faults are modelled by the guarded command Guard f g c , where g is the guard that watches for runtime faults that may occur in c . If g is violated the fault f is signalled and the computation stops.

The core procedure call Call p is parameterless. Parameter passing is imple- mented in section 2.4 .

Throw and Catch are the basic building blocks to model various kinds of abrupt termination, like break , continue , return and exceptions. In Catch c 1 c 2 the com- mand c 2 can be seen as handler. It is only executed if c 1 terminates abruptly.

Command Spec r defines the potential next states via the relation r . It can be used to model nondeterministic choice or in the context of refinement [ 72 , 28 ] it can represent a specification of a piece of code rather than an implementation.

The dynamic command DynCom? c s allows to abstract a command over the state space. It can be used for di ff erent purposes: To implement scoping, parameter passing, expressions with side-e ff ects or “real” dynamic construct like pointers to procedures or dynamic method invocations. Details follow in section 2.4 .

The set of Simpl commands is not minimal. A Skip can be implemented by Basic ( λ s . s ), the dynamic command DynCom? can be used to implement the condi- tional, and the Spec command can be used to implement the Basic command. This separation reflects the di ff erent concepts behind the commands."

...

2.4.2 Concrete Syntax for Simpl To improve readability of Simpl programs I introduce pseudo-code syntax. First of all let me address the assignment. With the state represented as record, an assignment m = m - 1 is mapped to a Basic command that performs a record update:

Basic ( λ s . s (

m : = m s − 1)).

...

concrete syntax abstract syntax SKIP Skip m := e Basic ( λ s . s (

c 1 ; c 2 Seq c 1 c 2 IF b THEN c 1 ELSE c 2 FI Cond { s . b } c 1 c 2 IF b THEN c 1 FI Cond { s . b } c 1 Skip WHILE b DO c OD While { s . b } c TRY c 1 CATCH c 2 END Catch c 1 c 2 g → c Guard False g c {
m : = e ))
P } { s . P }

--- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.128.388&rep=rep1&type=pdf , section 2.1, page 16, PDF page 30

" Basic models the atomic transitions in the embedded language’s state space, e.g. variable assignment or heap update.

Seq provides sequential composition,

Cond and While conditional and iterative statements respectively. It should be observed that the condition expressions are side-effect free.

Spec can be used to describe non-deterministic transitions, but we do not make any use of this in the following.

Function call semantics map to Call and DynCom? statements.

Throw and Catch give an exception mechanism with which one can model abrupt termination in language statements like return .

Finally, Guard s lead to failure semantics if guard expressions do not hold prior to execution of parameter programs. We drop the χ parameter when describing Guard programs in the following, as from a soundness point-of-view the cause of failure semantics is irrelevant " -- [8]

---

HITECO

A pedagogical self-compiling compiler.

[Chapter 21, Self-compiling Compilers, of Symbolic Processing In Pascal, by Professor Manfred von Thun

---

Bcompiler

A pedagogical self-compiling compiler by Edmund Grimley Evans.

https://github.com/certik/bcompiler

There were a number of bootstrapping stages, divided into 5 larger phases labeled HEX1 through HEX5.

In HEX4, there are labels, 'push constant', 'push label address', 'call label address', and the following predefined functions:

^ ~

(and jump, which doesn't appear to be in that table; and presumably other stuff)

HEX5 contains (presumably in addition to HEX4): string constants, named variables, if/then/else, while loops with break and continue, and probably other stuff too.

Links

---

Lox

Pedagogical language from Crafting Interpreters

http://craftinginterpreters.com/the-lox-language.html

C-style syntax. Dynamic. Automatic memory management.

Types: booleans, numbers, strings, nil

Statements and expressions.

Expressions:

+, -, *, /, also - (as arithmetic negation)

< <= > >= == !=

! and or ('and' and 'or' are short-circuiting and so are also control flow operators)

statements:

'print'

blocks {}

grouping by parens ()

variable declaration with 'var'

variable assignment with '='

if, while, for

functions. Define functions with 'fun' keyword. Functions are first-class. Closures.

classes with (single inheritance) subclasses, fields, methods, 'this', 'init' methods, 'super'. The syntax for class declaration is the 'class' keyword. The syntax for construction is just use the classname, eg "var breakfast = Breakfast();" The syntax for subclasses is '<', eg:

" class Brunch < Breakfast { drink() { print "How about a Blood Mary?"; } } "

stdlib: 'print', 'clock'

token types:

" enum TokenType? { Single-character tokens. LEFT_PAREN, RIGHT_PAREN, LEFT_BRACE, RIGHT_BRACE, COMMA, DOT, MINUS, PLUS, SEMICOLON, SLASH, STAR,

  // One or two character tokens.
  BANG, BANG_EQUAL,
  EQUAL, EQUAL_EQUAL,
  GREATER, GREATER_EQUAL,
  LESS, LESS_EQUAL,
  // Literals.
  IDENTIFIER, STRING, NUMBER,
  // Keywords.
  AND, CLASS, ELSE, FALSE, FUN, FOR, IF, NIL, OR,
  PRINT, RETURN, SUPER, THIS, TRUE, VAR, WHILE,
  EOF} " -- [9]

---

Lithp

https://web.archive.org/web/20050219051229/http://www.cis.rit.edu/~jerry/Software/lithp/README

"a basic, tiny LISP implementation. It was created to be a configuration/logic file format for a game I am working on."

Variables, Functions:

Numbers:

Comparisons:

Evaluations:

Lists:

Output:

---

META II 'order codes'

the following is quoted from [10], following figures 6.1 and 6.2 of [11] D. V. Schorre, "META II: A Syntax-Oriented Compiler Writing Language," Proceedings of the 19th ACM National Conference (1964), ACM Press, New York, NY, 41.301-41.3011.] http://doi.acm.org/10.1145/800257.808896

Below is a list of all of the operators of the META II parsing machine as given in [Schorre64]. These are called order codes just as the assembly language operations of a computer of that era were called. This is the language that appears in the Code window. These are the op codes of the META II virtual machine. These are the objects and operations of the domain of META II compiling. They are complete in themselves. To move Meta II to a different language these and only these are the objects and operations that must be implemented in the target language.

Mnemonic Purpose Actions

    label 1 cell, initialized to blank
    label 2 cell, initialized to blank
    location cell, set to the return from call location

META II Pseudo Operations (following figure 6.3 of [Schorre64]):

---

VALGOL II 'order codes'

I think VALGOL II is implemented with META II in the paper [12] D. V. Schorre, "META II: A Syntax-Oriented Compiler Writing Language," Proceedings of the 19th ACM National Conference (1964), ACM Press, New York, NY, 41.301-41.3011.] http://doi.acm.org/10.1145/800257.808896 .

NOTE 1: If the top item in the stack is an address, it is replaced by its contents before beginning this operation

NOTE 2: Same as note 1 but applies to the top two items

CONSTANT AND CONTROL CODES:

---

IBNIZ

A VM designed for size-constrained demoscene 'demos'; "a programming language that generates video and audio from very short strings of code".

" two-stack machine somewhat similar to Forth, but with the major execption that the stack is cyclical and also used at an output buffer. Also, as every IBNIZ program is implicitly inside a loop that pushes a set of loop variables on the stack on every cycle, even an empty program outputs something (i.e. a changing gradient as video and a constant sawtooth wave as audio). ... runs the program in two separate contexts with separate stacks and internal registers: the video context and the audio context ... The scheduling and main-looping logic is the only somewhat complex thing in IBNIZ. All the rest is very elementary, something that can be found as instructions in the x86 architecture or as words in the core Forth vocabulary. Basic arithmetic and stack-shuffling. Memory load and store. An if/then/else structure, two kinds of loop structures and subroutine definition/calling. Also an instruction for retrieving user input from keyboard or pointing device. Everything needs to be built from these basic building blocks. And yes, it is Turing complete, and no, you are not restricted to the rendering order provided by the implicit main loop. " [13]

" Word width: 32 bits (arithmetic in 16.16 fixed-point) Address space: 2^20 words (4 megabytes, ~3 of which free user RAM) Video output: 256x256 pixels at 60 Hz, 32 bits per pixel (VVUU.YYYY) Audio output: 61440 Hz mono (30720 Hz stereo), 16 bits per sample " [14]

Instructions (from [15]):

---

TIS-100

From the game TIS-100

http://www.zachtronics.com/images/TIS-100P%20Reference%20Manual.pdf

https://unwiredben.livejournal.com/308306.html

https://alandesmet.github.io/TIS-100-Hackers-Guide/assembly.html

---

EXA

From the game EXAPUNKS

http://exapunks.wikia.com/wiki/EXA_instructions

---

LC-3

Pedagogical language

Variants:

"The LC-3b ISA describes a modified version of the LC-3 that includes the following changes:

    The machine's word size remains 16 bits, but its memory is now byte-addressable with the same address space.
    The LD and ST instructions (load and store data using PC-relative addressing) have been removed.
    The LDI and STI instructions (indirect loads and stores) use register-based addressing instead of PC-relative addressing.
    Two instructions, LDB and STB, have been added to manipulate individual bytes of memory; the other load and store instructions continue to act on entire words.
    The reserved opcode has been converted into a shift instruction, SHF, that supports arithmetic and logical shifts of arbitrary size in both directions.

" [16]

---

StoneKnifeForth

https://github.com/kragen/stoneknifeforth

bootstrap interpreter in Python:

(main part of?) StoneKnifeForth?:

"a very simple language inspired by Forth. It is not expected to be useful; instead, its purpose is to show how simple a compiler can be. The compiler is a bit under two pages of code when the comments are removed. " -- [17]

"

compile_time_dispatch = { '(': eat_comment, 'v': dataspace_label, ':': define_function, 'b': literal_byte, '#': literal_word, '*': allocate_space, '^': set_start_address, '[': start_conditional, ']': end_conditional, '{': start_loop, '}': end_loop, ' ': eat_byte, '\n': eat_byte, "'": skip_literal_byte, }

...

run_time_dispatch = { '(': jump, 'W': write_out, 'G': read_byte, 'Q': quit, '-': subtract, '<': less_than, '@': fetch, '!': store, # 'f': fetch_byte, not yet needed 's': store_byte, ';': return_from_function, '[': conditional, ']': nop, '{': nop, '}': loop, ' ': nop, '\n': nop, "'": literal_byte, }" -- https://github.com/kragen/stoneknifeforth/blob/master/tinyboot.py

" It’s a dumbed-down Forth with no compile-time execution, in which only the first byte of each word is significant, with some heterodox names for functions. Bytes preceded by ' are literals for their numerical value, e.g. “'A” is 65.

What that means, if you know how to program but don’t know Forth:

Each word of the program is generally a subroutine call. Words are separated by spaces. There are no types. There’s an implicit stack; words pop their input off the stack and push their output onto it. Numbers and literal bytes are pushed onto the stack.

So “2 3 + 4 -” calculates 2 + 3 - 4; “2 3 4 + -” calculates 2 - [3 + 4]. “'Z 'A -” calculates the ASCII value of “Z” minus the ASCII value of “A”, i.e. it calculates 25. [“+” isn’t in the base language implemented by the compiler; it’s a user-defined function. “-” is in the base language.]

“: foo” marks the beginning of a subroutine called “foo”; “;” is the return instruction. So “: . %flush u ;” defines a subroutine called “.” which consists first of calling “%flush”, then calling “u”, then returning.

“var foo” marks the beginning of a variable named “foo”; the word “foo” thereafter will push its address onto the stack. “# 37” includes the number 37 in the program at that point as a 32-bit number, so “var X # 0” creates a variable “X” and puts four bytes of zeroes there.

“@” and “!” fetch from and store to a memory address respectively. So “var X # 0 : dup X ! X @ X @ ;” creates a variable X, gives it 32 bits of storage initialized to 0, and defines a subroutine called “dup”. “dup” takes whatever is on the stack, stores it in those 32 bits of storage at “X”, then fetches it out of X twice, leaving two copies of it on the stack.

Combining all of the above, “var HERE # 0 : forward HERE @ + HERE ! ;” fetches the value at the location “HERE”, adds whatever was on the stack to it, and stores the result back at “HERE”.

Compile-time primitives: ( — defines a comment that extends to the next right-paren v — defines a dataspace label, like for a variable. b — compiles a literal byte, numerically, into data space.

  1. — compiles a literal four-byte little-endian number into data space. ^ — the location where the program should start executing [everything else is just definitions] space, newline — ignored
— defines a function [ — begins a conditional block, which consumes the value on top of the stack and is skipped if that value is zero ] — ends a conditional block, matching the latest unmatched [ { — begins a loop } — ends a loop — consumes the value on top of the stack and loops back to the matching { if it is nonzero

Run-time primitives: G — gets a byte from stdin W — given an address and size on the stack, writes the specified number of bytes to stdout Q — exits the program with a return code of 0

Defined in this file, all necessarily run-time: h — the ELF header e — the location of the entry point word in the ELF header o — the location of the word in the ELF program header that specifies the .ORG S — the location of the word in the ELF program header that specifies filesize $ — the start of the area where the program gets compiled + — addition Z — the end of the predefined blob of bytes to output, with slight changes H — the point where the program is currently getting compiled f — increases H by its argument X, Y — two temporary variables d — DUP p — DROP x — SWAP

— an equality test

> — a greater-than test D — dispatch a byte being compiled R — compile the sequence `xchg %ebp, %esp`, buffered by a peephole optimizer . — add a byte to the program being compiled, flushing the peephole buffer u — add a byte to the program being compiled, used to implement the peephole buffer U — the peephole buffer itself, a boolean storing whether there’s a pending `xchg %ebp, %esp` T — table of types of definitions A — table of addresses of definitions F — multiply by four C — register a Colon definition in the tables V — register a Variable definition in the tables a — translate an address from compiler address space to user address space & — compile a byte if it’s a colon or variable definition O — boolean OR N — boolean NOT w — like G, but skips whitespace B — a buffer for w E — stick the item on top of the stack at H as four bytes, advancing H by 4 L — compile code to push the constant on the compile-time stack t — one step in base-ten decoding: x <- 10x + y n — parse a number from input and push it on the compile-time stack J — compile a forward conditional jump [ P — resolve a forward conditional jump ] j — compile a backwards negative conditional jump } M — the location in the compiler’s memory for the system call routine " -- [18]

" The entire “trimmed” source code is 1902 bytes, which is less than half the size of the nearest comparable project that I’m aware of, otccelf, which is 4748 bytes.

As demonstrated by the interpreter written in Python, the programming language itself is essentially machine-independent, with very few x86 quirks:

...

Why? To Know What I’m Doing ... "Alan Kay frequently expresses enthusiasm over the metacircular Lisp interpreter in the Lisp 1.5 Programmer’s Manual. ... But if you try to implement a Lisp interpreter in a low-level language by translating that metacircular interpreter into it...you run into a problem. The metacircular interpreter glosses over a large number of things that turn out to be nontrivial to implement — indeed, about half of the code is devoted to things outside of the scope of the metacircular interpreter. Here’s a list of issues that the Lisp 1.5 metacircular interpreter neglects, some semantic and some merely implementational:

...

A metacircular compiler forces you to confront this extra complexity. Moreover, metacircular compilers are self-sustaining in a way that interpreters aren’t; once you have the compiler running, you are free to add features to the language it supports, then take advantage of those features in the compiler.

So this is a “stone knife” programming tool: bootstrapped out of almost nothing as quickly as possible.

Why? To Develop a Compiler Incrementally

When I wrote Ur-Scheme, my thought was to see if I could figure out how to develop a compiler incrementally, starting by building a small working metacircular compiler in less than a day, and adding features from there. I pretty much failed; it took me two and a half weeks to get it to compile itself successfully.

Part of the problem is that a minimal subset of R5RS Scheme powerful enough to write a compiler in — without making the compiler even larger due to writing it in a low-level language — is still a relatively large language. Ur-Scheme doesn’t have much arithmetic, but it does have integers, dynamic typing, closures, characters, strings, lists, recursion, booleans, variadic functions, let to introduce local variables, character and string literals, a sort of crude macro system, five different conditional forms (if, cond, case, and, or), quotation, tail-call optimization, function argument count verification, bounds checking, symbols, buffered input to keep it from taking multiple seconds to compile itself, and a library of functions for processing lists, strings, and characters. And each of those things was added because it was necessary to get the compiler to be able to compile itself. The end result was that the compiler is 90 kilobytes of source code, about 1600 lines if you leave out the comments.

Now, maybe you can write 1600 lines of working Scheme in a day, but I sure as hell can’t. It’s still not a very large compiler, as compilers go, but it’s a lot bigger than otccelf. So I hypothesized that maybe a simpler language, without a requirement for compatibility with something else, would enable me to get a compiler bootstrapped more easily.

So StoneKnifeForth? was born. It’s inspired by Forth, so it inherits most of Forth’s traditional simplifications:

And it added a few of its own:

Surprisingly, the language that results is still almost bearable to write a compiler in, although it definitely has the flavor of an assembler.

Unfortunately, I still totally failed to get it done in a day. It was 15 days from when I first started scribbling about it in my notebook until it was able to compile itself successfully, although git only shows active development happening on six of those days (including some help from my friend Aristotle). So that’s an improvement, but not as much of an improvement as I would like. At that point, it was 13k of source, 114 non-comment lines of code, which is definitely a lot smaller than Ur-Scheme’s 90k and 1600 lines. (Although there are another 181 lines of Python for the bootstrap interpreter.)

It’s possible to imagine writing and debugging 114 lines of code in a day, or even 300 lines. It’s still maybe a bit optimistic to think I could do that in a day, so maybe I need to find a way to increase incrementality further.

My theory was that once I had a working compiler, I could add features to the language incrementally and test them as I went. So far I haven’t gotten to that part.

...

Related work

Andre Adrian’s 2008 BASICO: http://www.andreadrian.de/tbng/

        is a small imperative programming language that is just powerful enough to compile itself (compiler bootstrapping).
        has no GOTO, but has while-break-wend and multiple return
        has C-like string handling.
        is implemented in less then 1000 source code lines for the compiler.
        produces real binary programs for x86 processors, not P-code or Byte-Code.
        uses the C compiler toolchain (assembler, linker)
        uses C library functions like printf(), getchar(), strcpy(), isdigit(), rand() for run-time support.

Actually it produces assembly, not executables.

Version 0.9 was released 15 Jul 2006. The 1000 source lines include a recursive-descent parser and a hand-coded lexer.

Sample code:

return 1 if ch is in s, 0 else func in(ch: char, s: array char): int var i: int begin i = 0 while s[i] # 0 do if ch = s[i] then return 1 endif i = i + 1 wend return 0 end

FIRST and THIRD, from the IOCCC entry.

Ian Piumarta’s COLA system.

Oberon.

Fabrice Bellard’s OTCC.

F-83.

eForth, especially the ITC eForth.

Jack Crenshaw’s Let’s Build a Compiler. This is a how-to book that walks you through an incrementally-constructed compiler for a toy language, written in Pascal, in about 340 pages of text. The text is really easy to read, but it will still take at least three to ten hours to read. It uses recursive-descent parsing, no intermediate representation, and it emits 68000 assembly code.

Ikarus.

Ur-Scheme.

PyPy?.

Bootstrapping a simple compiler from nothing: Edmund GRIMLEY EVANS 2001 http://web.archive.org/web/20061108010907/http://www.rano.org/bcompiler.html

" -- [19]

" /* StoneKnifeForth? (currently) only uses the following 23 * instructions, in decimal: * * - 15 157 192: setge %al * - 15 182 0: movzbl (%eax), %eax * - 15 190 192: movsbl %al, %eax * - 41 4 36: sub %eax, (%esp) * - 80: push %eax * - 88: pop %eax * - 89: pop %ecx * - 90: pop %edx * - 91: pop %ebx * - 116: jz (short, PC-relative) (1-byte operand) * - 117: jnz (short, PC-relative) (1-byte operand) * - 129 237: subtract immediate from %ebp (4-byte operand) * - 133 192: test %eax, %eax * - 135 236: xchg %ebp, %esp * - 136 8: mov %cl, (%eax) * - 137 229: mov %esp, %ebp * - 139 0: mov (%eax), %eax * - 143 0: popl (%eax) * - 184: load immediate %eax (e.g. mov $12345678, %eax) (4-byte operand) * - 195: ret * - 205 128: int $0x80 * - 232: PC-relative call to absolute address (4-byte operand) * - 254 200: dec %al * * Moreover, jz and jnz are always preceded two instructions * previously by a test instruction, and setge is always preceded two * instructions previously by a sub instruction, so of the flags, only * ZF and SF are used, and their state is never saved or restored or * transferred across jumps. * */ " -- [20]

" Register-use conventions: `%eax` — the top of the operand stack `%esp` — normally, points to the rest of the operand stack in memory, in the usual `%esp` way: growing downward, predecrement, postincrement. Temporarily swapped with `%ebp` during `call` and `ret` instructions. `%ebp` — normally, points to the return stack in memory, in the usual `%esp` way. `%ecx` — occasionally used as a scratch register Other registers are only used to pass arguments to Linux system calls. ) " -- [21]

Inertia register machine

https://github.com/Conceptual-Inertia/Inertia/

https://github.com/Conceptual-Inertia/Inertia/blob/master/docs/instr_file.txt

used in A Performance Survey on Stack-based and Register-based VirtualMachines

i am assuming the following details from reading the above-referenced instr_file.txt; i could have misunderstood:

16 instructions: add div mul ltn (less-than) eql and not or inc dec print load goto if (conditional skip) return call

4 registers

3-address code

the three operands fields have 4 modes: immediate constant, register direct, absolute address in memory, register indirect


Wirth p-code example machine

https://en.wikipedia.org/wiki/P-code_machine#Example_machine


CHIFIR

http://www.vpri.org/pdf/tr2015004_cuneiform.pdf

Table 1, pg. 10:

" Table 1. Chifir instruction set Opcode Semantics 1 PC←M?[A] 2 If M[B] = 0, then PC←M?[A] 3 M[A]←PC 4 M[A]←M[B] 5 M[A]←M[M[B]] 6 M[M[B]]←M[A] 7 M[A]←M[B] + M[C] 8 M[A]←M[B] - M[C] 9 M[A]←M[B]×M[C] 10 M[A]←M[B]÷M[C] 11 M[A]←M[B] modulo M[C] 12 If M[B] <M[C], then M[A]←1, else M[A]←0 13 M[A]←NOT(M[B] AND M[C]) 14 Refresh the screen 15 Get one character from the keyboard and store it in M[A] "

---

L2

https://github.com/murisi/L2

---

T3X

http://t3x.org/t3x/

---

INTCODE

http://www.gtoal.com/languages/bcpl/amiga/bcpl/booting.txt

---

Linoleum

https://en.wikibooks.org/wiki/Linoleum

---

Maru

https://www.piumarta.com/software/maru/

---

MIPSter

http://www.cs.uni-salzburg.at/~ck/content/publications/conferences/Onward17-Selfie.pdf

" MIPSter is a tiny subset of MIPS32. There are only 16 MIP-Ster instructions (nop,addu,subu,multu,divu,mfhi,mflo,slt,jr,syscall,addiu,lw,sw,beq,jal,j). MIPSter re-flects what is needed to implement C*. In particular, thereare no bitwise instructions and no sub-word data transferinstructions.The only issue is with signed integer literals in C* (32 bits)that are too large to fit as immediate values in MIPSter (16bits). Such values are decomposed by the compiler into thelargest numerals in base two that can still be loaded into reg-isters and then reconstructed with the available instructions.MIPSter itself can be taught in around two weeks of classeseven to audiences with little architecture background "

From the Selfie project

---

Woolite

Educational language in a compiler course.

http://www.cs.williams.edu/~tom/courses/434/project/woolite/Woolite.pdf

---

Pocketlang

https://thakeenathees.github.io/pocketlang/getting-started-learn-in-15-minutes.html https://thakeenathees.github.io/pocketlang/index.html

---

Bauer and Pretnar's Programming Languages Zoo

https://plzoo.andrej.com/index.html

"a collection of miniature programming languages which demonstrates various concepts and techniques used in programming language design and implementation"

---

nand2tetris Jack

https://www.csie.ntu.edu.tw/~cyy/courses/introCS/14fall/lectures/handouts/lec13_Jack_4up.pdf

" Jack language specification

(for complete language specification, see the book).

Jack syntax ...

Jack data types

Primitive types (Part of the language; Realized by the compiler):

Abstract data types (Standard language extensions; Realized by the OS / standard library):

Jack variable kinds and scope ...

A Jack expression is any one of the following:

(boolean and and or operators, bit-wise)

Jack Statements let varName = expression; or let varName[expression] = expression;

if (expression) { statements } else { statements }

while (expression) { statements } do function-or-method-call;

return expression; or return;

Jack subroutine calls General syntax: subroutineName(arg0, arg1, ... ) where each argument is a valid Jack expression Parameter passing is by-value (primitive types) or by-reference (object types)

Example 1: Consider the function (static method): function int sqrt(int n) This function can be invoked as follows: sqrt(17) sqrt(x) sqrt((b * b) – (4 * a * c)) sqrt(a * sqrt(c ‐ 17) + 3) Etc. In all these examples the argument value is computed and passed by-value

Example 2: Consider the method: method Matrix plus (Matrix other); If u and v were variables of type Matrix, this method can be invoked using: u.plus(v) The v variable is passed by-reference, since it refers to an object.

Noteworthy features of the Jack language

Q: Why did we introduce these features into the Jack language? A: To make the writing of the Jack compiler easy! Any one of these language features can be modified, with a reasonable amount of work, to make them conform to a more typical Java-like syntax.

Jack program structure class ClassName? { field variable declarations; static variable declarations; constructor type { parameterList ) { local variable declarations; statements } method type { parameterList ) { local variable declarations; statements } function type { parameterList ) { local variable declarations; statements } }

About this spec:

A Jack program:

" -- [22]

Jack variable kinds are: static, field (ie an object's field), local, parameter variables

Hurl

https://ntietz.com/blog/introducing-hurl/

all control flow via resumable exceptions

GUIX boostrapping

https://guix.gnu.org/en/blog/2023/the-full-source-bootstrap-building-from-source-all-the-way-down/


Footnotes:

1.