proj-oot-ootAssemblyNotes9

An Efficient Implementation of SELF, a Dynamically-Typed Object-Oriented Language Based on Prototypes.

  SELF                               push self onto the execution stack
  LITERAL <value index>              push a literal value onto the execution stack
  SEND <message name index>          send a message, popping the receiver and arguments off the execution stack and pushing the result
  SELF SEND <message name index>     send a message to self, popping the arguments off the execution stack and pushing the result
  SUPER SEND <message name index>    send a message to self, delegated to all parents, popping the arguments off the execution stack and pushing the result
  DELEGATEE <parent name index>      delegate the next message send to the named parent
  NON-LOCAL RETURN                   execute a non-local return from the lexically-enclosing method activation
  INDEX-EXTENSION <index extension>  extend the next index by prepending the index extension

"The byte codes needed to express SELF programs fall into only three classes: base values (LITERAL and SELF), message sends, and non-local return." ((and index extension, which is encoding))

SEND is sorta just CALL and a SELF or other object sorta just provides a "context" of mutable state for the call; consider self.x used to load the current value of instance variable x onto the stack. for oot, this suggests a system with both multi methods but also a single stateful/world 'context' for each call (by 'world' i refer to the undoable versionable state worlds)
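
To make that byte code set concrete, here's a skeletal dispatch loop over it in Python. The instruction tuples, the 'literals' table, and the arity/lookup helpers are made up for illustration, and SUPER SEND, DELEGATEE, NON-LOCAL RETURN, and INDEX-EXTENSION are elided; this is a sketch, not the real SELF VM:

  def arity(selector):
      # crude stand-in: keyword selectors take one argument per ':' (binary selectors ignored here)
      return selector.count(':')

  def lookup(obj, selector):
      # stand-in for SELF's lookup through the object and its parent slots
      return obj[selector]

  def run(method, receiver, stack):
      for op, index in method['instructions']:
          if op == 'SELF':                      # base value
              stack.append(receiver)
          elif op == 'LITERAL':                 # base value
              stack.append(method['literals'][index])
          elif op in ('SEND', 'SELF SEND'):     # message send
              selector = method['literals'][index]
              args = [stack.pop() for _ in range(arity(selector))][::-1]   # popped in reverse push order
              rcvr = receiver if op == 'SELF SEND' else stack.pop()
              stack.append(lookup(rcvr, selector)(rcvr, *args))
          # SUPER SEND, DELEGATEE, NON-LOCAL RETURN, INDEX-EXTENSION elided
      return stack.pop()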

are the other 'message' bytecodes needed? what exactly does the non-local return bytecode do?

"Smalltalk-80 defines a much larger set of byte codes [12], tuned to minimize space and maximize interpretation speed, and includes byte codes to fetch and store local, instance, class, pool, and global vari- ables and shortcut byte codes for common-case operations such as loading constants like nil, true, and 0. Smalltalk-80 systems also use special control flow byte codes to implement common boolean messages like ifTrue:ifFalse: and whileTrue:; the Small- talk parser translates the message sends into conditional and unconditional branch byte codes, open-coding the argument blocks. Similarly, the == message is auto- matically translated into the identity comparison primitive operation byte code. A similar optimization is included for messages like + and <, which the parser trans- lates into special byte codes. When executed, these byte codes either directly invoke the corresponding integer primitive operation (if the receiver is an integer), or perform the message send (if the receiver isn’t an integer). "

"Each byte code in the byte array represents a single byte-sized virtual machine instruction, and is divided into two parts: a 3-bit opcode and a 5-bit object array index. " eg for self.x, the five bits are the index of method 'x' within object 'self'.

i guess that's kinda like indirect indexed addressing mode; you could have just loaded a literal first and then done a SEND, instead of putting the object array index in the instruction. My split indirect indexed addressing mode is a more address-mode-y solution to the same desire to include all this in one instruction, except that with Self, the object instance address is on the stack, NOT in the instruction.

---

we can take some interesting cues from the RISC-V compressed extension set

https://en.wikipedia.org/wiki/RISC-V#cite_note-isa-3

https://riscv.org/wp-content/uploads/2015/11/riscv-compressed-spec-v1.9.pdf

RISC-V has at least 16 and (usually?) 32 integer registers and optionally, 32 floating point registers. Memory is addressed as 8-bit bytes. The Compressed instruction encoding has 16-bit instructions. The ISA comes in 32-bit, 64-bit, and 128-bit address-width variants. C.LI, Load Immediate, has a 6-bit immediate. C.J, PC-relative jump, has an 11-bit signed immediate (+-2k, which i think is +-1k instructions, since instructions are 16-bits). C.BEQZ has an 8-bit signed jump, or +-128 instructions. There is stack-offset addressing (meaning, read and write directly into the stack; there is no autoincrement, no PUSH/POP addressing). There are special instructions to add to/subtract a 6-bit quantity from the stack pointer in multiples of 16 bytes, and to get the stack pointer plus a 16-bit offset in multiples of 4 bytes. The latter suggests that (a) the stack pointer is usually 32-bit aligned, that is, on the stack are 32-bit integers, 64-bit integers and floats and larger, and pointers, (b) there are typically no more than 64 local variables (in which case you can get the address of any local variable on the stack with one of these special add instructions), or maybe 4-128 (in which case you can adjust the stack pointer in the prologue or epilogue with one of the special stack pointer mutation instructions). According to [1], the non-compressed ISA "was designed to permit position-independent code. It has a special instruction to generate 20 upper address bits that are relative to the program counter. The lower twelve bits are provided by normal loads, stores and jumps." -- in other words another instance of 12-bits being used for something.

This is interestingly rather close to what we have, except that we support 16-bit alignment instead of 32, and our instructions are 64 bits instead of 16 (and with a more uniform/simple encoding) (and of course we aren't concerning ourselves with bit-width variants of instructions; and we are CISC not RISC (we have addr modes, not LOAD/STORE)), we support 12-bit immediates instead of 6, and we are supporting 2k local variables instead of 64 (or maybe 128). Note that if we had 12-bit operands with split mode, we'd have 64 instead of 2k, which fits in well with RISC-V.

copying from proj-plbook-plChMipsDlxIsa section 'Summary of Risc-V Compressed (16-bit encoding) opcodes':

MOVs and loads and stores and LOADK:

Jumps and branches:

Other stack-pointer-related:

Arithmetic and boolean logic:

Misc:

with our indirect-immediate split mode, we could do both of the load/stores here (with either the stack register or the target register), except without the scaling. And if we also had a 'register-immediate' mode (take the value held in a register and add an immediate; that's your value, NOT your effective address), then we could do a PC-relative jump, but with only 6-bit range (and unscaled); we could do the branch, but with 6-bit range (and unscaled). Ordinary addition on the stack pointer can do the two 'other stack-pointer-related' ops with our 3-address format and immediate and register addressing modes. Yes, we should have shifts, ADD, SUB, AND, OR, XOR, NOP, BREAK, and BAD.

So this suggests that we should have a split indirect-immediate mode for efficiency, even though we only get 6 bits for it. AND maybe a 'register immediate' split mode too (or even a 'PC plus immediate offset' non-split mode, to allow for 12-bit position-independent/PC-relative JMPs; or mb have a 4/8 split for a register immediate mode, to allow either the stack or the PC to be the base register for this).

actually, since JMP has two input arguments, if JMP just adds its two input arguments, then someone wanting to make a relative JMP with a 12-bit signed offset could just specify register direct mode for the first argument, and specify the PC for the register, and then specify immediate mode for the second argument. Which is what our current proposal has JMP doing anyways (in proj-oot-ootAssemblyNotes7). So i guess we're already done.
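
A tiny Python sketch of that, just to check the reasoning; the addressing-mode names and the register file layout are illustrative, not the actual Ootasm spec:

  def resolve(operand, regs, pc):
      mode, val = operand
      if mode == 'immediate':
          return val
      if mode == 'register':                 # register direct
          return pc if val == 'PC' else regs[val]
      raise ValueError('addressing mode not handled in this sketch')

  def jmp_target(arg1, arg2, regs, pc):
      # JMP just adds its two resolved inputs
      return resolve(arg1, regs, pc) + resolve(arg2, regs, pc)

  # absolute jump:        jmp_target(('immediate', 0), ('immediate', addr), regs, pc)
  # 12-bit relative jump: jmp_target(('register', 'PC'), ('immediate', offset), regs, pc)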

---

so it's interesting how little our ideas for oot assembly have changed since the proposal in proj-oot-ootAssemblyNotes7. The proposal initially had an encoding with 8 bits in the opcode operand, and 4 addr mode bits, and 2 flag bits; the current version has 12 bits in the opcode operand, and no flag bits, and 2 addr mode bits.

And we decided on 3 'optimization' addr modes that might be nice to hardcode instead of have as custom.

i'm currently feeling pretty good about this proposal, maybe it's finally stabilizing. We'll see, i've been wrong before.

Recall that the purpose of Ootasm is to make it easy to naively implement Oot on a new platform, by having an Oot Bytecode that is really easy to implement on any platform, and bootstrapping the rest of Oot in this. At the same time, Ootasm is supposed to retain enough of the high-level semantics to allow interop features to be provided below the Oot Bytecode interpreter runtime (eg by using Python dicts to implement Ootasm dicts, etc). Ootasm should make it easy to begin to optimize this implementation by swapping out parts of the Oot implementation for platform primitives (by having the Oot Bytecode interpreter replace calls to various custom instructions, which are themselves written in cross-platform Ootasm, with calls to platform-specific implementations using native platform primitives). Really advanced implementations of Oot on a given platform could even bypass Oot Bytecode altogether and implement Oot Core directly on top of the platform.

I feel like the current Oot Bytecode proposal would be pretty easy to implement in Python, for example.

---

still concerned about whether we can have relative jumps in conditional branches. In the current proposal we only have skips and absolute JMPs. The idea behind that was: as we inline-expand custom instructions, the relative spacing between instructions will expand, so (a) things that might be within relative jump range in the source code might be pushed out of it after inline expansions, and (b) it would be complex or inefficient to make the compiler rewrite affected relative jumps between each expansion, so the compiler is going to just assign each relative jump a label anyways and then resolve the labels once after all inline expansions.

But we only have room for 4k jump labels in each module's constant table. This isn't enough for all branches in the code. Which means that an inlining compiler must maintain a much bigger jump table, which means that (a) intermediate compilation stages wouldn't be valid Oot, and (b) in the end the jumps are going to have to be PC-relative, rather than absolute, anyways; and (b) also implies (c): currently, PC-relative jumps are limited to 12 bits. Maybe we want to add a JMPREL (relative jump) instruction, which is implicitly PC-relative, so the jump offset gets 24 bits instead of just 12? 24 bits is 16 MB, which is not huge by conventional standards, but perhaps an okay largest size for a naively-compiled Oot compilation unit (i say 'naively compiled' because a really fancy compiler could split one Oot module into many Ootasm modules).

Seems to me (but i havent thought hard about it) that if, in all source code, near jumps are limited to an offset of 12 bits, then we can do up to 12 inlining steps and still guarantee that the end result will be within an offset of 24 bits (assuming each inlining step at most doubles the distance between a jump and its target, so each step costs at most one extra bit of offset).

Which means that if we limit the Oot Custom Instruction 'call stack' to a depth of 8 (or even 12; but 8 gives us some breathing room), we're good to go. We just specify that the Ootasm 'runtime' must be able to handle the 'JMPREL' opcode, but Ootasm 'source code' must never contain it.

(although i guess the compiler could also just STOP INLINING if the call stack gets deeper than that; but since this isn't a normal stack it would be nice to give it a predefined limit anyways).

---

list of 'magic' numbers/file format signatures https://en.m.wikipedia.org/wiki/List_of_file_signatures

---

in a way Oot 'Bytecode' is a misnomer, since we have 12 bits of opcode, not just 8 (that being said, the primitive ops are limited to 8 bits).

---

i think it's time to write an updated draft of the current proposal (currently found in proj-oot-ootAssemblyNotes7). Not much has changed, but i bet the second draft will be better.

---

yknow after writing the 'motivation' section below, actually i'm not so sure that we can get by without 'core oot'. The motivation below stresses that one of the reasons for this bytecode instead of others is so that it is high-level enough to produce readable code when compiled into an HLL. But if bytecode is flat/linear, then how is that true? Sure, we can reconstruct expressions via tracing use of the implicit stack, but that seems kind of backwards.

Maybe we should just have an AST and walk/execute the AST. This is probably less efficient than linear bytecode but the point here was to make a naive implementation, not an efficient one.

Anyway, didn't i read that Perl5 executed the AST, or maybe that was Ruby? (i think Perl6 does have bytecode though, ie MoarVM?)

https://www.reddit.com/r/learnprogramming/comments/w6i0p/bytecode_execution_vs_ast_walking_speed/c5arute gives a simple example of AST walking:

" Consider the trivial program

x = 3 + 2

It produces an AST along the lines of

ast = { op_code: OP_ASSIGN, name : 'x', expr : { op_code : OP_ADD, lhs: { op_code: OP_LITERAL, value: 3 }, rhs: { op_code: OP_LITERAL, value: 2 } } }

So an evaluation loop for the trivial language would be something like the following, using a Pythonish pseudocode.

  op_environ = {}
  op_stack = []
  result_stack = []
  op_stack.push(ast)
  while op_stack:
      op = op_stack.pop()
      switch op.op_code:
          case OP_ADD:
              op_stack.push(AUX_OP_ADD_END)
              op_stack.push(op.rhs)
              op_stack.push(op.lhs)
          case AUX_OP_ADD_END:
              rhs_result = result_stack.pop()
              lhs_result = result_stack.pop()
              result_stack.push(lhs_result + rhs_result)
          case OP_ASSIGN:
              op_stack.push(AUX_OP_ASSIGN_END)
              result_stack.push(op.name)
              op_stack.push(op.expr)
          case AUX_OP_ASSIGN_END:
              value = result_stack.pop()
              name = result_stack.pop()
              op_environ[name] = value
          case OP_LITERAL:
              result_stack.push(op.value)

yada... yada... yada... " (my note: i changed OP_ASSIGN_END and OP_ADD_END to AUX_OP_ASSIGN_END and AUX_OP_ADD_END)

as you can see, the AST walker is not harder to implement or to read, and can be organized into a big SWITCH just like bytecode. The 'implicit stack' is very explicit here and replaces our 'stack addressing' within expressions (which is nice, because it makes a clear separation between the 'implicit stack', which is invisible to the code and is an implementation detail of the AST walker, and the stack which is visible to the code).
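
For concreteness, here's the same loop as actual runnable Python (if/elif standing in for the pseudocode's switch, and string tags standing in for the opcode constants; otherwise the structure is identical):

  OP_ASSIGN, OP_ADD, OP_LITERAL = 'assign', 'add', 'literal'
  AUX_OP_ASSIGN_END, AUX_OP_ADD_END = 'assign_end', 'add_end'

  ast = {'op_code': OP_ASSIGN, 'name': 'x',
         'expr': {'op_code': OP_ADD,
                  'lhs': {'op_code': OP_LITERAL, 'value': 3},
                  'rhs': {'op_code': OP_LITERAL, 'value': 2}}}

  def run(ast):
      op_environ = {}
      op_stack = [ast]
      result_stack = []
      while op_stack:
          op = op_stack.pop()
          code = op if isinstance(op, str) else op['op_code']   # AUX markers are bare tags
          if code == OP_ADD:
              op_stack.append(AUX_OP_ADD_END)
              op_stack.append(op['rhs'])
              op_stack.append(op['lhs'])
          elif code == AUX_OP_ADD_END:
              rhs_result = result_stack.pop()
              lhs_result = result_stack.pop()
              result_stack.append(lhs_result + rhs_result)
          elif code == OP_ASSIGN:
              op_stack.append(AUX_OP_ASSIGN_END)
              result_stack.append(op['name'])
              op_stack.append(op['expr'])
          elif code == AUX_OP_ASSIGN_END:
              value = result_stack.pop()
              name = result_stack.pop()
              op_environ[name] = value
          elif code == OP_LITERAL:
              result_stack.append(op['value'])
      return op_environ

  print(run(ast))   # {'x': 5}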

it kind of reminds me of an SECD machine. Of course, we would add partial application stuff to our implementation, bringing it closer to an SECD machine.

Now what about compilation to more serial target languages like assembly (or maybe BASIC, which allows arithmetic expressions within lines of code, but not blocks of code such as while loops)? Now porting Oot to these is a little harder because the person doing the port must do an additional compile step to replace subexpressions with a sequence of instructions using an in-program stack. But since this is the same for every target language, we could provide a module to do this for them.

Or, maybe we should just include parens in the bytecode? But i guess that is basically a serialized AST...

So i guess what i am proposing is a unification of Oot Core and Oot Bytecode. This is a 'bottom-up' way of designing Oot Core. The notes files proj-oot-ootCore, proj-oot-ootCoreThoughts, proj-oot-ootCoreNotes1, proj-oot-ootCoreNotes2 contain a more 'top-down' approach, which will eventually have to be merged with this one.

If we unified Oot Bytecode and Oot Core, then the subset of 'primitive operations' in Oot Bytecode becomes another, smaller core language within Oot Core; it might be called Oot CoreCore (a less silly name would be Oot Core Primitives, but i think Oot Core Primitives is harder to keep straight than Oot CoreCore; ie i think ppl would confuse 'Oot Core' with 'Oot Core Primitives', but no one will confuse Oot Core with Oot CoreCore).

So i guess all this thinking about the Oot Bytecode encoding will eventually become just a fancy, well-thought-out AST serialization format. I think this way of thinking about it is somewhat valuable, however, as it allowed me to get a good feeling for how assembly-language constructs map onto the AST. As i've seen, assembly language is more determined by encoding constraints than i would have thought, so thinking about these constraints and designing an assembly-language-like AST serialization format with similar constraints first probably helps in making Oot Core something that will be easier to practically implement on a variety of platforms.

Now, what happens to our stack addressing mode if programs are an AST? I guess it stays (we can use it for the call stack and/or for a data stack holding local variables and/or passing parameters and return values across function calls). And how do we actually encode subexpressions? A naive AST in a language like Python would probably just give each node an arbitrary-length list of references to child nodes, so at the low level a node would probably be represented in memory as an object which holds a pointer to a dynamically-sized list of pointers. But our Oot Bytecode was designed so that each instruction has exactly 3 operands. So instead of variable-length lists, each node is a fixed-size struct (64 bits, to be exact).

Note that immediately-specified conditionals and jumps in our current scheme use 12-bit relative offsets or absolute node IDs (locations in the code stream), as opposed to an AST, which would probably use native pointers between nodes.

Currently, the way you'd express the expression "x + y" would be a sequence of statements, PUSH x; PUSH y; + (stack) (stack);. But an AST expresses it via a node + that has pointers to the nodes for "x" and for "y". So how do we update our stack addressing mode to do this? We could add an additional SUBEXPR addr mode for subexpressions. Instead of native pointers, each node could hold a relative offset to other nodes.

Another thing about executable ASTs is that they seem to tend towards a LISP-y or Haskell-y idea of everything-is-an-expression. By contrast, one thing about assembly and flat bytecode is that it tends towards serial sequences of statements. How would a sequence of statements be represented in an AST? Possibly either as a 'block' node with arbitrarily many children (like Lisp PROGN), or with each node having a 'successor' pointer. But we'd rather do it the assembly way; with sequences just being successive entries in the flat array of statements.

In other words here's how i see this working. Consider encoding something like the following:

  x = 2
  y = 3
  z = x + y

it would (still) be:

  CPY x, 2   # where x is a register, and 2 is the immediate value 2
  CPY y, 3
  ADD z, x, y

whereas

  x = 2
  y = 3
  z = 5*x + 7*y

would be:

  CPY x, 2
  CPY y, 3
  ADD z, *SUBEX1, *SUBEX2   # where * is 'subexpression addressing mode', and SUBEX1 represents the relative distance to the SUBEX1 label
  J CONTINUE
  (
  SUBEX1: MUL RESULT*, 5, x   # where postfix * is 'result of subexpr addr mode'
  )
  (
  SUBEX2: MUL RESULT*, 7, y
  )
  CONTINUE: ...

(note: i guess you don't really need the '('s in the instruction stream, only the ')'s; and in fact not even those, as the subexpression terminates when RESULT* is encountered; note that this would mean that it must be verified that every control flow path through the subexpression terminates with a RESULT*; to avoid having to do that verification, you could just include the ')'s as a fallthru default (if one is encountered, you return NIL))

(note: 'RESULT*' could just be '0' in subexpr addr mode)

(note: in a strict language, you'd always need to execute SUBEX1 and SUBEX2 before the ADD, so you'd want to put them before the ADD and execute them upon encountering them, postfix style; but in a lazy language, you need to see the ADD first. So this way is more general, although more inefficient, since to execute SUBEX1 and SUBEX2 you're jumping all over the place (which is how the AST tree walking always works, however; it encounters the top-level of an expression first, then expands it top-down). If you want more efficient execution of small subexpressions, could use SHORT encoding)

So, this is a pretty neat thing we're contemplating; kind of a superpowered assembly that can include expressions and has a 'subexpression addressing mode', or a superpowered AST that can include linear sequences without the overhead of a 'next' pointer in every node, with relative addressing instead of native pointers, and with a concise fixed-length node format.

Also, there is a unique obvious linearization of this, so if you do need to convert it into flat bytecode, there is an obvious way to do that (convert the subexpression addr mode into stack ops).
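
Here's a minimal sketch of that linearization in Python, assuming a simplified in-memory form in which a subexpression operand has already been resolved to the list of instructions it points at (so the relative-offset and RESULT* bookkeeping from the example above is elided):

  RESULT = ('result', 0)            # stands for the RESULT* destination inside a subexpression

  def linearize(instrs):
      out = []
      for op, dest, *srcs in instrs:
          new_srcs = []
          for mode, val in srcs:
              if mode == 'subexpr':             # val is the subexpression's own instruction list
                  out.extend(linearize(val))    # emit it first; its RESULT* write becomes a push
                  new_srcs.append(('stack', None))
              else:
                  new_srcs.append((mode, val))
          if dest == RESULT:
              dest = ('stackpush', None)        # leave the value on the implicit stack
          out.append((op, dest, *new_srcs))
      return out

  # the ADD example above comes out roughly as:
  #   MUL (stackpush), 5, x ; MUL (stackpush), 7, y ; ADD z, (stack), (stack)
  # (getting the pop order right for non-commutative ops needs a little more care)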

note: for this reason, register '1' might want to be reserved! (so that it can be used as the stack pointer for the implicit stack in linear bytecode)

note: so is non-local control flow disallowed within subexpressions? Probably not; since in languages like eg Scheme, everything is a subexpression, this would make it hard to allow continuations. Also, you could always CALL something from within a subexpression that generates an exception that you have to bubble up past the original call.

if non-local control flow is allowed, then you might want to expose the 'implicit' stack to the program so that it can eg save it if it is serializing its state when a continuation is reached. Which means that you may just want to combine it with the data stack.

i'm having second thoughts about whether this is precisely Oot Core, though. Does Oot Core really address conditionals in terms of 12-bit relative branches? It seems to me that Oot Core is more of an ideal data model for an AST, possibly with some attached constraints, and that this bytecode is just one representation of that data model, such that the physical constraints of this representation are reflected in the constraints attached to the Oot Core specification. Also, should those constraints even be there? Philosophically, the idea that "2k functions is enough for one module" and "2k local variables is enough for one function" sound similar to "2 bytes is enough for one number"; but i thought we wanted arbitrary-precision arithmetic at the Oot Core level. So maybe these limitations should apply to the bytecode format but not to Oot Core, in which case there is a distinction between them.

More glaringly, Oot Bytecode doesn't assume that the platform has garbage collection and schedulers, and so provides these, although they can be swapped out; so there may be explicit calls to GC and scheduler stuff all over the place, right? But Oot Core probably just assumes these things. All of which suggests that Oot Assembly is, actually, different from Oot Core.


another issue: shouldn't the interpreter implicitly perform read and write barriers in the heap for us when we access the heap through the ordinary addressing modes? although it can call Oot Bytecode Stdlib routines to do this.


Another issue: What about polymorphism? We've kind of swept that under the rug. The reasoning for having just ADD, rather than ADD8, ADD16, ADD32, ADD64, UADD8, ADD_RATIONAL, etc, was that (a) we want to use the numeric types of whatever platform Oot is running on, so we can't guarantee that any particular type of number is available, (b) sometimes we really do need late bound polymorphic ADD (eg when ADDing OOP instances), and (c) it's simpler; if Lua can do that, why not us?

But this means that either a type tag is stored for each value at runtime; or that at compile time the program is compiled down to a level below Oot Bytecode, a level at which ADD is resolved into things like ADD8, with type annotations in the Oot Bytecode being used to statically choose; or that Oot Bytecode contains a prefix annotation before each ADD saying what the type is, which would effectively make ADD instructions 128 bits, not 64 (or i suppose we could have a 'default' type for each operation (eg INT16 for ADD) and prefix instructions for others).

Also, the HLL might say 'i promise this value is always < 2^16', in which case 16-bit integers can be used; but the HLL might also say, 'i need wraparound at 2^16!' in which case 16-bit integers MUST be used, or at least emulated.


re: polymorphism; i guess the idea is that we do the usual thing (ADD32, ADD64, UADD32, etc; that is, a different opcode for each specialized method), except that the primitive ADD (or UADD0 or something like that) isn't any of these; instead it's UADD015:

In order to avoid making assumptions about what happens upon overflow, the Oot implementation promises not to call UADD015 unless it can guarantee that the result will be < 2^15. Either it is known ahead of time that the result is < 2^15, or USUB015 is first used to check that ((x < 2^15 - 1 - y) AND (y < 2^15 - 1 - x)).
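
As a minimal Python sketch of that guard (uadd015/usub015 are naive stand-ins for the platform primitives, and the OverflowError fallback is just for illustration):

  LIMIT = 2**15

  def uadd015(x, y):
      # stand-in for the primitive; only ever called when the result is known to be < 2^15
      return x + y

  def usub015(x, y):
      # stand-in for the primitive subtraction used to build the guard
      return x - y

  def guarded_add(x, y):
      # the check from above: (x < 2^15 - 1 - y) AND (y < 2^15 - 1 - x)
      if x < usub015(LIMIT - 1, y) and y < usub015(LIMIT - 1, x):
          return uadd015(x, y)
      raise OverflowError('result might not fit below 2^15')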

Super-slow, yes, but also makes it super-easy for the implementer to get their first naive implementation running on a new platform. Later on, more advanced implementations can override ADD32, etc, with platform primitives.

hmm otoh maybe the interpreter should support a polymorphic ADD instruction. This would leave open the possibility of doing dynamic dispatch on boxed values. On primitive values, the interpreter would consult its stack/memory type maps to see what type something is, and then dispatch to that. This could be done at runtime; or the interpreter could specialize the ADD at load time based on the stack/memory map.

The advantage here is it allows generic code to be written (eg you can have an ADD function that you can pass to MAP that works on a list no matter whether the numbers to be added are 015, 16-bit, 32-bit, etc). The disadvantage is slowness, unless the interpreter bothers to consult its memory maps and pre-specialize.

Let's do it.

So, the Oot Bytecode implementer still only has to implement UADD015, and the dynamic dispatch in ADD. But when they implement the interpreter, they have to deal with the dynamic dispatch.
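
A sketch of what that dispatch might look like in an interpreter written in Python; the type-tag names and table layout are invented for illustration:

  def add_u015(x, y):
      # the one routine a naive port really has to provide
      return x + y

  def add_i32(x, y):
      # later overridden with a platform primitive; here, wraparound emulated in Python
      return (x + y + 2**31) % 2**32 - 2**31

  ADD_IMPLS = {
      'U015': add_u015,
      'I32': add_i32,
      # ... further specializations filled in as the port matures
  }

  def poly_add(type_tag, x, y):
      # type_tag comes from a runtime tag on boxed values, or from the
      # stack/memory type maps when the interpreter pre-specializes at load time
      impl = ADD_IMPLS.get(type_tag)
      if impl is None:
          raise TypeError('no ADD specialization for type ' + type_tag)
      return impl(x, y)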

Maybe we should also make a rule that a directly-addressable memory location cannot change its type during one function; although it can be of type DYN.

Maybe type maps should have space for 2k per-module types, plus 2k type variables (used in generic functions).

---

why not support 8-bit, eg UADD07? Because

even on systems which don't have native instructions for adding and subtracting 16-bit numbers, it's often easy to do so. Eg, on the 6502 [2]:

  SEC         ; RESULT = NUM1 - NUM2
  LDA NUM1L   ; After the SBC NUM2H instruction:
  SBC NUM2L   ;   V = 0 if -32768 <= RESULT <= 32767
  STA RESULTL ;   V = 1 if RESULT < -32768 or RESULT > 32767
  LDA NUM1H
  SBC NUM2H
  STA RESULTH
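
The same trick sketched in Python, for comparison: a 16-bit subtraction built from two 8-bit subtract-with-borrow steps, as in the 6502 code above:

  def sub16(num1, num2):
      # low byte first, remembering the borrow, then the high byte
      lo = (num1 & 0xFF) - (num2 & 0xFF)
      borrow = 1 if lo < 0 else 0
      hi = ((num1 >> 8) & 0xFF) - ((num2 >> 8) & 0xFF) - borrow
      return ((hi & 0xFF) << 8) | (lo & 0xFF)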

---

the primitive operations are not truly minimal; for example, we have a USUB0 operation even though you could build subtraction out of addition and conditional branching. Another example: if you have branch-if-zero and jumps you can make branch-if-nonzero.

hmm otoh mb we SHOULD make it more minimal just to drive home the point that implementors really really should make a not-100%-naive implementation ASAP.

---

another random note: i guess ideally LONG format would not be just a random semantic-less serialization format for future DSLs and languages, but rather a certain way of expressing predefined concepts such as prefix instructions and addressing mode flags which has a fixed semantics in terms of more unwieldy MEDIUM-format code constructs. In other words, i want to give LONG a semantics in terms of specifying how to automatically compile any LONG code into MEDIUM code.

---

we need a way to pass simple instructions as 'first-class functions', too!

(one idea is to make a 'builtins' module which resides at constant 1 of every constant table, but i prefer the next idea.)

another idea, which i currently like best, is to just interpret immediate-mode arguments given where a function is expected as opcodes for instructions

(eg if opcode 4 is CPY, and the first argument of a MAP instruction expects a function, then by giving immediate-mode 4 for that argument, CPY is indicated as the fn).
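
A rough Python sketch of that second idea; the opcode numbers and the MAP helper here are made up, the point is just that an immediate integer in a function position names a primitive:

  OPCODES = {
      4: lambda x: x,        # pretend opcode 4 is CPY; as a one-argument function it's just identity
      5: lambda x: x + 1,    # pretend opcode 5 is INC
  }

  def resolve_function(arg, addr_mode):
      if addr_mode == 'immediate':
          return OPCODES[arg]   # an immediate where a function is expected names an opcode
      return arg                # otherwise the argument is already a function value

  def map_instruction(fn_arg, addr_mode, xs):
      fn = resolve_function(fn_arg, addr_mode)
      return [fn(x) for x in xs]

  # map_instruction(4, 'immediate', [1, 2, 3]) applies CPY (identity) to each element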

---

old verbiage:

Abstract

Oot Bytecode is a simple, easy to implement virtual machine. Oot Core is implemented in Oot Bytecode. This makes it easy to bootstrap a naive implementation of Oot; all you have to do is to implement Oot Bytecode.

Terminology

Oot Assembly is a textual language which corresponds to a binary representation called Oot Bytecode. Oot Bytecode consists of a very small and easy to implement core language (Oot Bytecode Core), and the Oot Bytecode Standard Library, which is itself written in the Oot Bytecode core language.

Motivation

In order to port Oot to a new platform, all you have to do is to implement an interpreter or compiler of Oot Bytecode Core (plus whichever OS services, such as file access, you want to be able to use from Oot). The reference implementation of Oot Core is provided in Oot Bytecode, and the reference implementation of Oot can be bootstrapped from Oot Core.

Why did we create a new language (Oot Bytecode Core)?

Existing popular target languages were judged to be too laborious to implement or too low-level. But why do we require our target language to be easy-to-implement and high-ish-level?

Oot is intended to be a cross-platform 'glue language' and 'scripting language' that can be compiled not only into various lower-level target platforms (for example, GNU/Linux, MacOS, Windows, Android, iOS, JVM, CLR, Webassembly, LLVM), but also into various other high-level languages (HLLs) (for example, Java, Javascript, C, Python). It is intended to be able to smoothly interoperate with code written in these languages (to call and be called by foreign code, to pass data to and from foreign code, and to be compilable to reasonably readable foreign code); the idea is that if you already have a project in some other language, but there is a library in Oot that you would like to use, then you should be able to extend your project by scripting in Oot; and if you already have a library in some other language but you want to write a project in Oot, then you should be able to use that library from Oot.

Because Oot wants to live on many different platforms, Oot's implementation must be easy to port. Furthermore, for the sake of interoperability, Oot should be able to run 'on top of' various other platforms, making use of the host platform's data types (such as strings, lists) and services (such as garbage collection and greenthreads).

The idea is that an interpreter or compiler and runtime for Oot Bytecode Core should be really easy to implement naively on a new platform. Many existing target languages are too large to be really quick to port.

The Oot Bytecode standard library then uses the Oot Bytecode Core primitives to implement basic data structures, control flow, and services such as lists, while-loops, garbage collection, greenthreads etc. At this point, the implementer can bootstrap a naive implementation of Oot on the new platform.

Later, in order to provide better interoperability or performance, the implementer can incrementally 'swap out' the default Oot Bytecode standard library implementations of things like lists, while-loops, garbage collection, greenthreads, etc, for platform-native implementations.

So, we want almost all of the data types, control flow, and services that the host platform might provide to be able to live 'below' the Oot Bytecode layer if desired, being implemented by the Oot Bytecode implementation using native platform primitives (even though, in an initial naive Oot Bytecode implementation, these things are provided by the Oot Bytecode Standard library). If we just compiled everything to, say, LLVM, then, say, while-loops would no longer be easy to recognize when expressed as LLVM assembly. This would make it harder to compile Oot to readable code in a target HLL later on. Instead, we want a while-loop to retain its identity as a while-loop. We could compile to LLVM and introduce a complex system of annotations on top of the LLVM bytecode, but that's almost as much work as creating Oot Bytecode.

Why not just implement Oot Core directly? Really advanced Oot implementations on some platforms could bypass Oot Bytecode altogether and provide an interpreter or compiler directly from Oot Core to the host platform; that will probably yield the best results. But that's slightly more work, and we want to make it really easy for people to quickly write naive ports even when they don't have the time or expertise to implement Oot Core.

---

should we mb have 2k memory locations instead of 4k, allowing the interpreter to use the other 2k for something else?

also limit depth of 'subexpression' address mode implicit nesting to 2k

---

so i guess each module must have a polymorphic function lookup table for functions that it provides?

---

todo: update the following for the AST walking stuff

---

RISC-V's memory model looks pretty simple, mb we should use it. It's 'relaxed', with two FENCE instructions: FENCE (for data) and FENCE-I (flush instruction cache).

FENCE is parameterized by 8 bits, P(I,O,R,W), S(I,O,R,W) which indicate a 'predecessor set' and 'successor set' of operation types. "no other RISC-V thread or external device can observe any operation in the successor set following a FENCE before any operation in the predecessor set preceding the FENCE". Whether a read/write is considered R/W or I/O is implementation-dependent.

[3]

---

RISC-V places their FORMAT bits in LSB, not MSB, so maybe we should do the same:

" We chose little-endian byte ordering for the RISC-V memory system because little-endian sys- tems are currently dominant commercially.... We have to fix the order in which instruction parcels are stored in memory, independent of memory system endianness, to ensure that the length-encoding bits always appear rst in halfword address order. This allows the length of a variable-length instruction to be quickly determined by an instruction fetch unit by examining only the rst few bits of the rst 16-bit instruction parcel. Once we had decided to fix on a little-endian memory system and instruction parcel ordering, this naturally led to placing the length-encoding bits in the LSB positions of the instruction format to avoid breaking up opcode elds. " [4]

RISC-V encodes sign with the MSB, i think:

"Sign-extension is one of the most critical operations on immediates (particularly in RV64I), and in RISC-V the sign bit for all immediates is always held in bit 31 of the instruction to allow sign-extension to proceed in parallel with instruction decoding."

---

here's what RISC-V says about overflow:

" We did not include special instruction set support for over ow checks on integer arithmetic operations, as many over ow checks can be cheaply implemented using RISC-V branches. Over- ow checking for unsigned addition requires only a single additional branch instruction after the addition. Similarly, signed array bounds checking requires only a single branch instruction. Over ow checks for signed addition require several instructions depending on whether the ad- dend is an immediate or a variable. We considered adding branches that test if the sum of their signed register operands would over ow, but ultimately chose to omit these from the base ISA. "

however [5] says this is a terrible idea and that HLL performance is dominated by overflow checks.

---

i guess our 'types' are more like Haskell typeclasses, because we can have 'abstract' ones like INT?

---

(text of current proposal moved to ootAssemblyThoughts)


Isn't it really slow to check all the time if ((x < 2^15 - 1 - y) AND (y < 2^15 - 1 - x))?

Yeah, but this allows the implementer to just call the platform primitive in UADD015 without doing extra work to communicate a detected overflow using Oot's conventions. This makes it dead easy to get an initial naive port running. In addition, often we will know ahead of time anyways that the result is < 2^15, in which case we just want to do the addition without extra unnecessary work. When we don't know ahead of time, Oot code will be using higher-level functions like UADD16 and ADD16, which almost all implementations will be overriding and implementing using platform primitives anyways; only in the most naive implementations will the default implementations, which do these checks via subtraction, be used.

(also, it may not be that inefficient; i think there was even an ISA that took a similar approach; was it RISC-V?)

---

Why only 2k directly-accessible memory locs, rather than 4k, since immediate memory addresses are 12 bits?

B/c the other 2k immediate address values are reserved for use by the implementation in intermediate representations.

---

i guess we don't strictly need prefix modifiers because we can have custom instructions that examine the next instruction and execute it in a wrapper

---

has anyone else tried to make a polymorphic assembly language? Yes, apparently "TALx86". Should read about it mb:

https://www.cs.cornell.edu/talc/papers/talx86-wcsss.pdf

---

[6] agrees with me that "Interop between languages needs to happen at a level far higher than abstract assembly to get any meaningful benefit. "

---

design choice commentary:

It's kinda weird that we have limits like 2k 'virtual registers', and all these other 2k limits. Many programming languages have 64k limits. LLVM has infinite virtual registers (and we're supposed to be 'higher level' than LLVM!). Why didn't we do that?

What LLVM does is it has a scheme for representing arbitrarily-large integers. At first i wanted to do that too, but i decided not to, because (a) such a scheme is more complicated to implement, and (b) besides the effort spent directly in implementing this, this means that other parts of the implementation, for example any program analysis, cannot assume an upper bound on how many registers there are; so you cannot for example have a 2-byte integer 'register label' anywhere in your implementation, because what if more than 2^16 registers are used? Since one of our primary goals is that this be easy to port, via making it easy to implement, i judged that this additional implementation complexity was not worth it.

Now, why not 64k limits instead of 2k? If we had 64k limits with our 4-address format, then we'd have 64 bits per instruction just to hold the operands. If we want to 64-bit align our instructions, that's 128 bits per instruction, which is much higher than other bytecodes (even 64 bits is more than most 'high level' bytecodes use). And, whether or not we jump all the way to 128 bits, most of the time most of the 64 operand bits will just be 0s, because it will be very rare that any of these values goes above 2k. I just couldn't stand that amount of waste, even though this is premature optimization; and since we can get close to 64 bits, i think it is cleaner to stick close to a power of 2 (and there are probably some efficiency benefits to that as well).

2k is still much larger than most other assembly languages (and most of those operand bits will still be 0s), and yet much smaller than the 64k limits on most HLLs. Is this a Goldilocks perfect compromise, or a useless system that tries to serve multiple use-cases and ends up doing neither well? We'll see.

---

One other thing we could do is to also provide a compiler from the textual Oot Assembly into Oot Bytecode which removes most of the limits; eg it internally uses structs in functions which have more than 2k locals, and breaks apart modules which have more than 2k functions or constants.

---

"LLVM is a simpler language to target than traditional assembly languages—it has nice features like infinite registers and recently support for stack frames. " [7]

---

"Rust compiles into LLVM because anything higher-level wouldn't provide sufficient control over the processor and memory." [8]

"a lot of languages that compile to assembly do so because it has the fewest opinions on how your program should run" [9]

---

[10] advocates a Rust compile-target for interoperability:

" 1. Rust has an advanced type system. Interoperability between languages is limited because often times the least common denominator between two programming languages is at the assembly level...

2. Rust has several useful programming constructs that don't exist in LLVM. A lot of ideas we take for granted in our programming languages like loops, closures, and enums (sum types), have to be re-implemented every time a new language compiles to LLVM. By compiling a language to Rust, not only do you get access to these constructs, but in my mind it's generally far easier to write the code generator than generating LLVM by hand, particularly with quasiquotations.

3. Rust comes with a package manager, Cargo...

4. If new languages compiled to Rust, not only would they get a package manager for free, but they also get free access to programs written in every other language that compiles to Rust....This means writing a language that compiles to Rust gets all existing Rust libraries potentially for free—no need to rewrite HTTP servers or command-line parsing for the millionth time.

5. Rust only imposes one opinion on your program: that it should be safe. I would argue that the only real difference at the runtime level between programs you can write in LLVM and programs you can write in Rust is that Rust requires all of your programs to be safe, i.e. not have memory errors/data races. Rust does not require you to have dynamic types, garbage collection, or really any particular runtime overhead—all of the hard work is done at compile time. "

for Oot, though, having Rust be the only compile target is insufficient, b/c afaict Rust can only target what LLVM can target [11], which probably doesn't include transpilation to Lua etc, at least not without losing HLL constructs (also i think writing a new LLVM backend is probably too complicated).

i guess it should be ONE of our compile targets though. If Rust becomes the new C then it will become the default interop compile target.

---

Textual assembly format

All i know is that (a) it'll mostly look like assembly (but are there subexpressions, blocks?), and (b) there will be some compile-time macros and possibly other compile-time metaprogramming stuff

I guess insofar as there are a bunch of less important/bikesheddy syntax decisions, we should follow the syntax of one or more existing popular assembly-like textual formats. Let's survey what these are:

In summary, the more commonly mentioned ones (plus newer standards that i think we should consider) are:

out of these, the most common, plus ones i'm forcibly including as above, are:

excluded ones syntax notes:

vasm:

    Standard MIT (GNU-as style) syntax
    Motorola/Freescale 68k syntax (Devpac compatible)
    Atari MadMac syntax (6502, 68k, Jaguar)
    Old 8-bit style syntax

http://wayback.archive.org/web/20160307203948/http://staff.mmcs.sfedu.ru/~ulysses/Edu/MP/WhichAsm.html gives the following syntax similarity list:

    MASM
    TASM
    
    FASM
    NASM
    A86/A386
    GoASM
    Gas/.intel_syntax mode
    
    RosAsm
    
    HLA
    
    Gas/.att syntax mode
    Terse 

"Of these assemblers, only MASM and TASM provide any source code compatibilty whatsoever at all. Specifically, TASM is capable of assembling most MASM source files without modification (the reverse isn't always true, as TASM has a few more features and a specialized "ideal syntax" mode that exists primarily to let TASM users write source code that is incompatible with MASM).

The FASM, NASM, A86/A386, GoASM, and Gas (with .intel_syntax option active) assemblers are not at all compatible, but translation between these assemblers is mostly a mechanical task (that is, substituting one string directly for another throughout the source file with only minimal interpretation on the part of the translator). The assemblers sitting in gaps by themselves (RosAsm, HLA, and Gas with the .att syntax option) generally require quite a bit of work to translate source code from some other assembler to these assemblers.

HLA and Terse are special cases. HLA and Terse suffer from the consumer problem insofar as their syntax is sufficiently different from the other assemblers that it will take a bit of effort to convert that other source code to HLA or (especially) the Terse format. However, these assemblers solve the producer problem in a novel fashion - HLA provides an option to translate HLA source code to that used by several other assemblers (as this is being written, HLA provides the ability to translate HLA source code into MASM, TASM, Gas, or FASM format (NASM is planned and being worked on while this is being written). Similarly, Terse translates its source code into MASM or Gas form. Therefore, should you need to translate HLA or Terse source code to one of these other formats, the assembler does much of the basic work for you. No other assemblers provide this novel feature. "

"Technical articles and mini-tutorials, however, are another matter. Anyone can crank out a few pages of text that describes some programming procedure or other concept. As such, you'll find the playing field a little bit more level (i.e., not totally skewed towards HLA, MASM, and NASM) in this area."

" MASM is, without question, the most popular assembler so it is not surprising that the largest number of different articles and web pages describing some little feature are written around MASM. NASM, probably being the second most popular assembler comes in second. HLA has a tremendous number of articles and related information here on Webster, though the number of HLA-related technical articles at other websites is no better than other assemblers. There are several sites dedicated to Gas, especially with respect to programming under Linux in assembly language. So Gas probably comes in third (it's a real shame that these sites don't provide good Gas documentation as well, something that's sorely needed by Gas users and wannabe Gas users). "

" Most of the technical articles out there describe how to use a given assembler to make system calls to a specific operating system. For example, the excellent tutorials by Iczelion are a set of over 30 technical articles with MASM32 source code describing how to write Windows GUI applications. These tutorials are so popular that many of them have been translated from MASM32 to other assemblers (e.g., NASM, SpAsm?/RosAsm?, and HLA). There are similar papers available describing Linux system calls with assemblers such as NASM, Gas, and HLA. "

"

Some people are only comfortable going with a product that has been well accepted by the marketplace. A few years ago it was very easy to state which assemblers were the most popular: they were MASM and TASM (in that order) and then all the other assemblers made up a small fraction of the remaining assembler users. TASM, however, died off in the late 1990's (and Borland stopped supporting it) so it rapidly fell out of favor. Though Microsoft still supports MASM, they stopped selling it as a commercial product in the late 1990's, and its popularity began waning (once people have to look for an assembler, rather than just grabbing one off the shelf at their local software store, they tend to discover all the other tools that are available and often choose a product better suited to their tastes).

MASM probably has 80% of the market share today in x86 assemblers (at least, as this was being written). Part of this is momentum (MASM was the best choice in assemblers back in the days of DOS). However, Hutch's MASM32 package and the Iczelion tutorials have rekindled interest in MASM programming under Windows. Though Microsoft's support for MASM is very low key and interest in MASM is waning (for political as well as technical reasons), it will be many years before any other assembler comes close to MASM's popularity.

In second place, undoubtedly, is the NASM assembler. NASM began as a "software community" programming project in the middle 1990's. The goal of NASM was to provide a free "Intel syntax" assembler as an alternative to MASM (Gas was available and free at the time, but it uses the AT&T syntax which most x86 assembly programmers dislike). "

"NASM is another example of a popular product whose fortunes have faded a bit because of lack of recent support. As noted earlier, many people have defected from NASM to FASM because of the perception that NASM is no longer being supported.

Based on announcements in various assembly language newsgroups and web pages, the FASM, RosAsm, and HLA assemblers are the ones that get updated most frequently (probably averaging about one update per month). Users of these assemblers can usually expect to have problems dealt with in a timely fashion (unlike some of the other products that rarely get updated).

of the forums I regularly visit, MASM32 is very active, FASM is active, and HLA is active. "

"

Currently, here are the assemblers that have source code available and their source code format and license:

    HLA, C/Flex/Bison, public domain
    Gas, C/Flex/Bison, GPL
    NASM, C, LGPL
    RosAsm, assembly (RosAsm), GPL
    FASM, assembly (FASM), freeware
    "
    

" The FASM assembler has generated a lot of interest recently as the "heir-apparent" to NASM (though another product, YASM, is now making that claim, too). FASM is much faster than NASM (FASM is written in assembly, NASM in C) and FASM generates much better object code than NASM because it automatically optimizes displacements (e.g., jumps). FASM also provides the ability to produce executable files in a single step, without using a linker, that further speeds up the development process. As this is being written, there is considerable activity centered around FASM in various newsgroups and on the Win32Asm community board. FASM is syntactically similar to NASM, so conversion from NASM to FASM is very simple. As such, FASM is siphoning off a lot of NASM users who are dissatisfied with the progress being made on the NASM assembler (new releases of NASM are few and far inbetween). Though FASM has fewer features than NASM, FASM's development is continuing and many programmers see FASM as NASM's ultimate successor. "

"HLA (the High Level Assembler) has recently seen a huge spike in use and interest"

" Another source of assembly related information can be found in many of the "I'm writing my own Operating System" web pages out there. Such articles tend to favor MASM and NASM (these seem to be the most common choices for those who do OS development in assembly). "

https://en.wikipedia.org/wiki/High-level_assembler lists TASM, NASM, MASM, HLASM, Linoleum, Ziron, HLA.

random guy likes FASM: http://www.programmersheaven.com/discussion/346161/whats-the-best-assembler-compiler

"Though MASM and TASM provide sophisticated features, the drawback is that you often have to learn and use these sophisticated features in order to write simple assembly language programs"

"The place where assemblers vary greatly is with respect to the selection of pseudo-opcodes, assembler directives, macros and compile-time language facilities, and design philosophy. Many of the newer assemblers, for example, provide far weaker support for advanced features like macros than do more mature assemblers like MASM and TASM."

"assemblers like MASM ((and HLA, he says in the next paragraph)) do some limited type checking on their operands as a minimalistic "sanity check". "

"MASM sets the standard for a full-featured assembler. We can easily divide the other assemblers into two camps - those that are less powerful than MASM and those that are more powerful than MASM. Overall, we can state that TASM and HLA are more powerful than MASM, while all the other assemblers are less powerful. However, it's difficult to make such a sweeping generalization because if you fixate on one particular feature, a given assembler may do a better job with that feature than MASM (or some other assembler). Therefore, we'll take a look at some of the important features that assemblers support."

"Most assemblers out there provide byte, word, dword, qword, tbyte, and floating point data types (32, 64, and 80 bit floats). ...MASM took the early lead in this area, providing STRUCTs, UNIONs, and a TYPEDEF statement that lets programmers create their own data types. TASM took this one step farther by adding support for CLASSes in assembly language. HLA took the concept of user-defined types to a new level with its TYPE declaration section plus support for sets, thunks, and other advanced data types.

Users of other assemblers have discovered the problems with the lack of sophisticated data typing facilities in their assemblers. For example, NASM has added a pseudo-structure to the language because of the problems of not having structures in Win32 assembly programming. Other assemblers offer half-hearted measures (e.g., using macros to attempt to simulate structures) with varying degrees of success.

One big advantage of the HLA language, when it comes to data types, is that it uses a syntax that is quite similar to high level languages like C/C++ and Pascal/Delphi when defining data types. Though HLA's syntax seems strange to people who only know assembly language, those who work in high level languages as well as assembly language generally find HLA's syntax far more readable in the data declaration sections. "

" MASM also set the standard for macro, conditional assembly, and compile-time language facilities. Only HLA, which was created specifically because MASM's macros were insufficient to support the 32-bit edition of "The Art of Assembly Language Programming" exceeds MASM's capabilities in this area.

NASM and FASM also have an interesting feature in its macro facility - the ability to save and restore macro "contexts". This provides the ability (with a bit of work) to simulate a facility in HLA known as context-free macros. Without getting into the technical details, context-free macros give you the ability to create nestable control structures like if..endif or while..endwhile. NASM's users, for example, use this facility to create macros to simulate IF..ENDIF statements that are nestable (unlike, say, the kludge that RosAsm uses to attempt to simulate the same thing).

    HLA (without question the most powerful macro and compile-time language facilities): 1
    MASM/TASM: 2
    FASM: 3
    NASM: 4
    Gas, RosAsm, others

"

"FASM?s macro abilities are better than the ones of NASM" [12]

"

Don't excatly know what NASM can do, but this may give you a hint of fasm's marco syntax:

http://www.decard.net/?body=tajga&chapter=preproc http://flatassembler.net/docs.php?article=manual#2.3.3 http://board.flatassembler.net/forum.php?f=14 - Look in some threads..

...

same for nasm: http://nasm.sourceforge.net/doc/html/nasmdoc4.html#section-4.3.2

The only thing i cannot find in nasm is the "%var in <eax,edx,ecx,ebx>" thing. Dunno how often one uses this, though. "

" I found some features very useful, compared to NASM : anonymous labels, macro support, and computing on symbols. For example of the last, you can define a label and then compute things like Code: ((my_lbl shr 12) and 0xFF00) or 0x3 which is very useful to construct tables, like GDT and IDT without the need of construction code.

" Maybe you could elaborate a bit more on that one, but NASM had a lot of features concerning labels, among other local labels (e.g. you can have .loop at several places provided they are in different functions), macro-local labels, etc.

Quote: macro support,

NASM has it too, and NASM's macros are especially powerful, allowing a context stack to be manipulated (so that your macros can nest, etc)

Quote: and computing on symbols. For example of the last, you can define a label and then compute things like Code: ((my_lbl shr 12) and 0xFF00) or 0x3 which is very useful to construct tables, like GDT and IDT without the need of construction code.

okay, that one was a bit harder to get working since i got messages saying "operator >> can only be applied on scalar values", but just patching to

  [org 0]
  my_lbl:
  my_modified dd (((my_lbl-$$) >> 12) & 0xff00)


Honnestly, the only "real" argument i see to prefer FASM is that it is self-compiling... yet i'm not sure it is a so great advantage: you could easily compile a version of NASM that matches your own OS on your DevStation (tm) and run the binary in your OS after that... "

ppl on http://forum.osdev.org/viewtopic.php?f=1&t=10402&start and https://www.programmersheaven.com/discussion/272721/fasm-masm-tasm-nasm-i-dont-know-adriaaann seem to prefer fasm to nasm

" if you don't already know x86 assembly language then you've only got five reasonable alternatives: MASM, TASM, NASM, Gas, and HLA. These are the only reasonable choices because they're the only assemblers that have books geared to the beginner written for them. " "NASM takes the good parts of MASM and fixes the mistakes (such as the offset operator). "

http://forum.osdev.org/viewtopic.php?f=1&t=10402&start=15 seems to slightly prefer nasm/fasm to gas

SEE ALSO: https://news.ycombinator.com/item?id=11514123 and parent https://news.ycombinator.com/item?id=11513716 http://cs.lmu.edu/~ray/notes/x86assembly/ http://wayback.archive.org/web/20160307203948/http://staff.mmcs.sfedu.ru/~ulysses/Edu/MP/WhichAsm.html http://www.nasm.us/doc/nasmdoc2.html#section-2.2 for specifics on why NASM syntax is better than GAS http://nullprogram.com/blog/2015/04/19/ summarizes http://x86asm.net/articles/what-i-dislike-about-gas/

so far sounds like the ones to check out are:

"FASM...is so extensible it doesn't actually have to be limited to any of the architectures listed. For example you can use it as a fully working DCPU-16 assembler using macros. " [13]

---

We should also survey macro processors:

should also check out other 'higher level' assembly languages:

HLA: https://en.m.wikipedia.org/wiki/High_Level_Assembly
TAL: https://www.cs.cornell.edu/talc/default.html
COQASM: http://research.microsoft.com/en-us/um/people/nick/coqasm.pdf

---

"MASM, TASM, and HLA provide several high-level-like control structures (e.g., IF, WHILE, REPEAT..UNTIL, and FOR) ...none of NASM, FASM, or RosAsm? support these...Nevertheless, each of these assemblers provide macros to simulate these control statements (with varying degrees of success). Unfortunately, such macros rarely do as good a job as the real statements in languages like HLA or MASM. "

---

biggest complaints about GAS syntax:

" The use of % sigils on all registers is tedious. I’m sure it’s handy when generating code, where it becomes a register namespace, but it’s annoying to write.

    Integer constants are an easy source of bugs. Forget the $ and suddenly you’re doing absolute memory access, which is a poor default. NASM simplifies this by using brackets [] for all such “dereferences.”"

"My one complaint ((about NASM syntax)) would be that it’s that it’s too flexible about labels. The colon on labels is optional, which can lead to subtle bugs. NASM will warn about this under some conditions (orphan-labels). Combined with the preprocessor, the difference between a macro and a label is ambiguous, short of re-implementing the entire preprocessor in Emacs Lisp."

[14]

GAS (ATT syntax) has dest argument last, Intel syntax has dest argument first. [15] argues for dest, src (Intel).

In Intel syntax, an un-sigil'd number means immediate constant (in gas, it means abs addr).

In Intel, x86-32 bit addressing is eg:

 push segment:[base + scale*index + displacement]

eg

 sizeof.ITEM = 8
 ITEM.foo = 2
 mov ax, [table + sizeof.ITEM*ebx + ITEM.foo]

" x86-64 addressing

Adding support for AMD64 architecture caused great troubles in all assemblers. AMD64 provides RIP-relative addressing. With this addressing, you can make position-independent code very easily. This addressing is what you want 99.9% times when writing 64-bit code. Using absolute addresses is still possible, but pretty much limited to 4GB address space, and they require relocations. The only exceptions are mov instructions (opcodes A0h-A3h), where absolute 64-bit addressing is enabled.

Some assemblers (YASM, GAS) still use absolute addresses by default in 64-bit mode, and they require explicit notation for RIP-relative addressing. This makes writing position-independent code a pain.

 mov variable(%rip), %eax

Other assemblers (FASM, MASM) use RIP-relative addressing by default.

...

 A feature that I lack in GAS is the ability to assign a size to a label. In many other assemblers you can assign a size to a label, and if that label is used as an address and no explicit size is given for the instruction, the assembler uses the size associated with the label:
 var1 dd 10
 label var2 dword
 mov [var1], 10
 cmp [var2], 0

...

External Symbols

Very annoying thing in GAS is that it treats all undefined symbols as external dependencies. That means that if you mistype some name, your program compiles fine, and you find the error only later, during linking, and have to go back to the assembly source.

If you are unlucky, you may happen to have a symbol with that name defined in another module; linking will succeed too, and you have a nasty bug to look for. If GAS didn't behave this way, you could catch such bugs immediately.

As for other assemblers i know, only FASM and MASM solve this issue properly. See my External Dependencies in Assemblers article for more details.

"

" NASM

In NASM, you declare external dependencies using extern. The problem is that every symbol declared as extern always produces an external dependency in the resulting object file. There is no way to declare an external dependency on a symbol only if the symbol is used.

The only way "solve" this problem is enforcing the use of macro on the user of the library. This macro itself declares symbols as external, and symbols cannot be used with anything other than the macro. ...

MASM

In MASM, you declare external procedure using the PROTO statement, and external data should be declared using EXTERNDEF. These produce dependency only when the symbol is actually used. This is fine for our purposes, and suits MASM's high-level style well.

For its users, MASM solves this problem properly. ... FASM

 In FASM you have the extrn directive that works just like the ones in MASM and NASM. It always creates a dependency, even when the symbol is not referenced in the sources. But FASM also has the operator used, which checks whether the symbol is used anywhere in the source. You can use it to conditionally compile the extrn statement. So you declare every external symbol this way:

 if used something
 extrn something
 end if

"

"In Intel syntax instructions are not suffixed. In AT&T there is a suffix (q, d, w, b) depending on the operand size.

Hexadecimal values are prefixed with 0x instead of suffixed with h. Registers are prefixed with %.

  mov     eax,1
  mov     ebx,0ffh

vs.

  movl    $1,%eax
  movl    $0xff,%ebx"

" But also the notation for accessing memory is different: For encoding the SIB (Scaled Index Byte) (+ disp[lacement], if desired), Intel uses [base+disp+index * scale], while AT&T uses disp(%base, %index, scale). Thus we have "

"

 On the other hand, when the size of the operand can't be concluded from the instruction, in Intel syntax you have to add 'BYTE PTR', 'WORD PTR', 'DWORD PTR' or 'QWORD PTR' to disambiguate the situation. For example
  mov [ebx], 2

is not unique, so in Intel's syntax you have to write respectively

  mov BYTE PTR [ebx], 2
  mov WORD PTR [ebx], 2
  mov DWORD PTR [ebx], 2
  "

" Intel's syntax is most common, not the AT&T one."

" NASM and MASM use what is sometimes called the Intel syntax, while GAS uses what is called the AT&T syntax. GAS uses % to prefix registers GAS is source(s) first, destination last; MASM and NASM go the other way. GAS denotes operand sizes on instructions (with b, w, l suffixes), rather than on operands GAS uses $ for immediates, but also for addresses of variables. GAS puts rep/repe/repne/repz/repnz prefixes on separate lines from the instructions they modify MASM tries to simplify things for the programmer but makes headaches instead: it tries to "remember" segments, variable sizes and so on. The result is a requirement for stupid ASSUME directives, and the inability to tell what an instrction does by looking at it (you have to go look for declarations; e.g. dw vs. equ). MASM writes FPU registers as ST(0), ST(1), etc. NASM treats labels case-sensitively; MASM is case-insensitive. "

http://www.nasm.us/doc/nasmdoc2.html#section-2.2

" NASM Is Case-Sensitive

NASM Requires Square Brackets For Memory References

NASM was designed with simplicity of syntax in mind. One of the design goals of NASM is that it should be possible, as far as is practical, for the user to look at a single line of NASM code and tell what opcode is generated by it. You can't do this in MASM: if you declare, for example,

foo     equ     1
bar     dw      2

then the two lines of code

        mov     ax,foo 
        mov     ax,bar

generate completely different opcodes, despite having identical-looking syntaxes.

NASM avoids this undesirable situation by having a much simpler syntax for memory references. The rule is simply that any access to the contents of a memory location requires square brackets around the address, and any access to the address of a variable doesn't. So an instruction of the form mov ax,foo will always refer to a compile-time constant, whether it's an EQU or the address of a variable; and to access the contents of the variable bar, you must code mov ax,[bar].

2.2.3 NASM Doesn't Store Variable Types

NASM, by design, chooses not to remember the types of variables you declare. Whereas MASM will remember, on seeing var dw 0, that you declared var as a word-size variable, and will then be able to fill in the ambiguity in the size of the instruction mov var,2, NASM will deliberately remember nothing about the symbol var except where it begins, and so you must explicitly code mov word [var],2. "

https://courses.engr.illinois.edu/ece390/archive/mp/f99/mp5/masm_nasm.html

" NASM Is Case-Sensitive

Uninitialized (.bss) Section

Labels are Everything.

This was true in MASM too, but people for some reason believe that the "proc" stuff was magical. NASM has no built-in proc, so everything is a label. A subroutine can be nothing but a label that you "call" to instead of "jmp" to. Of course the call will push the return address on the stack, and the "ret" at the end of your subroutine will pop it, so you can't just "jmp" to your subroutine or it will return to some random value that just happened to be on the stack.

Also, you may notice the widespread use of .labels. These are "local" labels that are local to the last label without a dot before it. This is so you don't have to worry about having two labels with the same name. You also don't need a colon after a label name, but then the label has to appear on a line by itself.

NASM Requires Square Brackets For ALL Memory References

NASM Doesn't Store Variable Types "
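A quick sketch of the local-label point (routine names invented): both routines can reuse the name .loop because each .loop is local to the nearest preceding non-local label.

 routine1:
 	mov ecx, 10
 .loop:
 	dec ecx
 	jnz .loop
 	ret

 routine2:
 	mov ecx, 20
 .loop:
 	dec ecx
 	jnz .loop
 	ret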

"NASM syntax is much better than MASM syntax. "

so probably:

"FASM is better maintained and has superior macro support. Also it's multi-platform."

Best x86 assembler poll (on a FASM site):

http://wayback.archive.org/web/20150815022227/http://board.flatassembler.net/topic.php?t=6222

bogdanontanu (Sol_Asm guy?) says "However, from an experienced ASM programmer's point of view, the greatest downside of FASM is its macro system, which is very complicated to read and understand and has a big design flaw."

sol_asm has " Macros with local variables and argumets: MACRO/MARG/EXITM " http://www.oby.ro/sol_asm/docs/sol_asm_manual.htm#7.6

---

quick notes on AT&T asm syntax:

" the prefix of an operand indicates its type register names are prefixed by % as in %rbp, constants are prefixed by $ as in $16 and symbols are not prefixed as in f; operations suffixed by 'q' operate on 64-bits operands, operations suffixed by 'l' operate on 32-bits operands; the 'e' version of a register (eax) is used to designate the low 32-bits of the corresponding 64-bits 'r' register (rax); the destination operand is the second, meaning that add %rax, %rbx will add rbx and rax and store the result in rbx; operands of the form n(%reg) designate the memory address stored in reg shifted by an offset of n. " -- [16]

---

So in summary so far:

Check out:

note: "FASM...is so extensible it doesn't actually have to be limited to any of the architectures listed. For example you can use it as a fully working DCPU-16 assembler using macros. " [17]

and some specifics:

---

so what's 'terse'?

http://www.terse.com/howdoes.htm

the TERSE statements:

	eax = ebx; bx + dog; cat - 14; cx & 0Fh; dx - 123?

generate the following assembly code:

	Mov	eax,ebx
	Add	bx,dog
	Sub	cat,14
	And	cx,0Fh
	Cmp	dx,123

This is an if-then-else. It sets eax to one if eax was ten, otherwise it sets it to zero. It does this with a Cmp, a Jnz, a Mov, a Jmp Short, and a final Mov.

	eax - 10 ? =={ eax = 1; },{ eax = 0; };

This one's a while statement. It initializes dx to 1000 and then goes into a delay loop, decrementing dx and looping back to the top of the loop while dx is non-zero. The code generated for this one is a Mov, a Dec, and a Jnz.

	dx = 1000; { dx-; }<>;
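Spelling those two descriptions out as instructions, the expansions would be roughly as follows (label names invented here; Terse generates its own):

	Cmp	eax,10
	Jnz	else_part
	Mov	eax,1
	Jmp	Short done
else_part:
	Mov	eax,0
done:

and for the while loop:

	Mov	dx,1000
loop_top:
	Dec	dx
	Jnz	loop_top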

	es = ax = 0A000h;               \ Mov ax,0A000h; Mov es,ax;
	ax + bx - cx & dx;              \ Add ax,bx; Sub ax,cx; And ax,dx;

Notice that chained assignments are performed right to left and computations are performed left to right.

English: Q: How do you select 640x480x16c graphics mode? A: Use BIOS interrupt 10h with: ah=0; al=13h.

Q: How do you write a buffer of data to a file handle? A: Use DOS interrupt 21h with: ah=40h; bx=handle; cx=count; ds=seg buffer; dx=offset buffer. On return, carry is clear if successful.

Terse:

	Q:	How do you select 640x480x16c graphics mode?
	A:	Use BIOS:
		ah=0; al=13h; !10h;
	Q:	How do you write a buffer of data to a file handle?
	A:	Use DOS:
		ah=40h; bx=handle; cx=count;
		ds=S(buffer); dx=O(buffer);
		!21h; <<Error;

	es = ax = 0A000h;               \ Mov ax,0A000h; Mov es,ax;
	=ax =bx =cx;                    \ Push ax; Push bx; Push cx;
	ax= bx= cx=;                    \ Pop cx; Pop bx; Pop ax;

---

Terse doesn't seem too useful to us, since it appears to obscure the bytecodes that will be generated; concise, but not so good for beginners

---

"

A very basic assembler

An assembler for a simple processor, like one of the Microchip PICs, is relatively easy. All instructions are the same length, and the standard basic assembly language doesn't contain fancy syntax for addressing modes.

All you're left to do is create a simple parser (the only complication is handling expressions), maintain a symbol table for labels and their values, keep track of where the next available location is, and generate absolute (nonrelocatable, nonlinkable) code.

The generated code can be in any format acceptable to a PIC programmer. A number of EPROM programmers can program PICs.

Two passes are sufficient to handle the "forward reference" problem for the PICs. The first pass determines the values of all labels. The second pass evaluates all expressions, and generates code.

A typical restriction on equates (EQU) is that any labels used in the expression are previously defined. That allows the equate label to be defined on the first pass.

Common complications

Conditional assembly, macros, linkable modules. Each subject is a long discussion.

For the ambitious

The standard Intel, and thus Microsoft, syntax for the x86 has some characteristics that make it especially difficult to handle. Bit format is determined by operand kinds. Some instructions have both a short form and a long form. Some keywords are both instruction names and expression operators (NOT, AND, OR).

And if you want to implement all the high-level features of MASM, it could take a while to work out... Posted on 2001-08-22 13:17:32 by tank "

"

It all depends on what you call an Assembler. As "lifewire" says (though i suspect, with Scalp, that he never tried it...), writing a simple encoder is simple enough. On one hand, you have a source; on the other hand, a set of tables that you use to translate instructions into as many encodings. If the source is assumed perfect and the syntax has zero flexibility, you might do this in 2 weeks, if you are clever and know the x86 organisation well.

Now, the real fact is that this is just NOTHING. You also have to handle all the possible errors the user will make, to give some flexibility to the syntax, and to implement a complete Macro Parser. All this is about one year of work for a simple assembler like MASM or NASM. For a full-featured Assembler, i mean with IDE, Resource Editors, Linker, Debugger and so on, this is between 3 and 5 full years of work.

Some examples in SpAsm? development:

Needless to say, for such things, you first need to have enough money for a living without working outside.

If you want to take a look at what looks like an Assembler, i recommend FASM, not SpAsm? (i have written the SpAsm? encoder in a very particular way that is entirely designed for speed -and this is not at all the regular way-. In fact -i only see it now- the speed of encoding is of zero interest, as the main compiling time is eaten by Macro and Equate unfolding, not by encoding, which takes almost no time anyway).

NASM is very standard and good programming, too, but, alas, written in C. "

" Practically I have started with a simple file containing only a NOP :D

There are so many parts to be done: file management, tokenizer, state machines (lexer), code generation, directives/sections management, HLL, MACROS, OBJ formats generation, IMPORTS and EXPORTS, PE BIN EXE generation...

And a lot of TESTING and unit testing "

[18]

---

NASM vs FASM (article by FASM author):

" With NASM, when you define the variable of some name, this name becomes a constant equal to the address of that variable. Therefore every label is just a constant. Nice and simple, but it was one of those things in NASM that made me dissatisfied. Because I was used to the fact, that when I defined some variable as byte:

        alpha db 0

and then tried to access it like this:

        mov [alpha],ax

TASM would refuse to accept it, because I tried to store some larger data in a smaller variable. This feature caught many mistakes and I felt I could not give it up. But I still liked the idea of a label being treated just like a constant equal to its address, as it made such instructions:

        mov ax,alpha
        mov ax,[alpha]

a straightforward analogy of:

        mov ax,bx
        mov ax,[bx]

"

---

" By the way, don't even consider NASM in current state: There is YASM, which is AFAIK superior to NASM in every way i can think of, maybe except some history."

---

HLA macros:

http://www.plantation-productions.com/Webster/HighLevelAsm/HLADoc/HLARef/HLARef_html/HLAReference.htm#50401371_pgfId-998916

http://www.plantation-productions.com/Webster/www.artofasm.com/Linux/HTML/Macros.html

http://www.plantation-productions.com/Webster/www.artofasm.com/Linux/HTML/DSLs.html

---

So in summary so far:

Check out:

and some specifics:

---

have a 'relation/RDF' instruction (eg the 3 operands form a triple)? could also have various annotations work like this.

---

http://www.plantation-productions.com/Webster/HighLevelAsm/HLADoc/HLARef/HLARef_html/HLAReference.htm#50401371_pgfId-999779 seems to show that control flow constructs such as 'if' and 'while' really should have (context-free) macros rather than just custom instructions (macroinstructions)

should we reserve some of the opcode space for macros? probably

---


Footnotes:

1. big gap

2. fair-sized gap

3. gap

4. big gap