
proj-oot-ootAssemblyNotes8

https://en.m.wikipedia.org/wiki/Find_first_set gives more info on why count trailing zeros (ctz) and count leading zeros (clz) are popular operations; they are related to log base 2.
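A minimal sketch of the log-base-2 connection, in Python. The names `clz`, `ctz`, `floor_log2` and the fixed 16-bit width are assumptions chosen for illustration, not part of any Oot proposal. For nonzero x, floor(log2(x)) is `width - 1 - clz(x)`, and `ctz(x)` is the exponent of the largest power of two dividing x.

```python
WIDTH = 16  # assumed word width, chosen only for illustration


def clz(x: int, width: int = WIDTH) -> int:
    """Count leading zero bits of x viewed as an unsigned `width`-bit value."""
    if not 0 <= x < (1 << width):
        raise ValueError("x does not fit in the given width")
    return width - x.bit_length()


def ctz(x: int, width: int = WIDTH) -> int:
    """Count trailing zero bits of x; returns `width` for x == 0 (one common convention)."""
    if not 0 <= x < (1 << width):
        raise ValueError("x does not fit in the given width")
    if x == 0:
        return width
    # x & -x isolates the lowest set bit; its bit_length minus 1 is that bit's position.
    return (x & -x).bit_length() - 1


def floor_log2(x: int, width: int = WIDTH) -> int:
    """floor(log2(x)) for x > 0, expressed via clz."""
    if x <= 0:
        raise ValueError("log2 is undefined for x <= 0")
    return width - 1 - clz(x, width)
```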

---

why have meta bits in the graph data representation, rather than only having meta bits in access fns, eg in the arguments to whatever ops we have in program code to mimic __get and __set? b/c we want the graph to be able to 'modify' accesses, ie, views, ie we want a node to be able to have an edge whose target is not just another node, but the 'meta view' of that other node.

---

so is the stuff i'm thinking of meta level 3? stuff like the 'metagraph'? no, actually, right now i think that stuff is still just meta level 2. Recall, metalevel 1 ('hyper') is 'a message about a message'; metalevel 2 ('meta') is 'a message about the connection between (a message about a message M) and the underlying/target message M'; metalevel 3 is about the metalevels, that is, the relation between metalevel1 and metalevel2. Now, you could say that the metagraph is itself metalevel3 because it may contain nodes representing 'views' corresponding to different meta-levels. But it probably contains a node representing a view for metaedges, and another node representing a view for meta-meta-edges. This is distinct from having one node representing 'hyper' and another node representing 'meta'. Also, metalevel 2 already contains the concept of an edge that points to an edge. Also, recall that metalevel 3 is analogous to perspective, eg the way that a 2-d perspective view of a 3-d scene has converging lines to cram 3-d information into 2-d in a 'bent' way; which is analogous to 'rotating' information and presenting it in different ways; a single 'metagraph' doesn't seem to contain this (although perhaps it would if there were ways to transform the metagraph to represent metainformation in different ways, such as transforming from a representation in which (there is one node representing a view with metaedges and another node representing a view with metametaedges) into a representation in which (there is one node representing metalevel1 and another node representing metalevel2...)).

Similarly, one might think that the 'metagraph' is in fact metalevel4 because it contains many levels in one graph, but i don't think it is, as level 4 is supposed to be even further out there than level 3, and in fact is supposed to possibly generate/include new metalevels.

---

We only allow 12 bit operands in instructions. We are a '16-bit' system; however it is possible for the actual memory to be larger than 2^16, and for the (opaque) pointers to be implemented as larger than 16 bits. However, the 12/16 difference suggests that on at least some naive or small implementations, there will be 4 bits 'left over' in some cases. We could use these as type tags or 'fat pointer' bounds checks [1].
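A minimal sketch of how the 4 leftover bits could carry a tag next to a 12-bit operand in one 16-bit cell. Putting the tag in the high 4 bits is an assumption, as are the names `pack_cell` and `unpack_cell`; nothing in these notes fixes the layout.

```python
TAG_BITS = 4
OPERAND_BITS = 12
TAG_MASK = (1 << TAG_BITS) - 1          # 0xF
OPERAND_MASK = (1 << OPERAND_BITS) - 1  # 0x0FFF


def pack_cell(tag: int, operand: int) -> int:
    """Pack a 4-bit tag and a 12-bit operand into a 16-bit cell (tag in the high bits, by assumption)."""
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag must fit in 4 bits")
    if not 0 <= operand <= OPERAND_MASK:
        raise ValueError("operand must fit in 12 bits")
    return (tag << OPERAND_BITS) | operand


def unpack_cell(cell: int) -> tuple[int, int]:
    """Split a 16-bit cell back into (tag, operand)."""
    if not 0 <= cell <= 0xFFFF:
        raise ValueError("cell must fit in 16 bits")
    return cell >> OPERAND_BITS, cell & OPERAND_MASK
```

For example, `pack_cell(0xA, 0x123)` gives `0xA123`, and `unpack_cell` recovers both fields.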

How would we use these 4 bits? How does Haskell GHC use pointer tagging? Who else does that?

---

" nabla9 5 hours ago

Memory allocation feature asked is not a language feature. It has to be baked into web assembly if you want to use portable byte masking with pointers.

It's very low level implementation level detail that enables fast execution of dynamically typed languages. Boxing has high cost and it consumes memory. Type tags embedded in pointers can be very fast. For that to work, you need objects that are aligned with power-of-two byte boundaries.

Adding this feature enables efficient execution strategy for scripting and dynamic languages.

"
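A sketch of the technique the comment describes: when objects sit on power-of-two boundaries, the low bits of every address are zero and can hold a type tag. The 8-byte alignment and the names `tag_pointer` and `untag_pointer` are assumptions for illustration; addresses are modelled as plain integers.

```python
ALIGNMENT = 8              # assumed object alignment in bytes (must be a power of two)
TAG_MASK = ALIGNMENT - 1   # 0b111: three low bits are free under 8-byte alignment


def tag_pointer(address: int, tag: int) -> int:
    """Store a small type tag in the always-zero low bits of an aligned address."""
    if address < 0 or address % ALIGNMENT != 0:
        raise ValueError("address must be non-negative and aligned")
    if not 0 <= tag <= TAG_MASK:
        raise ValueError("tag does not fit in the free low bits")
    return address | tag


def untag_pointer(tagged: int) -> tuple[int, int]:
    """Recover (address, tag) from a tagged pointer by masking."""
    return tagged & ~TAG_MASK, tagged & TAG_MASK
```

No boxing is involved: checking the type costs one mask, and recovering the address costs another.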

---

i guess if we only had 4 types, Parrot's "INSP" is a good bet: integers, numbers (floats), strings, other. We could also replace 'floats' with either 'boxed' or 'pointer (reference)' or 'native pointer' or 'dyn'.

---

XPC primitive types:

-- [2]

---

some random ideas on what it might take to make an IL (such as WebAssembly?) readable:

http://wllang.com/ https://github.com/wllang/wll

---

"Usually, a method accepts either a string or a number or a function or an object, etc" [3]

---

created ootVm.txt; moved old proposal from ootAssemblyThoughts to below this; copied current proposal from top of ootAssemblyNotes7.txt to ootVm.txt.

todos:

---

perhaps we put Java-esque restrictions on the data stack for safety? We still need to permit call stack manipulation though, so the HLL can easily implement continuations.

A different strategy would be to let the language implementation do whatever it likes, and only check user code libraries for stack safety restrictions.

todo

---

OLD proposal from ootAssemblyThoughts.txt moved to here (new version copied to ootVm.txt):

motivations and goals

The idea here is that Oot might have 3 well-defined compilation stages:

Oot -> Oot Core -> this Assembly/Bytecode -> platform-specific code

"Oot" would be the high-level language. It would be built from Oot Core using Oot Core's metaprogramming facilities.

The rationale for Oot Core is:

The rationale for Oot Assembly is:

Oot Assembly will make porting Oot easier by:

Some properties we want Oot Assembly to have:

Some things that many traditional assembly languages do that we don't:

What do we mean by 'preserve intent', and why do we want that?

What we mean is "don't map a specific operation S to a special case of a general operation G if, by looking at the result in the form of G, you cannot discover that this is actually S without doing non-local analysis".

Some examples:

There are three reasons to 'preserve intent':

Efficiency: We want Oot Assembly to be reasonably efficient to interpret; however, efficiency should not be at the expense of preservation of intent or customizability. We expect that performance-minded platform-specific implementations might compile Oot Assembly into custom high-performance bytecode formats anyways.

Examples of the consequences of this choice:

current proposal

note: as of 201602 this has been superseded by the proposal in ootAssemblyNotes7.

Design goals

Ease of implementation

(Massively) parallelizable decoding

supports linear instruction stream; also optionally supports tree structure

Message constituents with a hard upper bound on their length in bits (in this case, 32 bits is the upper bound of werds, and the payload is max length of 24 bits)

supports at least 12-bit addressing (in fact, we support 24-bit addressing, although using some addressing modes on operands of more than 12 bits requires two instructions)

Preservation of HLL intent

Extensibility

note: efficiency is NOT a primary design goal; as with the Dalvik encoding of JVM bytecode, or the LuaJIT bytecode, we expect that performance-minded implementations may create their own variant encodings. Eg the primary purpose of the 64-bit frame alignment is to support parallelized decoding (although efficiency is a secondary design goal).

Note that in this design, the operation might be indirectly specified via a reference to memory, rather than being immediately specified in the bytecode. This should be useful for encoding of calling first-class functions assigned to local variables.

Syntax

Note that the syntax of Oot Bytecode is a very general syntax that could be used for multiple languages (by varying the operations available, the modalities, the constraints, and the semantics of memory cells). Indeed, within Oot, this syntax is used for at least two 'languages'; one is Oot Instruction Bytecode and the other is Oot Graph Bytecode.

Sentences and phrases: Oot Bytecode is divided into variable-length sentences. Sentences consist of one or more variable-length phrases. Phrases consist of one or more werds.

Each werd is either 16 bits or 32 bits. A 16-bit werd consists of an 8-bit header and an 8-bit payload. A 32-bit werd consists of an 8-bit header and a 24-bit payload.
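A minimal sketch of the two werd sizes. Two things here are assumptions: that the header occupies the most significant 8 bits, and that the caller already knows the werd's size (how a decoder tells 16-bit from 32-bit werds is not stated in this paragraph). The names `encode_werd` and `decode_werd` are hypothetical.

```python
HEADER_BITS = 8
VALID_SIZES = (16, 32)  # werd sizes in bits


def encode_werd(header: int, payload: int, size_bits: int) -> int:
    """Combine an 8-bit header with an 8-bit or 24-bit payload (header in the high bits, by assumption)."""
    if size_bits not in VALID_SIZES:
        raise ValueError("a werd is either 16 or 32 bits")
    payload_bits = size_bits - HEADER_BITS
    if not 0 <= header < (1 << HEADER_BITS):
        raise ValueError("header must fit in 8 bits")
    if not 0 <= payload < (1 << payload_bits):
        raise ValueError("payload does not fit in this werd size")
    return (header << payload_bits) | payload


def decode_werd(werd: int, size_bits: int) -> tuple[int, int]:
    """Split a werd of known size into (header, payload)."""
    if size_bits not in VALID_SIZES:
        raise ValueError("a werd is either 16 or 32 bits")
    if not 0 <= werd < (1 << size_bits):
        raise ValueError("werd does not fit in the stated size")
    payload_bits = size_bits - HEADER_BITS
    return werd >> payload_bits, werd & ((1 << payload_bits) - 1)
```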

An aside on terminology: The natural language linguistic concept of 'word' is a good fit for what is here called a 'werd' because, like linguistic words, Oot Bytecode werds are composed of a 'root morpheme' (the payload) and other syntactical and modifier morphemes (the header). However, in computing, the term 'word' is already in use to refer to architecture-specific fixed-length 'words', so to avoid confusion, i changed the spelling slightly. If this annoys you, feel free to call them "words"; in fact, in most contexts, i use the spelling "word" myself, and i only use "werd" when i am particularly worried about confusion with architecture-defined words.

Werds are grouped into 64-bit frames. Phrases cannot span multiple frames unless the parentheses construct is used.

Each phrase has one of 8 'roles' within the sentence. Each werd has one of (the same 8) subroles within the phrase. It's possible for multiple phrases within a sentence, or for multiple werds within a phrase, to have the same role or subrole. The ordering of phrases or werds with the same role or subrole is significant. Otherwise, the ordering is insignificant (except that some roles have positional constraints, namely, the first werd of a phrase is always subrole 0, and phrase roles 6 and 7 can only appear as the first phrase of a 64-bit frame).

The bits in the 8-bit werd header are as follows:

EOS/BOP: if this is the first werd in the 64-bit frame, then it's an EOS. Otherwise it's a BOP. EOS means that this 64-bit frame is the last 64-bit frame in the current sentence. BOP means that this werd begins a new phrase.

role: if this is a BOP, then this is the role of this phrase within the sentence, and the role of this werd within the phrase is role 0. Otherwise, this is the subrole of the werd within the phrase. The 8 roles are:

todo: what about 'metadata'?

addressing modes:

the 'split' modes split the payload in half (so an 8-bit payload is split into 2 4-bit payloads, and a 24-bit payload into 2 12-bit payloads), and apply the specified addressing mode to each half, then combine the two in a 'GET' (or index-into) operation. For example, mode 5 retrieves the contents of the memory cell indicated by the first half of the payload (the direct mode), and then finds the index within that data structure indicated by the second half of the payload; for example, if the first half of the payload is '3' and the second half is '5', and memory location 3 contains an array, then the effective address indicated would be the 5th element within this array.
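The worked example above (first half 3, second half 5) can be sketched as follows. Assumptions, none of them fixed by the text: the 'first half' is the high-order half of the payload, memory is modelled as a Python dict of cell number to value, GET is plain zero-based Python indexing (whether '5th element' means index 5 or index 4 is left open), and the function names are hypothetical.

```python
def split_payload(payload: int, payload_bits: int) -> tuple[int, int]:
    """Split an 8-bit or 24-bit payload into (first_half, second_half)."""
    if payload_bits not in (8, 24):
        raise ValueError("payloads are 8 or 24 bits")
    if not 0 <= payload < (1 << payload_bits):
        raise ValueError("payload does not fit")
    half = payload_bits // 2
    return payload >> half, payload & ((1 << half) - 1)


def resolve_split_direct(memory: dict, payload: int, payload_bits: int):
    """Follow the mode-5 example: read the cell named by the first half,
    then index into its contents with the second half."""
    cell, index = split_payload(payload, payload_bits)
    container = memory[cell]   # direct mode on the first half
    return container[index]    # GET, modelled here as list indexing


# Memory cell 3 holds an array; payload 0x35 splits into (3, 5).
memory = {3: [10, 11, 12, 13, 14, 15, 16]}
element = resolve_split_direct(memory, 0x35, 8)
```

Since GET is language-specific, a real implementation would dispatch on the language's own definition rather than on list indexing.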

the semantics of memory cells, indirect addressing, and the GET operation are language-specific

more details on some of the roles:

role 0 phrase details

role 1 phrase details

in this role, addressing modes yield the value found at the effective address

werds with subrole 2 within this phrase can be used to pass 'named arguments'; eg the name of the argument would be in subrole 2, and the value of the argument would be in subrole 1 (or implicit in subrole 0)

role 2 phrase details

in this role, addressing modes yield an effective address

werds with subrole 2 within this phrase can be used to pass 'named return arguments'; eg the name of the return argument would be in subrole 2, and the lhs expression of the return argument would be in subrole 1 (or implicit in subrole 0)

role 3 phrase details

todo: there should be a way to include a single 8-bit modality payload, but also a way to include arbitrary settings of named modality fields to values.

role 4 phrase details

role 5 phrase details

role 6 phrase details

Role 6 is generically defined as anything that can be translated in a context-independent manner (that is, the translator is only allowed to look at a contiguous group of 64-bit frames at a time, and must be stateless in between groupings) into one or more sentences.

When the payload is 8 bits, the 8-bit payload is modality (that is, it is interpreted as if there were a single-werd phrase with role modality, where the werd had role 0, and this was the payload of the werd), and the rest of the 64-bit frame is as follows:

When the payload is 24 bits, the first 8 bits select the number of 64-bit frames included in the packed segment, the next 8 bits specify a language-specific packed representation, and the last 8 bits are language-defined. So far this is not used by Oot.
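A minimal sketch of unpacking the three 8-bit fields of a 24-bit role 6 payload. Reading 'first 8 bits' as the most significant byte is an assumption, and `PackedSegmentHeader` and `decode_role6_payload` are hypothetical names.

```python
from typing import NamedTuple


class PackedSegmentHeader(NamedTuple):
    frame_count: int        # number of 64-bit frames in the packed segment
    representation: int     # selects a language-specific packed representation
    language_defined: int   # meaning left to the language


def decode_role6_payload(payload: int) -> PackedSegmentHeader:
    """Split a 24-bit role 6 payload into its three bytes (most significant first, by assumption)."""
    if not 0 <= payload < (1 << 24):
        raise ValueError("payload must fit in 24 bits")
    return PackedSegmentHeader(
        frame_count=(payload >> 16) & 0xFF,
        representation=(payload >> 8) & 0xFF,
        language_defined=payload & 0xFF,
    )
```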

role 7 phrase details

Role 7 reinterprets the addressing mode field to indicate what type of grouping construct is present. All role 7 constructs apply at the granularity of entire 64-bit frames.

The payload of role 7 is always split; the first half of the payload is how many 64-bit frames are spanned by the construct, and the second half is the 'type' of the construct. Frame lengths 0 and 1 may have special meanings, todo. Note that grouping-ending constructs such as right parens have an identical payload to the matching left-parens; since this includes the number of frames spanned, this makes it efficient to jump from the right parens to the corresponding left parens (or vice versa). The most-significant-bit of the 'type' of the construct is reserved. For quasiquote, this means whether or not there is any antiquote within this quasiquote; the semantics of this bit for other modes is reserved for future use.
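A sketch of decoding a role 7 payload and of the jump that the identical left/right payloads make cheap. Several things are assumed rather than specified: the 'first half' is the high-order half, the reserved bit is stripped from the returned type, and `matching_open_frame` takes the frame count to include both the opening and the closing frame. Frame lengths 0 and 1 are rejected there because their meaning is still a todo. All names are hypothetical.

```python
from typing import NamedTuple


class Grouping(NamedTuple):
    frames_spanned: int
    construct_type: int   # type field with the reserved bit removed
    reserved_bit: bool    # for quasiquote: whether an antiquote occurs inside


def decode_role7_payload(payload: int, payload_bits: int) -> Grouping:
    """Split a role 7 payload into frame count, construct type, and reserved bit."""
    if payload_bits not in (8, 24):
        raise ValueError("payloads are 8 or 24 bits")
    if not 0 <= payload < (1 << payload_bits):
        raise ValueError("payload does not fit")
    half = payload_bits // 2
    frames = payload >> half
    raw_type = payload & ((1 << half) - 1)
    reserved = bool(raw_type >> (half - 1))          # most significant bit of the type
    construct_type = raw_type & ((1 << (half - 1)) - 1)
    return Grouping(frames, construct_type, reserved)


def matching_open_frame(close_frame_index: int, frames_spanned: int) -> int:
    """Index of the opening frame, ASSUMING the span counts both end frames."""
    if frames_spanned < 2:
        raise ValueError("lengths 0 and 1 may have special meanings (todo)")
    return close_frame_index - (frames_spanned - 1)
```

Because the closing werd repeats the opening werd's payload, the same arithmetic with the sign flipped would take a decoder from the opening frame to the closing one.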

As an optional extension, a language may support arbitrary-length constructs. In this case, if the length of the payload is the maximum value (2^24 - 1), then this indicates that the construct is actually arbitrarily longer than 2^24 - 2, and at displacement 2^24 - 2 will be found another construct opening werd. Most languages don't support this arbitrary-length construct feature, in which case payload value 2^24 - 1 is illegal.

some of these constructs are used to embed things from a different 'language' within bytecode of a default language. todo figure out which ones?

todo: can parens only enclose single phrases, or can they enclose whole sentences?

todo: What about EOS bits within these constructs?

todo is region annotation of length 0 a point annotation?

todo: can 'sentence length' also specify a 'foreign language' sentence?

note: the difference between 'foreign languages' in role 7 vs language-specific packed representations is that a role 7 'foreign language' uses the same format/syntax as described here, but varies the semantics (eg list of operations, list of modalities, list of constraints, semantics of memory cells, semantics of indirect addressing mode and GET), whereas role 6 is an extension mechanism to contain segments of arbitrary format/syntax.

languages

a language using this syntax must define:

and must either define the semantics of or forbid the use of:

and may define:

numerology

There are 2^24 accessible memory cells (24-bit addressing). The size of the constant table must be less than 2^24.

Because an 8-bit payload can be split into two 4-bit payloads in split addressing mode, memory cells 0-15 can be accessed somewhat more easily than others. Similarly, constant table entries 0-15 can be accessed more easily than others.

Similarly, the largest memory cell location that can be accessed using split mode addressing is 2^12 - 1 (using a 24-bit payload split into 2 12-bit payloads). Similarly, constant table entries up to 2^12 - 1 can be accessed more easily than others.
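The limits quoted in this section follow directly from the payload widths; a short arithmetic check (the helper name `split_operand_limit` is hypothetical):

```python
def split_operand_limit(payload_bits: int) -> int:
    """Largest value that fits in one half of a split payload."""
    if payload_bits not in (8, 24):
        raise ValueError("payloads are 8 or 24 bits")
    return (1 << (payload_bits // 2)) - 1


small_limit = split_operand_limit(8)    # halves of an 8-bit payload: cells 0..15
large_limit = split_operand_limit(24)   # halves of a 24-bit payload: cells 0..2^12 - 1
total_cells = 1 << 24                   # 24-bit addressing
```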

profiles

There are many aspects of this format which aid extensibility but which may make implementation more difficult. Therefore, although they are described above, the 'default profile' turns them off.

todo: let's have short construct length limits by default so that a default-profile interpreter doesn't have to reserve much memory

segment format

todo: need to specify constant table format, format for file containing both constant table and bytecode, etc? or is this implementation-dependent? (i'm leaning towards implementation-dependent, although in that case we should still specify it for the Oot language)

natural languages

some inspiration was taken from the syntax of natural languages (i took a course once in head-driven phrase structure grammar, so i wouldn't be surprised if this turned out to be particularly close to that). In addition, this syntax is probably sufficient for encoding most natural language sentences (assuming you are willing to do things like sometimes use multiple Oot Bytecode sentences to represent one natural language sentence).

Here's how you might encode some natural language constructs:

semantics: Oot Instruction Bytecode

indirect: todo; this probably will have something to do with aliasing or symlinking but it's not certain (see ootAssemblyNotes5)