i've been reading about Lua's VM from
http://luaforge.net/docman/83/98/ANoFrillsIntroToLua51VMInstructions.pdf
which is an excellent, easy-to-read document. I'm interested in the VM because Lua seems somewhat small, and the VM seems somewhat small, and i am looking at it as a basis set of operations for Lua (a core language). Seen from this perspective, there are some unnecessarily complex parts of the VM, doubtless included for efficiency, such as:
so i guess an assembly for Oot would want to be less complex. Other thoughts:
---
so, addressing modes:
(note: in this scheme there is no stack addressing mode!)
an alternative:
an alternative:
1 bit says: constant
if constant: one bit says: immediate constant or constant table constant
if non-constant: one bit says: direct or indirect
stacks, and the top-two elements of each stack, are memory-mapped (so, as described above, with 4 stacks, that's 16 memory-mapped locations; one bit is whether this is a straight read/write or a pop/push, and one bit is whether this is to the TOS or the second-element-of-stack, and two bits are which stack)
could also have one bit for whether we are accessing 'special' memory, like the memory-mapped stacks, or 'ordinary' memory; 'special' memory has the properties that (a) when you read from a special memory location, you might get a value different from what you last wrote to it, (b) 'reads' might cause mutation (e.g. when they are interpreted as POPs). We might extend this to say that special memory is 'volatile' and then accesses to it should be executed immediately and non-reordered because they might have arbitrary side-effects. We might also say that special memory is always thread-local and never shared.
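a toy sketch of how that 'special' memory might behave (the interface and details here are made up, just to illustrate that a 'read' can mutate):

    # toy model of 'special' memory-mapped stack slots: a read of a pop-flavored
    # location mutates the stack, so these are not ordinary loads and should not
    # be reordered
    class SpecialMemory:
        def __init__(self, nstacks=4):
            self.stacks = [[] for _ in range(nstacks)]

        def read(self, stack, depth, popping):
            # depth 0 = top-of-stack, depth 1 = second element
            s = self.stacks[stack]
            return s.pop(-1 - depth) if popping else s[-1 - depth]

        def write(self, stack, depth, pushing, value):
            s = self.stacks[stack]
            if pushing:
                s.insert(len(s) - depth, value)   # becomes the new TOS (depth 0) or second slot (depth 1)
            else:
                s[-1 - depth] = value             # plain overwrite of the mapped slot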
also, as noted above, we need an 'emulation'/'cascaded addressing mode' bit (at least for direct and indirect addressing, do we also need it for constants?). do we need PC-relative addressing? do we need table (eg displacement) addressing?
if we have: constant bit, direct/indirect bit, emulation bit, special bit, that's 4 bits, so if we have 16-bit words, we have 12 bits left, ie 4k. note that in this scheme, addresses can be up to 4k but constants can be up to 16k, because constants don't need 'emulation' or 'special' bits.
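to make the bit budget concrete, a decoding sketch (the bit positions are my own choice; only the widths and meanings come from the scheme above):

    # decode a 16-bit operand field under the scheme above:
    # constants get a 14-bit value (16k), non-constants a 12-bit address (4k)
    def decode_operand(field):
        if field >> 15:                                   # constant bit
            immediate = (field >> 14) & 0x1               # immediate vs constant-table
            return ('const', immediate, field & 0x3FFF)   # 14-bit value
        direct    = (field >> 14) & 0x1                   # direct vs indirect
        emulation = (field >> 13) & 0x1
        special   = (field >> 12) & 0x1
        return ('addr', direct, emulation, special, field & 0xFFF)  # 12-bit address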
(Instead, could use the ARM-ish barrel-shift-ish encoding where the 12 low-order bits of immediate constants are interpreted straightforwardly, but the two high-order bits shift the others, e.g. if the two high order bits are 00 then the value is the value given by the 12 low-order bits, if the two high order bits are 01 then the value is 2*(the value given by the 12 low-order bits), if 10 then 4*(the value given by the 12 low-order bits), and 11 is multiplication by 8, allowing representation of all values up to 4k, and some values with trailing zeros up to 32k. This sounds useful/efficient but is slightly more complicated, so i probably won't do it)
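for reference, the decode for that shifted-immediate idea would be just a small sketch like:

    # barrel-shift-ish immediate: low 12 bits are the value, the top 2 bits of
    # the 14-bit immediate field multiply it by 1, 2, 4, or 8
    def decode_shifted_immediate(field14):
        return (field14 & 0xFFF) << ((field14 >> 12) & 0x3)

    assert decode_shifted_immediate(0xFFF) == 4095
    assert decode_shifted_immediate((0b11 << 12) | 0xFFF) == 4095 * 8   # trailing-zero values up to ~32k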
---
in [1] i wrote:
" Note that Haskell is implemented a bit differently from many languages; this stems from Haskell's properties:
It is therefore assumed that Haskell programs will, at runtime, very frequently be dealing with unevaluated thunks, with closures which represent unevaluated or partially evaluated functions, and with the application of functions whose arity was unknown at compile time. "
due to these considerations, the GHC compiler appears at first rather complicated. For example, in reading about it i heard about 'spineless tagless G-machines', and i found a paper describing that and read about a weird system in which, to evaluate a function whose arity is only known at runtime, you put a bunch of potential arguments on the stack and then jump into code for the function, and then IT decides how many arguments to consume; if it generates another function as a result, it calls that function directly to consume the rest of the arguments, etc. Then in later papers i found that they dropped this weird 'push/enter' system for the simpler 'eval/apply' system that i would have expected (the caller looks up the arity of the function at runtime and then calls the function with only the arguments it will use, then calling the output of the function if appropriate). I also found that they later switched from a tagless system to one with tags (not type tags, but rather a tag for whether an object has been evaluated yet, and for which outermost constructor was used to generate the object; this is type-specific because different types have different sets of constructors, so we kind of have a type-within-a-type, and the type is known at compile time but the choice of constructor, the type-within-a-type, is not!). And then there are info tables describing each object, and the garbage collector has to read these, and the garbage collector also apparently can sometimes help in propagating tags for some reason, and it all sounds very complicated.
but now that i've read about that, i wonder if i can't boil down the basic operations that the runtime is trying to do into an intermediate VM language, and leave questions like push/enter vs. eval/apply, and of how/where you store the stuff that Haskell puts in type tags and info tables, to the VM implementation (compiler or interpreter). Then Oot Core could do all this Haskelly stuff but the reference implementation could be comprehensible. Of course, the optimized reference implementation could compile this intermediate language into something more efficient, like Haskell does.
Now, what is the difference between GHC Core and the thing i am talking about? I guess it's that i'm talking about an intermediate VM language for what instructions are being executed at runtime, rather than a representation of the source code that is being compiled.
operations like: APPLY (what else)
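for concreteness, a toy sketch of what an eval/apply-style APPLY might do (this is just my illustration, not GHC's actual mechanism; the names are made up):

    # eval/apply: the caller inspects the callee's arity at runtime and decides
    # between partial application, an exact call, and applying the result to
    # any leftover arguments
    class Closure:
        def __init__(self, code, arity, env=()):
            self.code = code      # a Python callable standing in for compiled code
            self.arity = arity    # total number of arguments the code expects
            self.env = env        # arguments captured so far

    def apply(fn, args):
        args = fn.env + tuple(args)
        if len(args) < fn.arity:                      # too few: build a partial application
            return Closure(fn.code, fn.arity, args)
        result = fn.code(*args[:fn.arity])            # exactly enough: ordinary call
        extra = args[fn.arity:]
        return apply(result, extra) if extra else result   # too many: apply the result to the rest

    add3 = Closure(lambda a, b, c: a + b + c, 3)
    assert apply(apply(add3, [1]), [2, 3]) == 6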
also, other 'runtime'-y things, such as COW (copy-on-write)
http://www.haskell.org/haskellwiki/Ministg
so, this suggests that expressions have a place in oot assembly; not only because we could have an AST instead of a linear sequence of VM instructions, but also because (for the same reason that lazy languages are like a reduction of expression graphs) thunks must contain expressions. also, we should think about how to generalize/unify this with dynamically passing around references to imperative code.
one piece of the puzzle that is missing is how to combine delimited continuations and friends (eg metaprogramming based on direct manipulation of the call stack) with haskell-style graph reduction. this may also be related to the puzzle of how to have tail calls but also support good stack traces; as ministg points out, "Lazy evaluation means that the evaluation stack used by the STG machine is not so useful for diagnosing uncaught exceptions. The main issue is that the evaluation context in which thunks are created can be different to the evaluation context in which thunks are evaluated. The same problem does not occur in eager languages because the construction of (saturated) function applications and their evaluation happens atomically, that is, in the same evaluation context. ". In fact, you might say that in such a language there are not 2, but 4, stack contexts for each function: the lexical (source code) ancestors of the function definition in the AST, the lexical (source code) ancestors of the function application in the AST, the ancestors in the call stack at the time of the function's creation, and the ancestors in the call stack at the time of the function's call. should also look at Di Franco's effort to combine call-stack metaprogramming with constraint propagation
people say 'synchronize' and have target language instructions like SYNC but what they really mean is often 'WAIT until the other thread gets here, or WAIT until the other thread does this', so i should call that opcode WAIT.
the idea of haskell data constructors as a type-within-a-type is interesting and possibly a fundamental way to generalize/simplify. So now we seem to have at least 3 kinds of types: atomic types (Hoon's odors), composite types (C structs, Python objects, Python and Haskell and Lisp lists, Python dicts, Hoon's structural typing), and the choice of data constructor within a Haskell type (eg is this List a Cons or a NilList? is this Maybe a Nothing or a Just?). in addition, there are interface types vs representation types, which may be an orthogonal axis
why does a G-machine reduce a graph and not a tree (recursion, but why?)? http://stackoverflow.com/questions/11921683/understanding-stg says "Haskell's pervasive recursion mean that Haskell expressions form general graphs, hence graph-reduction and not tree-reduction" but i'm not sure i understand that; wouldn't we just be working with a lazily generated (possibly infinite) tree?
another primitive that we may need to deal with is fexprs, that is, (first-class) functions that don't evaluate their arguments first: https://en.wikipedia.org/wiki/Fexpr (how is this different from any function in the context of a lazy language? should we just generalize this to strictness annotations on each argument?)
APPLY
COW
WAIT
GETTABLE? SETTABLE?
CPUID -- args in reg, first call CPUID(0) to get highest arg supported: http://en.m.wikipedia.org/wiki/CPUID; vendor id, processor id, feature bits, serial number, core topology, cache config, virtual and physical address sizes
GENSYM (see below)
MALLOC, MEMFREE
UNDEF (done with the contents of this register)
VARTYPE
PERSPECTIVE
PUSH
POP
GET
SET
CPY (deep copy; value copy, actually COW?)
MOV
FILL-IN-TEMPLATE (like python's '%' on strings, but with graphs)
MATCH (graph regex)
REPLACE (graph regex)
and stuff about grammars (parsing and producing?) and stuff about logic and oberon stuff (while loops, etc)?
SHALLOWCPY
ALIAS, GETALIAS (see [2])
DIRECTGET, DIRECTSET (non-indirection versions of array indexing; take the array stride as arguments; 1d arrays only?) (or, just GET and SET with modalities, but that would decrease code density)
ENTER, LEAVE (enter and leave block scopes; something similar may be/seems to be present in Ruby YARV bytecode and Perl5)
PUTSELF (both Lua and Ruby YARV appear to have an instruction to reference Self/This)
note: list of ops copied to and continued at [3]
from [4]: some core operations from Guile's VM:
a new (to me, not to the world) interpretation of Nock's core instructions:
select, identity, nock, isLeaf, successor, isStructurallyEqual:
in other words:
(note: no mention of stack manipulation/continuations/jumping here, except for 'run a function within an environment')
(note: we can construct functions because we can build an AST as data and then execute it as code using Nock)
(note: why is this 'functional'? maybe b/c: (a) operations are defined in terms of macros ie combinators, (b) to use eg the environment you copy it, you don't mutate it)
this interpretation shows the similarity between even Nock and various other core languages:
so some stack ops:
DROP DUP SWAP OVER ROT DUP2 DROP2 PICK TUCK
DRIP: a b c ... -> a a c ... (MOV reg1 reg2)
DIP: a b c ... -> b b c ... (MOV reg2 reg1)
NUP: a b c ... -> a b b c ... (PUSH reg2 onto the (rest of the) stack)
NIP: a b c ... -> a c ... (POP the (rest of the) stack into reg2)
TUCK: a b c ... -> a b a c ... (PUSH reg1 onto the (rest of the) stack)
TAKE: a b c ... -> c b ... (POP the (rest of the) stack into reg1)
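a toy model of these, treating the top two stack slots as reg1/reg2 and everything below as 'the rest of the stack' (just to pin down the semantics; index 0 is TOS):

    def drip(s): s[1] = s[0]            # a b c ... -> a a c ...
    def dip(s):  s[0] = s[1]            # a b c ... -> b b c ...
    def nup(s):  s.insert(2, s[1])      # a b c ... -> a b b c ...
    def nip(s):  s[1] = s.pop(2)        # a b c ... -> a c ...
    def tuck(s): s.insert(2, s[0])      # a b c ... -> a b a c ...
    def take(s): s[0] = s.pop(2)        # a b c ... -> c b ...

    s = ['a', 'b', 'c', 'd']
    tuck(s); assert s == ['a', 'b', 'a', 'c', 'd']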
PUSH POP OVER UNDER PEEK
cow, fold, map as core ops
---
some things u see in a lot of languages:
APPLY (call a function, possibly with partial arguments or currying)
APPLYTAIL
return
GOTO/JMP
conditional JMP
comparisons, esp. equality
a way to make a composite data structure
a way to index into or destructure that composite data structure
arithmetic and other 'primitive functions'
looping constructs
variable assignment
a way to make a closure (a function with an environment of variables to be used in that function)
switch/jump table (special case: 'case' of an ADT)
try, raise, catch
call FFI
and some have:
call/cc, delimited continuations, and other 'first-class stack frame metaprogramming'
first-class, executable representation of code
polymorphic dispatch
the nock operator (do something to the environment, then do something to the code, then evaluate the resulting code upon the resulting environment)
some popular data structures:
tuples
linked lists (used as binary trees in the Lisp or Nock way, e.g. first/rest)
random access/packed lists/arrays/vectors
dicts
graphs?
functions, closures, objects, tagged unions/ADTs
stacks seem to be used mostly for keeping track of evaluating an expression, which is shaped like a tree, linearly (eg in the AST you don't need the stack, but in bytecode you do). the call stack can be thought of as an extension to this
could have calls with arbitrarily many arguments; have the second operand have a sentinel that means 'composite argument', which is found outside of the fixed-width bytecode (are the composite args embedded following the instruction, or in a special constant table?)
---
---
it should be made clear that the use of <16-bit wide operands does not imply a word size of 16 bits. In fact, the word size is (unspecified? or specified to be infinite?). The <16-bit operands merely impose (a) an upper limit on the number of registers, (b) an upper limit on the number of entries in the constant table for a given code segment (but you can always go to a new code segment with a new constant table), (c) an upper limit on the size of immediate values encoded right in the operand, rather than in a constant table. Note that i say <16-bit operands because (reverse search for 'addressing mode') some of the bits in the operand field are used to select the addressing mode (so we only have between about 12 and 14 bits left for the actual values, depending on the addressing mode used, and on how we decide to encode addressing modes).
---
analysis of 8-bit and 16-bit video game consoles:
(note: often i say K when i mean KiB)
I looked at the tech specs of the NES (8-bit), the SNES (16-bit) and the Sega Genesis (sometimes called "16 bit" but appears to have a 32-bit processor with 24-bit addressing and a 16-bit wide data bus; this is sometimes called 16/32-bit). See data at [5], sections "8-bit video game consoles" and "16-bit video game consoles".
My main objectives were to determine how much memory was addressed, and what the instruction sets were. I am particularly interested in the question of whether the <16-bit operand field sizes that i am contemplating in Oot Assembly would be in the ballpark of the amounts of memory needed for systems like these.
Copying from the summary i wrote over there:
" The instruction sets were 6502 (and 65C816, a 6502 extension) (NES, SNES) and 68000 (Genesis). The address sizes were 16-bits (NES), 24-bit (SNES, Sega Genesis). The builtin RAM was 2 K (NES), 128 K (SNES), 72 K (Genesis). In-cartridge memory was typically up to 8 K, but could go up to 64K with 8K banks (NES; unsure about the other 2). (Program) ROM sizes were cheaply 32K (NES), typically up to 256K (NES) or 4 MiB? (SNES), max 512K (NES), in banks of 32K (NES).
Builtin graphic RAM was 2 K (NES) or 64K (SNES, Genesis), cartridge-expandable in banks of 8 KiB? (NES). Max typical CHR RAM was ~128 KiB? (NES, MMC1). Critical numbers regarding max # of sprites were 8 (NES), 64 (NES). "
How does this square with my <16-bit operand size limit proposal?
16 bits would be enough to directly address:
16 bits would NOT be enough to directly address:
But now because of addressing mode, i have less than 16 bits. So let's say i have 13 bits (8k addressable).
13 bits is enough to directly address:
13 bits would NOT be enough to directly address:
12 bits would not be enough to address any of these items.
How does this square with parallelism? The NES sprite critical numbers 8 and 64 seem relevant.
Conclusions:
---
so, if 13 bits is underpowered, maybe we should take the addressing mode bits out of the operands and put them in the first of the 4 16-bit fields, reducing the number of opcodes. We had too many opcodes anyways. If we only need 32k, that's just 15 bits, but since that's close there may be some benefit to having the 3 operand field lengths be an even 16 bits instead of a non-power-of-2. Perhaps on some architectures we'll have to execute fewer operations in order to emulate/execute Oot VM code if each of the 3 operands is an even 16 bits (not sure about that b/c we still have to unpack the same bits out of the first field).
Also, come to think of it, if Nintendo provides for 32K of ROM and that's too little for most cartridges so that bankswitching is common, then i guess that even a small program on an 8-bit system is often >32K, so Oot programs will surely be at least this big. And i guess it's kind of weird and potentially inefficient to have to put program offsets into a constant table when the program is the inner core of an interpreter. And i could be wrong about all these being the right numbers, so having at least 64k to address, when i think 4k to 8k to 32k is needed, future-proofs it a tiny bit.
A disadvantage to that is that if we need 3 addressing mode bits per operand, that's 9 new bits that we now stuff into the first field, leaving only 7 bits. If the first field also has an emulation bit, then there's only 6 bits left, e.g. 64 opcodes, the minimum that we contemplated.
Otoh, we are trying to be minimalistic here, so maybe that's okay.
as i said earlier, as with the Dalvik VM on Android, we expect that there will be faster VM languages to implement Oot in, and that implementations bent on efficiency might compile Oot VM (OVM, i guess) to an idiosyncratic or platform-specific faster language. So not having enough opcodes isn't really a big problem.
Also, we're going to have to have a Jets-like system to replace at least anything in the standard library, and maybe anything in the standard distribution, with optimized platform-specific FFI calls, so we don't need to put e.g. every flavor of addition into Oot Assembly (ovm? osm? oosm? oovm?)
Now, if we are doing this, we really can't go below 6 bits for the opcode, so the last scheme described above, in which each operand had 4 addressing mode bits (constant bit, direct/indirect bit, emulation bit, special bit), is too expensive. Even if you left those in the operands, that would mean the operand values would be only 12 bits, and the analysis above calls for at least 13 bits.
The obvious thing to do is to drop the 'special bit' and rely on a slightly dirtier (but more frugal with memory) 'zero page'-ish memory-mapped specials architecture for the stacks, and rely on more traditional mechanisms for threadlocals and volatiles (which makes some sense, because the amount of these that you'll want or have will depend on the platform and the application, and will not always correspond to approximately half of main memory).
so to summarize, the current scheme appears to be:
undefined word size
64-bit instruction size
instructions divided into 4 16-bit fields: ctrl, op1, op2, op3.
ctrl is a 16-bit field divided into smaller fields as follows:
hmm... the emulation here seems a bit unbalanced; what's the use of op1, op2, op3 getting to address 16 bits when emulating if ctrl can only address 6 bits? So let's just have an EMU instruction and give up on the emulation bits. so now the current scheme appears to be:
ctrl is a 16-bit field divided into smaller fields as follows:
it's tempting to take those 4 bits back and add custom opcodes and custom addressing modes...
but also, don't we need an immediate mode? so how about 3 bits for addressing mode:
and, as noted above, reserve one bit out of the remaining 7 bits for 'custom opcodes'. Leaving 6 bits for non-custom opcodes (64 opcodes).
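to pin the current scheme down, a little encode/decode sketch (the field order within ctrl is my arbitrary choice; only the widths come from the scheme above):

    # 64-bit instruction = 16-bit ctrl + three 16-bit operands;
    # ctrl = 6 opcode bits + 1 'custom opcode' bit + 3 addressing-mode bits per operand
    def encode(opcode, custom, modes, ops):
        assert opcode < 64 and custom < 2
        assert all(m < 8 for m in modes) and all(o < 0x10000 for o in ops)
        ctrl = opcode | (custom << 6) | (modes[0] << 7) | (modes[1] << 10) | (modes[2] << 13)
        return ctrl | (ops[0] << 16) | (ops[1] << 32) | (ops[2] << 48)

    def decode(instr):
        ctrl = instr & 0xFFFF
        opcode, custom = ctrl & 0x3F, (ctrl >> 6) & 1
        modes = [(ctrl >> s) & 0x7 for s in (7, 10, 13)]
        ops = [(instr >> s) & 0xFFFF for s in (16, 32, 48)]
        return opcode, custom, modes, ops

    i = encode(opcode=5, custom=0, modes=[1, 2, 0], ops=[100, 200, 300])
    assert decode(i) == (5, 0, [1, 2, 0], [100, 200, 300])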
---
this doesn't leave any bits for some of my other oot-y ideas, such as:
if we were to do any of these, we'd need to either use prefixes (annotations), or increase the instruction length. Instruction length is currently 64 bits (4 16-bit units), so the next power of 2 is 128 bits, giving us 4 more units. If we did that we could:
as noted elsewhere above, putting the type info separate slows down execution because each opcode implementation must dispatch on type; it's more efficient to just have a separate opcode for each operation-type combination.
arguments against this:
arguments for:
---
i probably had better ideas above, but this makes me wonder again, what a 32-bit instruction length would look like:
1 form bit
1 reserved bit
6 addr mode bits (2 addr mode bits x 3 operands)
6 opcode bits
18 operand bits (3 6-bit operands)
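a decoding sketch of that short form (the field placement is my guess; only the widths come from the list above):

    def decode_short(instr):
        form     = (instr >> 31) & 0x1
        reserved = (instr >> 30) & 0x1
        modes    = [(instr >> s) & 0x3 for s in (28, 26, 24)]   # 2 addr mode bits x 3 operands
        opcode   = (instr >> 18) & 0x3F                         # 6 opcode bits
        ops      = [(instr >> s) & 0x3F for s in (12, 6, 0)]    # 3 6-bit operands
        return form, reserved, modes, opcode, ops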
again, however, if we want to fix the length (to 128 bits, since that's what the longer format needs), then maybe these would come in packets of 4:
1 form bit
7 reserved bits
4*30 = 120 bits
also, if we are getting all complicated here, perhaps we should make the 'long' format shorter by just giving up on the 16 bits (!) and saying, go ahead and use the constant table more? No, because we want 16-bit indices into the constant table
well wait, what if we only had 11-bit indices into the constant table? we'd save 20 bits in the long format. we already have 16 unused bits, so that's 36 bits. we can give up 12 of the type info bits, for 48 bits saved. we can give up 12 of the flag bits, for 60 bits saved. we can give up 4 opcode bits, and now we're back to a 64-bit instruction length, albeit one with non-power-of-2 field lengths. In other words:
note the same argument about the EMU addr mode bit being useless if the length of the opcode field is shorter than the operand fields applies here. So instead:
the four flag bits could then be: lazy, map, maybe, typecheck/boundscheck
some comments:
---
an adder has three inputs and two outputs: two normal inputs and a carry input, and a normal output and a carry output. This may have implications for what the ideal arity of assembly instructions is.
In traditional assembly, the status register is used to provide hidden inputs and hidden outputs for arithmetic instructions. This is a particular case of the same general idea as accumulator registers, wherein we make instruction encoding shorter by hardcoding certain inputs and outputs to/from special registers. One could imagine that instead of having a fixed status register, every arithmetic instruction would take an additional input saying which register to use as the status register for input, and an additional output saying which register to use as the status register for output. Clearly in the majority of cases this wouldn't help much over just having a fixed status register, so it would waste a lot of space in the instruction encoding (plus, make the execution more complex and hence slower). From the point of view of functional programming, however, this is abusing side effects where you really should be just taking inputs and returning outputs.
In a higher-level language, the need for a carry input could possibly be obviated; for example, the second input could be a tuple which is a pair of (second input, carry). All you really need is two inputs (because you need to have a pair() primitive to create tuples; there's probably a way to get around this with one input, but it would be ugly).
In a higher-level language, the carry output seems similar to an error code.
In a higher-level language, one might think the carry output could be eliminated by using an Option type (a type with special sentinel values, e.g. IEEE floating point's +Inf, -Inf, NaN); carry could be expressed by returning +Inf or -Inf. This almost does the trick; however, it can't express the difference between adding 1 (first input) + 1 (second input) + 1 (carry input), which yields 11 (1 at the carry output, and 1 at the value output), vs. adding 1 + 0 + 1 or 0 + 1 + 1 or 1 + 1 + 0, all of which yield 10; if you just used a carry sentinel, then both of these cases would yield +Inf (the carry sentinel).
In a higher-level language, two outputs could be expressed using tuples (in this case, a pair); the single output would be (carry, value).
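eg, a one-bit full adder written this way (just a sketch to make the shape concrete):

    # the carry comes in as part of a tuple input and goes out as part of a tuple
    # output, instead of living in a hidden status register
    def add_with_carry(a, b_and_carry):
        b, carry_in = b_and_carry
        total = a + b + carry_in
        return (total >> 1, total & 1)   # (carry, value)

    assert add_with_carry(1, (1, 1)) == (1, 1)   # 1+1+1 = binary 11
    assert add_with_carry(1, (0, 1)) == (1, 0)   # 1+0+1 = binary 10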
I don't really have any firm conclusions here, those are just thoughts.
---
i guess there's a very simple argument to be made for 16-bit addressing:
16 bits is the smallest power of 2 that could possibly comfortably work; 8 bits is only 256 memory locations, which is clearly too small for many very simple applications
---
more evidence for 16-bit addressing, and more citations for which 16-bit or less CPU families were popular/notable:
6502 (8-bit; Apple I, Atari, Apple II, Nintendo Entertainment System, Commodore 64, BBC Micro)
65C02 (8-bit backwards-compatible extension; Apple IIc, enhanced Apple IIe, BBC Master)
65C816 (16-bit backwards-compatible extension; SNES, Apple IIgs)

8080 (various CP/M systems)
Z80 (backwards-compatible with 8080)
8086, 8088 (16-bit, backwards-compatible ("source" but not binary) with 8080; IBM PC)
80286 (16-bit, predecessor of the 32-bit 386 which is the foundation of present-day x86)

6800 (8-bit; Altair 680)
6809 (8-bit, backwards-compatible (source-compatible) with 6800; TRS-80 Color Computer)
68HC11 (8-bit embedded, based on the 6800)
68HC12 (16-bit MCU, mostly backwards-compatible with the 68HC11, 16-bit program counter)
68HC16 (16-bit MCU, based on the 68HC11)
68000 (16-bit; Sun, Mac, Amiga, Atari ST, Sega Genesis, Sinclair QL, early laser printers, various MCUs)
MSP430 (16-bit MCU)
PDP-11
"The Z80 and its derivatives and clones made up one of the most commonly used CPU families of all time, and, along with the MOS Technology 6502 family, dominated the eight-bit microcomputer market from the late 1970s to the mid-1980s." -- http://en.wikipedia.org/w/index.php?title=Zilog_Z80&oldid=634577740
The Z80 ISA was an extension of the Intel 8080 ISA (binary compatible with 8080 code). The Z80 was an 8-bit architecture with 16-bit addressing [6]. It had 8-bit registers which were in pairs, so the pair could be used as a 16-bit register.
" New syntax
Because Intel had a copyright on their assembly mnemonics,[20] a new assembly syntax had to be developed. This time a more systematic approach was used:
All registers and register pairs are explicitly denoted by their full names
Parentheses are consistently used to indicate "memory contents at" (indirection, or pointer dereferencing) with the exception of some jump instructions.[21]
All load and store instructions use the same mnemonic name, LD, for LOAD (a return to the simplistic Datapoint 2200 vocabulary); other common instructions, such as ADD and INC, use the same mnemonic regardless of addressing mode or operand size. This is possible because the operands themselves carry enough information.
These principles made it straightforward to find names and forms for all new Z80 instructions, as well as orthogonalizations of old ones, such as LD BC,(1234).
Apart from naming differences, and despite a certain discrepancy in basic register structure, the Z80 and 8086 syntax are virtually isomorphous for a large portion of instructions. Only quite superficial similarities (such as the word MOV, or the letter X, for extended register) exist between the 8080 and 8086 assembly languages, although 8080 programs can be assembled into 8086 object code using a special assembler or translated to 8086 assembly language by a translator program.[22][23]

Instruction set and encoding
The Z80 uses 252 out of the available 256 codes as single byte opcodes ("root instruction"); the four remaining codes are used extensively as opcode prefixes:[24] CB and ED enable extra instructions and DD or FD selects IX+d or IY+d respectively (in some cases without displacement d) in place of HL. This scheme gives the Z80 a large number of permutations of instructions and registers; Zilog categorizes these into 158 different "instruction types", 78 of which are the same as those of the Intel 8080[24] (allowing operation of 8080 programs on a Z80). The Zilog documentation further groups instructions into the following categories:
8-bit arithmetic and logic operations
16-bit arithmetic
8-bit load
16-bit load
Bit set, reset, and test
Call, return, and restart
Exchange, block transfer, and search
General purpose arithmetic and CPU control
Input and output
Jump
Rotate and shift
No multiply instruction is available in the original Z80.[25] Different sizes and variants of additions, shifts, and rotates have somewhat differing effects on flags because the flag-influencing properties of the 8080 were copied. Load instructions do not affect the flags (except for the special purpose I and R register loads). "
-- [7]
" The first widely adopted 8-bit microprocessor was the Intel 8080, being used in many hobbyist computers of the late 1970s and early 1980s, often running the CP/M operating system; it had 8-bit data words and 16-bit addresses. The Zilog Z80 (compatible with the 8080) and the Motorola 6800 were also used in similar computers. The Z80 and the MOS Technology 6502 8-bit CPUs were widely used in home computers and second- and third-generation game consoles of the '70s and '80s. Many 8-bit CPUs or microcontrollers are the basis of today's ubiquitous embedded systems. " -- http://en.wikipedia.org/w/index.php?title=8-bit&oldid=628245098
" The address bus is typically a double octet wide (i.e. 16-bit), due to practical and economical considerations. This implies a direct address space of only 64 KB on most 8-bit processors. " -- http://en.wikipedia.org/w/index.php?title=8-bit&oldid=628245098
the http://en.wikipedia.org/wiki/Motorola_6800 also had a 16-bit address bus
wikipedia also lists "12-bit" in http://en.wikipedia.org/w/index.php?title=Template:Computer_architecture_bit_widths&oldid=633925183 but this isn't a power of 2 and isn't as popular as 8-bit and 16-bit
" The IBM System/360 introduced byte-addressable memory with 8-bit bytes, as opposed to bit-addressable or decimal digit-addressable or word-addressable memory, although its general purpose registers were 32 bits wide, and addresses were contained in the lower 24 bits of those addresses." -- http://en.wikipedia.org/w/index.php?title=8-bit&oldid=628245098
4-bit processors often have 10 to 20-bit addressing:
" The first commercial microprocessor was the binary coded decimal (BCD-based) Intel 4004,[2][3] developed for calculator applications in 1971; it had a 4-bit word length, but had 8-bit instructions and 12-bit addresses." -- http://en.wikipedia.org/w/index.php?title=4-bit&oldid=630773377
" The TMS 1000, the world's first single-chip microprocessor, was a 4-bit CPU; it had a Harvard architecture, with an on-chip instruction ROM with 8-bit-wide instructions and an on-chip data RAM with 4-bit words. " -- http://en.wikipedia.org/w/index.php?title=4-bit&oldid=630773377 . It had 1k bytes of ROM (addressed by 10 bits, in a 6-bit registers and a 4-bit register) and 32 bytes of RAM (addressed as 256 bits of RAM by 2 4-bit registers).
" The Saturn processors, used in calculators such as the commonly used HP-48 scientific calculator, are 4-bit machines; as the Intel 4004 did, they string multiple 4-bit words together, e.g. to form a 20-bit memory address, and most of its registers are 64 bits, storing 16 4-bit digits." -- http://en.wikipedia.org/w/index.php?title=4-bit&oldid=630773377
" Very often, when referring to the word size of a modern computer, one is also describing the size of address space on that computer. ... But this does not always hold. Computers often have memory addresses larger or smaller than their word size. For instance, almost all 8-bit processors, such as 6502, supported 16-bit addresses—[citation needed] if not they would have been limited to a mere 256 byte memory. The 16-bit Intel 8088 had only an 8-bit external memory bus on early IBM PCs, and the 16-bit Intel 8086 supported 20-bit addressing,[citation needed] allowing it to access 1 MiB? rather than 64 KiBs?