Bayle Shanks's website: proj-oot-ootBrainNotes1

custom multiple flavors of indirect addressing

like the types of meta

connection to the RDF verb/hypergraphs?

subsume the two fundamental hypergraph gets (normal and reified) into one operation with multiple flavors?

---

(lack of) call stacks in the brain; what do we have instead?

are call stacks just an abstraction to make it easier to write stuff down? yes, when they are used in a lambda-calculus-y (with no continuations) way; but no, when 'metaprogrammy' stuff like 'break' within a while loop is used, or when 'return' is used in more than one place in the fn, or when exceptions are used for non-exceptional control flow; in these cases, they are usually being used to inject complex sequential/imperative control flow without writing it out explicitly

what does the brain have instead of call stacks? well, for ordinary (lambda-calc-y) use, it just has substructure of neurons... but perhaps the real answer is dataflow variables (labeled lines), since the 'caller' isn't sitting around waiting for the 'callee' to return, it just continues on with the computation with the current (old) value

---

so i guess parts of the paradigm of brainlike computation are:

there is no program counter; instructions are nodes that are all running in parallel (of course, for convenience, in Oot we'd provide PC-like sequencing operations anyways)
each node produces an output value which is duplicated and sent to one or more consumers; eg each node acts as both an instruction and a variable
variables are dataflow variables; when a variable's value changes, all nodes that take that variable as input immediately (or with some delay) recompute based on the new input
it's not clear if there are guaranteed delays/sequencing between when one node sees an update and when another one does. i guess in the brain there could be some circuits for which the delays that introduce sequencing in a certain order where that order is important, so we should provide that option; but in lieu of it, i guess the default is just that, when one variable updates, each of its consumers 'eventually' sees the update, with no guaranteed ordering between consumers
conditionals are generally implemented by running both branches at once, then having a 'gate' node at the end which looks at the condition and uses that to select the output from one or from the other branch
but since we do want this to be a useful programming language, you can abstract any set of nodes into a function, which then can be called multiple times from various other functions, as if each call was a separate duplicate of all of the nodes in the function.
We also permit recursive calls without predetermined call depth limits, although i dunno how the brain would implement that (seems like it would have to be a loop with a PC, right?).
also, dunno how 'routing in the brain' works, of course

---

ericjang 1 day ago

I wasn't aware that Danny Hillis was a student of Minsky's and that the Connection Machine mainframe series was created with brain simulation in mind. That's really cool.

When thinking about implementing the network topology for a large-scale brain simulation, the network topology should reflect the 3D spatial local-ness of the real brain (to avoid redundant N x N communication between units). One seems to either arrive at a fat-tree CM5 architecture or a 3D lattice of asynchronous processors (but this is not very general-purpose).

https://www.youtube.com/watch?v=blvC0DA96dI

versteegen 18 hours ago

But a 3D lattice is actually quite bad in the sense of having a very high graph diameter for the same degree (6) as the number of nodes goes to infinity, compared to other designs, though probably not so bad for low n. Still far better than a square lattice, of course. Wiring of processors in clusters and supercomputers is one of the main uses of graphs with low degree and high diameter (see [1]). I also know that twisted hypercubes were used for connection graph of nodes in supercomputers, because they have a number of nodes which is a power of 2, fairly simple and uniform routing rules (in general other graphs don't), large bisection bandwidth, low diameter, high (maximal?) fault tolerance, and more.

[1] https://en.wikipedia.org/wiki/Table_of_the_largest_known_graphs_of_a_given_diameter_and_maximal_degree reply

---

ah, here's someone else doing massively parallel 'brainlike' computing!

http://apt.cs.manchester.ac.uk/projects/SpiNNaker/architecture/

http://apt.cs.manchester.ac.uk/projects/SpiNNaker/software/

---

is 'Arc Computation' fundamentally different from ordinary ANNs eg 'Node Computation' (where the state is stored in node labels)? Consider a Node A which has 4 neighbors, one of which is Node B. In Node Computation, the states are at A and B, and the relationship between the state in Node A and the state in Node B is 'neighbor'. In Arc Computation, the state is stored in the four edges; the relationship between these edges is not just 'neighbor', because a different node intervenes between each edge and its 'neighboring' edges. So we can't directly transform the arc computation graph into a 'dual' graph where the nodes become arcs and the arcs become nodes (note: in some contexts, 'dual' graph refers not to this, but to interchanging vertices (nodes) and faces).

But perhaps we could just treat this as if the edges in the dual graph are labeled. Eg say that the node representing the edge A-B is 'adjacent' to the node representing the edge A-C via an edge labeled 'A'. One thing to note is that this is not a regular lattice/celluar automata; the label 'A' is not found tesselated throughout the graph, it is unique for those 'edges' representing node A in the original graph.

However, we're losing the information of what the edge directions are, and perhaps the constraint that each edge connects two nodes; eg the mapping from graphs to their node-edge duals may not be 'one-to-one' (is that correct?); otoh we may gain information in that perhaps the mapping is not 'onto' either (is it?)

Another similar transformation is to transform the original graph to a hypergraph, in which nodes are hyperedges. This transformation is exact (i think). So is Node Computation and Arc Computation the same in the hypergraph context?

---

clockless circuits:

so apparently a circuit with arbitrary delays (that computes a known output) is:

https://en.wikipedia.org/wiki/Delay_insensitive_circuit

and apparently that cannot be Turing-complete.

So in order to be able to build Turing-complete stuff, we must make some assumptions. One of the 'least' assumptions we can make is that we have access to 'isochronic forks', which are forks in which if one endpoint of the fork sees a signal, they can assume that the other endpoint has already seen a signal too. The model making this assumption is called Quasi-Delay-Insensitive (QDI).

"An asynchronous circuit in which all forks are assumed isochronic corresponds to what has been called a speed- independent circuit, which is a circuit in which the delays in the interconnects (wires and forks) are negligible compared to the delays in the gates. " [1]

my note: seems to me that in the brain, we are often in the opposite condition from speed-independent; almost all the delays are in the wires and forks (axons), not in the gates (cell bodies) (not sure about this for local axons though). What does the brain do about this?

My guess is that the brain uses plasticity to explore which forks are longer and shorter, and then assumes that this timing relation will not change (or not change too rapidly, too frequently -- it can be re-running the 'calibration plasticity' constantly, and tolerate some errors). I wonder what that model is called in electrical engineering?

according to those wiki pages and the paper [2], computation in QDI circuits proceeds like you might expect; you send some additional signals to sort of indicate "i'm done transitioning now" and then you have handshake protocols between each computing element that ensure that they don't compute mid-transition.

so i would guess that we would see such handshake protocols in the brain.

---

great-looking paper: http://www.async.caltech.edu/Pubs/PDF/2006_ieee_soc.pdf (already added to to-reads) one cool thing about it is that it uses a CSP-like language called CHP (Communicating Hardware Processes)

---

" All circuits presented in the previous sections have stable and noninterfering PR sets. But in order to implement selections in which the conditions are not mutually exclusive (nondeterministic choice), for instance, to select between an external interrupt signal and the internal B next-instruction [ signal in a microprocessor, or for synchronization between the clocked components of a GALS system, at least one operator must provide a non- deterministic choice between two true guards. Since stability and noninterference are equivalent to determinism, a nondeterministic computation does not admit stable and noninterfering circuit implementation. Therefore, any implementation of a nondeterministic computation contains nonmonotonic transitions that require special nondigital solutions. Those special circuits are encapsulated inside two primitive building blocks: the arbiter and the synchronizer . A fundamental, and long misunderstood, issue related to implementing a nondeterministic choice is that of metastability . In classical physics, it is impossible to put an upper bound on the time it takes for a device to make a nondeterministic decision between two alternatives. When the system starts in a state where the physical parameters are such that either alternative can be selected, thedevicemayenterametastablestateinwhichitmay stay an arbitrary length of time before deciding one way or the other [6]. "

my note: if the issue was really just that nondeterminism couldn't be done fast, then since in most circuit-related cases we don't actually need randomness, we could have a buffer of PRNG'd numbers that are used to make the decision (or, in most cases, such as the example given of an interrupt competing with a next instruction, we'd probably just want round-robin: 0101010101). So i suspect that i am misunderstanding, or the paper is misexplaining, the real source of the problem here.

---

so if we have stuff like eg a million object-specific recoginizers running in parallel, then need 'map' to run the recognizers plus a 'reduce' that specifically does something like a select/epoll or winner-take-all to focus on (select) one of the recognizers that thinks it recognized

brain (at a low level) probably has no program counter (PC), and branches are probably implemented by computing both branches then having post-select gates (although mb not; mb the nonlinear dynamic system of a neuron cluster can use the same neurons to compute various different computations depending on what state it is in, or upon its inputs)

routing in the brain?

need to have mini map/reduces (divergence and convergence) within one 'map' step. Probably each 'map' step is written in a dataflow-like manner.

---

So, the brain is very concurrent. How do we model its concurrent computational algorithms? Many higher-level programming language constructs seem to be architecturally unsuited to the brain, which presumably isn't well-modeled by a set of CPUs running programs step-by-step via a program counter, all with access to, and communicating via, a shared memory.

So, you might think, ok, look at EE stuff relating to computer architecture. Models of circuits have elements that 'run' simultaneously and take finite time to send signals to each other.

But, this isn't quite suited either; at a low level it seems to match, but the conventional architecture on top of that doesn't:

clocked circuits
circuit elements with small indegree and outdegree (compare to neurons, which often have very many afferents)
strict separation between CPUs and memory (compare to the brain, which appears to mix them)
a few fast CPUs (compare to the brain, which appears to support many slow CPUs)
statically known circuits with reliable parts (compare to the brain; the exact connections of every brain is different (in higher animals) and in fact even evolves over time; and neurons are relatively unreliable, in fact many individual neurons may even die without noticibly impacting function)

(by virtue of assuming that evolution optimizes, and by noticing how smart humans are, should we conclude that the brain's architecture is better, at least for running whatever sort of algorithms human brains run, and if robustness is desired? Not necessarily; in making brains, evolution operated under a constraint that don't apply to computers, namely, each human brain had to be grown by a human body, without using a large amount of power or rare nutrients; and the human body that grew it itself had to be grown in a similar manner. By contrast, computers can be centrally manufactored in huge factories with expensive hard-to-produce equipment that uses a lot of power. For example of what this might entail, it may be hard to grow a large brain with every connection specified by the design, but it may be easier to manufacture a computer chip with that property. )

This suggests that the distributed systems model (that i talked about in plPartConcurrency; namely, a set of interactive Turing machines (Turing machines + I/O), which can also interact with each other (send messages to each other using I/O), and in which sometimes messages are lost in transit or corrupted, and sometimes the individual machines fail.

but we also need timing delays...

---

how about a:

slowly varying timing model

You can assume that there is a total ordering on how long each connection takes to transmit. This ordering is not constant but changes over time; but it changes relatively slowly, making it possible for your system to constantly recalibrate itself and then only be wrong sometimes. In addition, say that the ordering is not 100% reliable; sometimes the connections transmit significantly faster or slower than you expect (more to the point, express this stuff in terms of pairwise inequalities; if you looked at the entire total ordering, then yes it would change in some respect on almost every tick; but for any pair of connections, the question of which of them is slower is unlikely to change over the timescale of their latency).

I like this because it seems feasible for the brain (short-term and long-term plasticity could do the 'calibration'), and it reminds me of two things:

my memory consistency paper idea, which derived various memory consistency models from fixed relative timing assumptions
quasi-delay-insensitive (QDI) clockless circuits with asymmetric isochronic forks [3]

So, we have a distributed systems model:

set of turing machines
+ lossy channels
+ slowly varying timing models for some channels
+ arbitrary timing for other channels (these channels model over-the-network communication)

---

biologically, concurrency primitives are assessed in terms of 'consensus number', suggesting that the consensus problem is fundamental. Note also the similarity between the 'consensus problem' and EE's "arbiters", demonstrated especially with "sticky bits", a concurrency primitive with infinite consensus number. Winner-take-all networks probably solve this in the brain.

in fact there is a 'univerality of consensus proof' in the original paper (summary at the end of [4])

---

didn't those 'amorphous medium' things have a consensus primitive?

---

" Yet despite their superior power efficiency, GPG- PUs still suffer from inherent inefficiencies of the un- derlying von Neumann execution model. These include, for example, the energy spent on communicating in- termediate values across instructions through a large register file, and the energy spent on managing the in- struction pipeline (fetching, decoding, and scheduling instructions). Recent studies [3, 4] estimate that the pipeline and register file overheads alone account for ∼ 30% of GPGPU power consumption. As an alternative to von Neumann GPGPUs, Voitse- chov and Etsion [5] recently proposed the Single-Graph Multiple-Flows (SGMF) dataflow GPGPU architecture. The SGMF architecture introduces the multithreaded coarse-grained reconfigurable fabric (MT-CGRF) core, which can concurrently execute multiple instances of a control and dataflow graph (CDFG). The MT-CGRF core eliminates the global pipeline overheads and, by di- rectly communicating intermediate values across func- tional units, eliminates the need for a large register file. Furthermore, the underlying dataflow model enables the core to extract more instruction-level parallelism (ILP) by allowing different types of functional units to execute in parallel (e.g., all FP and all LD/ST units). The SGMF architecture, however, is limited by the static mapping of a kernel’s CDFG to the MT-CGRF core and cannot execute large CUDA kernels efficiently. First, the limited capacity of the MT-CGRF core cannot host kernels whose CDFG is larger than the reconfigurable fabric. Second, as all control paths through a kernel’s CDFG are statically mapped to the MT-CGRF, the core’s functional units are underutilized when execut- ing diverging control flows. In this paper, we propose the hybrid dataflow/von Neu- mann vector graph instruction word (VGIW) architec- ture for GPGPUs. Like a dataflow machine, VGIW represents basic blocks as dataflow graphs (i.e., graph instruction words ) and concurrently executes each block for a vector of threads using the MT-CGRF core. Like a von Neumann machine, threads’ control flows determine the scheduling of basic blocks. "

-- Control Flow Coalescing on a Hybrid Dataflow/von Neumann GPGPU

---

could have a 'cellular automata' layout ('fabric') where cores and local memory banks are interspersed, where cores adjacent to a local memory banks can use it, and each core is next to multiple memory banks, and each memory bank is adjacent to multiple cores, and a memory bank acts as 'shared memory', with available RMW and CAS atomic operations to allow cores to synchronize over their local shared memory.

If 2d, the layout could be:

xoxoxo oxoxox xoxoxo

in 3d, the equivalent would have 6 neighbors for each cell.

in 2d, could also do hexagonal, but that can't be symmetric in quite the same way. What you can do with 2d hexagonal is one of:

3 colors (cores, local memory banks, and ?)
4 colors (cores, local memory banks, and ? and ?)
cores, each connected to 6 other cores, with one local memory bank in between each of these connections (so each local memory bank is connected to 2 cores)
or its inverse: memory banks, each connected to 6 other memory banks, with one core in between each of these connections (so each core is connected to 2 local memory banks)
cores, each connected to 6 local memory banks, with each local memory bank connected to 3 cores [5].
or its inverse: memory banks, each connected to 6 cores, with each core connected to 3 local memory banks

if we use the 2d square layout (the one drawn above), then if each core had 32k of non-shared local memory inside it, and each shared local memory bank had 8k, then each core can access its own 32k plus 8*4=32k of shared, for a total of 64k, so in this case we can use 16-bit addressing, with the local memory and all four adjacent banks mapped into a unified memory space for each core. So that's probably what we want. In which case, in terms of words instead of bytes, each core has 16k local words and each shared local memory bank has 4k words. Alternately, though, we could have word-based addressing, in which case we can put a full 32k words per-core local-non-shared, and a full 8k words per-(local-shared-bank), or in other words 64k bytes per-core plus 16k bytes per-bank.

However, in this case, the 64k bytes per-core is now more than our (elsewhere) suggested MCU-derived limit of 32k bytes. So maybe don't do that (or rather, maybe design the software to accomodate that but then leave half of the memory space empty/unavailable in the implementation).

However, this isn't quite satisfactory, for two reasons. The first, and by far most important, is that this doesn't look like a small-world network; in a small world network, most of the connections are dense local connections, like we have here, but every now and then you sprinkle in a long-range connection, which is crucial to the property of being able to get from any node to any other node in surprisingly few hops. Both the brain and human social networks have a 'small world'-style topology.

The second reason this is unsatisfying is that there are only two directions of connectivity, vertical and horizontal. If you route things in 2D and don't allow wires to cross, you can end up with problems (more specifically, a non-planar graph cannot be embedded in a plane without crossings). This second reason isn't really very important, because our cores can be 'routers' facilitating 'dynamic crossings' and ultimately permitting even non-planar communications patterns to exist within this fabric.

The third reason, also pretty unimportant, is that just like CPUs have more L2 cache than L1 cache, it would probably be more useful for the total amount of shared local memory accessible by a CPU node to be larger than the private local memory in that CPU.

The way that the brain and human social networks add the long-range connections is irregular; neither is a regular tiling like cellular automata, and the occasional long-range connections are also irregularly/sorta-randomly placed. But that's too complicated for us to manage/program/manufacture so let's just have a regular tiling with specialized 'routers' in place of long-range connections.

Consider the "p3m1, (*333)"-symmetry colored hexagonal tiling seen in https://en.wikipedia.org/wiki/Hexagonal_tiling#Uniform_colorings ( ) (note that for our purposes this is equivalent to the 3-colored trihexagonal tiling, ). Here we have a regular 2D hexagonal tiling with three colors (red, yellow, blue in the illustration). No node touches anything of the same color; every node has 3 neighbors of each different color (so 6 neighbors total). Although it's tiled in 2D, you can also look at the picture as having 3 'axes'; in the illustration, these are vertical, up-left-to-down-right, and down-left-to-up-right. Along each axis the node colors alternate. If you went along a horizontal axis, the node type would be the same, but these nodes of the same color are not adjacent because other nodes are in between.

Here's an ascii-art illustration of 'p3m1, (*333)' (i think) (interesting that there are seven horizontal spaces there; with less than 7 spaces each 'triangle' of same-colored nodes is not equilateral in the ASCII-art); where does the prime 7 come from in a hexagonal tiling with 3 colors?):

x x x x x x o o o o o

- - - - - x x x x x o o o o o o
- - - -

We could have all of the red nodes be local shared memory banks, all of the yellow nodes be CPUs, and all of the blue nodes be routers. This ratio (1:1:1) probably isn't the right ratio of routers to CPUs to memory, but whatever, we won't know the right ratio until we get further along in designing the system (or possibly even until after we build it and use it in real applications) and we have to start somewhere.

So again we want to fit this into a 64k word address space. Each CPU has 3 memory bank neighbors, plus itself; so if each memory bank has 16k words, and each CPU node has 16k words, then each CPU can access 16*3 + 16 = 64k words total. This even fits without our suggested MCU-derived limit of 32k bytes per node. In this setup the routers and CPUs mostly communicate via the shared local memory banks. An alternative would be for the routers to directly expose some of their memory to adjacent CPUs. In this case you'd probably have the router nodes and memory banks each have 8k words instead of 16, so that the total memory reachable by each CPU is 8*3 + 8*3 + 16 = 64k words total. But i don't like this quite as much because i feel like the memory banks should each be exposing as least as much memory as the CPUs have in their private local memory; after all they are memory specialists and RAM chips today expose a huge blob of memory. So for now we'll stick with the routers not exposing any memory. Perhaps the routers and the CPUs can still communicate when they are adjacent, through some sort of small channel that allows the CPUs to tell the routers when some region within their shared memory contains a packet to be sent, and that allows the router to to tell the CPU when some region within their shared memory contains a packet addressed to that CPU. Note that each adjacent pair of router,CPU are both adjacent to the same (exactly two) memory banks.

This still leaves a lot of design choices in front of us. How 'fat' are the routers and memory banks, that is, how much 'higher-level' functionality do they provide? Since we are imagining that each router and memory bank is in fact its own little microcontroller (or their own process, if this is all running on a GPU), they have the capability to do computation themselves, so they could do quite a lot. Alternately, they could just be 'thin' or 'dumb'.

My guess is that we might want them to do quite a bit ('fat'), which would allow the fabric to provide a variety of VM-like low-level services while still leaving the CPUs mostly free to focus on executing 'userspace code'.

We probably want the memory banks to not only accept 'get' and 'set' commands, but also to arbitrate between the six nodes that can simultaineously access them by implementing atomic/synchronizing operations, eg RMW and CAS. We might even want the memory banks to implement memory allocation, so that when a CPU wants to allocate some memory it asks the memory bank to find it.

We probably want the routers to not only accept packets and pass them on, but also to buffer incoming/outgoing packets into 'mailboxes', to dynamically adapt to traffic conditions, and to have route discovery and exterior gateway protocols.

Can this be moved into 3d? The higher-dimensional analog of 'tiling' or 'tesselation' is called 'honeycomb'. In 3d, there is only one regular honeycomb, the cubic honeycomb. I'm not sure if there is a 3-color uniform coloring of the cubic honeycomb, but if there is it's not listed on https://en.wikipedia.org/wiki/Cubic_honeycomb#Uniform_colorings . However, in 3d there are also 28 'convey uniform honeycombs', which is the (regular) cubic one plus 27 (irregular, i guess) others. See https://en.wikipedia.org/wiki/Convex_uniform_honeycomb . That wikipedia page says that these include "10 prismatic forms based on the uniform plane tilings (11 if including the cubic honeycomb)", so maybe there's some way to do it. I don't understand the rest of that page, but there do seem to be honeycombs in there with 3-colorings, eg "Runcic cubic (ratoh)", and others. In the https://en.wikipedia.org/wiki/Convex_uniform_honeycomb#Prismatic_stacks section, they draw 2d tesselations, and a few of them are drawn with useful 3-colorings: Cubic (each node has 4 neighbors of each color), Truncated/Bitruncated square prismatic (each node has 4 neighbors of each other color), and truncated trihexagonal prismatic/otathaph (each node has 6 neighbors of each other color), and in Scaliform honeycomb, s3{2,6,3} (maybe it's related to 'trihexagonal tiling'?). I don't seem to know enough about this stuff to choose which, if any, of these is good. It seems to me that all the ones listed are just 'lifts' of the 2d situation into 3d; but to make use of 3d what we'd really like is more than the 3 'axes' (vertical, up-left-to-down-right, down-left-to-up-right) that we find in a 2d hexagonal tiling with a uniform 3-coloring. Perhaps this is asking too much (after all since there is 3 axes maybe the simplest thing to do is to just think of the 3-colored hexagonal tiling as just a 2D 'squished' version of some 3-coloring of the 3D cubic honeycomb), or perhaps it is too complicated.

Mb also see https://en.wikipedia.org/wiki/Crystal_system https://en.wikipedia.org/wiki/Space_group https://en.wikipedia.org/wiki/Bravais_lattice https://en.wikipedia.org/wiki/Point_group https://en.wikipedia.org/wiki/Crystal_structure#Classification_by_symmetry

eh, i looked through those briefly but none of them caught my eye and it's way much more stuff than i have time for, given that this is just one of many tasks in a side project of a side project (it's not like i'm going to be manufacturing these anytime soon, and as long as it's all just in software i can easily change the topology later in software). Also, the obvious first things to look for are 3-sided 3D regular polyhedrons, which don't exist, and 6-sided 3D regular polyhedrons, which are just cubes. So the idea (mentioned above) of thinking of the 3-colored hexagonal tiling as just a 2D 'squished' version of some 3-coloring of the 3D cubic honeycomb is probably a good start.

So in 3D we'd have the following 3 planes, stacked:

xo- o-x -xo

o-x -xo xo-

-xo xo- o-x

This almost works but not quite. Consider the center node, which is an 'x'. Above it is an -, to the left is an -, to the right is o, below is o. Negative z-axis is -, positive z-axis is o. So it's not adjacent to any x (diagonals don't count), and it's adjacent to 3 'o's and 3 '-'s. But, do each of the adjacent 'o's share exactly two '-'s with this center 'x'? No, in fact it seems that in a cubic honeycomb no neighbors share adjacencies.

One modification would be to connect diagonals; but now each node would have 26 connections instead of 6, which seems like a lot; also if each memory bank exposed 4k of memory, then since each CPU node would be adjacent to 9 memory banks, that would be 36k, but if each memory bank exposed 8k, then that would be 72k total. But this means that the amount that each memory bank exposes would be much smaller than the private local memory of each CPU node, which may be undesirable, as explained above. I guess if you left out the 'double diagonals' (the 8 corners of the 3x3x3 cube) then you'd only have 18 connections, and only 6 of each kind, but that's still a lot.

There's also non-regular 3D honeycombs, eg https://en.wikipedia.org/wiki/Tetrahedral-octahedral_honeycomb and (mb see also https://en.wikipedia.org/wiki/Quarter_cubic_honeycomb https://en.wikipedia.org/wiki/Convex_uniform_honeycomb ).

Let's see, if you connected the diagonals but left out the 'double diagonals' (the 8 corners of the 3x3x3 cube), then the mask would be:

(this is what you get if you can take 2 steps in any vertical or horizontal directions, except you can't take both of those steps in the same direction).

in the above figure, that's:

 oo-x x

o-x -xo xo-

 xxo- -

as you can see, the 'x' in the middle is connected to exactly 6 of each node type (including 6 other xs).

So here, if each memory bank exposed 8k, then that would be 48k total. Again, the CPU could have 16k private local memory, for a total of 6*8 + 16 = 64k. That doesn't sound so bad (even though, again, the CPU has more private local memory than the amount that each memory bank exposes; but only 2x as much). How are the shared adjacencies? It looks like the 'o' to the right of the middle 'x' shares 4 '-'s with the middle 'x'. So we're good there too. So here, compared to the hexagonal 2D layout, we have 6 neighbors of each kind instead of 3, and 4 shared adjacencies of the 3rd type between any pair of differing types instead of 2, and we have direct adjacencies between nodes of the same type (whereas in 2D we did not have that).

So, it look like we've found our viable 3D 'upgrade' path from a 2D regular hexagonal tesselation 3-coloring. How can we make the design agnostic/easily-upgradable between these? in 2D we had memory banks with 16k memory exposed, and CPUs with 16k private memory; in 3D we have memory banks with 8k memory exposed, and CPUs with 16k private memory. So just restrict the memory bank size to 8k (for now) and we're good; and otoh look for MCUs, GPUs, etc with the assumption that each node may need 16k private memory. And let's assume, as we did elsewhere, that each processor will also need at least 32k program memory.

---

some possibly related stuff

https://en.wikipedia.org/wiki/Manycore_processor
https://en.wikipedia.org/wiki/Massively_parallel
https://en.wikipedia.org/wiki/Transputer
The Connection Machine
Adapteva Epiphany/Parallella
https://en.wikipedia.org/wiki/Blue_Gene
"There is an emerging class of multicore/manycore processors taking the approach of a "network on a chip", such as the CELL processor, adapteva epiphany architecture, Tilera, etc."
"The most powerful supercomputers in the world, based on designs from Columbia University and built as IBM Blue Gene, are real-world incarnations of the transputer dream. They are vast assemblies of identical, relatively low-performance SoC? chips."
possibly Propeller and XMOS (can they connect to neighboring instances?)
- yes XMOS can: https://www.xmos.com/support/boards?product=14756 (64 xCORE Tiles connected as a 4-D hypercube using 5b XMOS links; 512 processors) https://www.google.com/search?q=XK-XMP-64
possibly http://www.jwhanlon.com/files/thesis-print.pdf
https://en.wikipedia.org/wiki/MIMD#Distributed_memory
GA144 (not really that close; it only has 128 words of memory)

---

so i keep saying that GA144 cores ('F18A' i think) have too little memory for what we want. But is this true? I dunno if, say, a neuron has even 4k of memory. And, sure, programming with 128 words per core will be a pain, but isn't the whole point to push us into new programming paradigms. Well,:

We don't know what paradigm we want yet, and we're not sure that it only needs 128 words per core. The range 4k-32k of memory gives us more flexibility.
Even if we had a few paradigms in mind to experiment with, and all of them needed a little less than 128 words per core, it could be that getting close to the limits of how much memory we need like that would force us to think about 'mechanical sympathy' and couple our high-level designs to low-level details, which means when we want to make changes to the high-level design we'd have to rewrite more than if there were high-level low-level decoupling.
Does one neuron only have 128 words of memory? Many neurons have more than 128 synapses, and each synapse has memory. (so should we drop down to one core per synapse? no, we can't afford to buy enough cores in that case; computers are faster than neurons anyways so whatever X unit of description we choose, we expect that the computer implementation will have one core simulating many Xs)
Is one core one neuron? Maybe not; maybe one core is one whole cortical column.
There are 'computer-specific' reasons to want more memory on the core even if the brain-like 'user-written' programs don't need much of it. For example, to have an interpreter in RAM, or routing code in RAM, or debugging/logging stuff in RAM.

---

if you ever come up with something good here mb send to:

http://yosefk.com/blog/the-high-level-cpu-challenge.html

---