proj-oot-lowEndTargets-lowEndTargetsUnsorted2

---

http://www.excamera.com/sphinx/fpga-j1.html

"J1 is a small (200 lines of Verilog) stack-based CPU, intended for FPGAs. A complete J1 with 16Kbytes of RAM fits easily on a small Xilinx FPGA. Some highlights:

        Extremely high code density. A complete system including the TCP/IP stack fits in under 8K bytes.
        Single cycle call, zero cycle return
        Instruction set maps trivially to Forth
        Cross compiler runs on Windows, Mac and Unix
        Basic software includes a sizeable subset of ANS Forth and a portable TCP/IP networking stack.

... The J1 is a simple 16-bit CPU. It has some RAM, a program counter (PC), a data stack and a call/return stack. It has a small set of built-in arithmetic instructions. Fields in the J1 instructions control the arithmetic function, and write the results back to the data stacks. There are more details on instruction coding in the paper. ... The CPU was designed to run Forth programs very efficiently: the machine’s instructions are so close to Forth that there is little benefit to writing code in assembler. Effectively Forth is the assembly language. J1 runs at about 100 Forth MIPS on a typical FPGA. This compares with about 0.1 Forth MIPS for a traditional threaded Forth running on an embedded 8-bit CPU. ... The code that defines the basic Forth operations as J1 instructions is in basewords.fs

The next layer up defines basic operations in terms of these simple words. These include many of the CORE words from the DPANS94 Forth standard. Some of the general facilities provided by nuc.fs:

        byte memory access
        string handling
        double precision (i.e. 32 bit) math
        one’s complement addition
        memory copy and fill
        multiplication and division, fractional arithmetic
        pictured numeric output
        debug words: memory and stack dump, assert

The above files - about 2K of code - bring the J1 to the point where it can start to define application-specific code."
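
A rough sketch of the data stack plus call/return stack arrangement described in the excerpt above, as a toy interpreter in Python. The opcode names (`lit`, `dup`, `add`, `call`, `ret`, `halt`) and the tuple encoding are made up for illustration; they are not the J1's actual 16-bit instruction coding, and the sketch does not model cycle counts.

```python
# Toy two-stack machine. Opcode names and encoding are hypothetical,
# NOT the real J1 instruction format.

def run(program, max_steps=1000):
    data, ret = [], []   # data stack and call/return stack
    pc = 0               # program counter
    for _ in range(max_steps):
        op, arg = program[pc]
        pc += 1
        if op == "lit":          # push a 16-bit literal
            data.append(arg & 0xFFFF)
        elif op == "dup":        # duplicate top of data stack
            data.append(data[-1])
        elif op == "add":        # 16-bit wrapping add of the top two items
            b, a = data.pop(), data.pop()
            data.append((a + b) & 0xFFFF)
        elif op == "call":       # save return address, jump to arg
            ret.append(pc)
            pc = arg
        elif op == "ret":        # pop return address into pc
            pc = ret.pop()
        elif op == "halt":
            return data
        else:
            raise ValueError(f"unknown opcode: {op}")
    raise RuntimeError("step limit exceeded")

# Forth equivalent:   : double dup + ;   21 double
PROGRAM = [
    ("lit", 21),     # 0
    ("call", 3),     # 1: call "double"
    ("halt", None),  # 2
    ("dup", None),   # 3: start of "double"
    ("add", None),   # 4
    ("ret", None),   # 5
]

result = run(PROGRAM)
```

Here each Forth word (`dup`, `+`) corresponds to exactly one toy instruction, which is the sense in which "Forth is the assembly language" on a machine like this.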

"operates reliably at 80 MHz in a Xilinx Spartan-3E FPGA"

---

kragen 104 days ago [-]

I think the GreenArrays F18A cores are similar in transistor count to the 6502, but the instruction set is arguably better, and the logic is asynchronous, leading to lower power consumption and no need for low-skew clock distribution. In 180nm fabrication technology, supposedly, it needs an eighth of a square millimeter (http://www.greenarraychips.com/home/documents/greg/PB003-110...), which makes it almost 4 million square lambdas. If we figure that a transistor is about 30 square lambdas and that wires occupy, say, 75% of the chip, that's about 32000 transistors per core, the majority of which is the RAM and ROM, not the CPU itself; the CPU is probably between 5000 and 10 000 transistors. The 6502 was 4528 transistors: http://www.righto.com/2013/09/intel-x86-documentation-has-mo...
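
The back-of-envelope estimate above can be reproduced in a few lines of Python. The comment does not say what value of lambda it uses; the sketch assumes lambda = 0.18 um (the full 180nm feature size), because that is the value that reproduces "almost 4 million square lambdas". Taking lambda as half of that would quadruple the count. Variable names are my own.

```python
# Reproduce the F18A transistor estimate from the comment above.
core_area_mm2 = 1 / 8            # "an eighth of a square millimeter"
lambda_um = 0.18                 # ASSUMPTION: lambda = 180 nm, not stated in the comment
transistor_sq_lambdas = 30       # "a transistor is about 30 square lambdas"
wire_fraction = 0.75             # "wires occupy, say, 75% of the chip"

core_area_um2 = core_area_mm2 * 1_000_000          # 1 mm^2 = 10^6 um^2
square_lambdas = core_area_um2 / lambda_um ** 2    # roughly 3.86 million
usable_sq_lambdas = square_lambdas * (1 - wire_fraction)
transistors_per_core = usable_sq_lambdas / transistor_sq_lambdas

print(f"{square_lambdas:,.0f} square lambdas")
print(f"{transistors_per_core:,.0f} transistors per core")
```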

The F18A is a very eccentric design, though: it has 18-bit words (and an 18-bit-wide ALU, compared to the 6502's 8, which is a huge benefit for multiplies in particular), with four five-bit instructions per word. You'll note that this means that there are only 32 possible instructions, which take no operands; that is correct. Also you'll note that two bits are missing; only 8 of the 32 instructions are possible in the last instruction slot in a word.
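
A minimal sketch of the slot arithmetic in this paragraph: an 18-bit word holds three 5-bit slots (15 bits), leaving 3 bits for the last slot, hence only 2^3 = 8 encodable instructions there. The field order (first slot in the most significant bits) and the function name `split_slots` are assumptions for illustration; the sketch ignores any further encoding details of the real chip.

```python
WORD_BITS = 18
SLOT_WIDTHS = (5, 5, 5, 3)   # three full slots plus a 3-bit last slot
assert sum(SLOT_WIDTHS) == WORD_BITS

def split_slots(word):
    """Split an 18-bit word into its four slot fields.

    ASSUMPTION: slot 0 occupies the most significant bits.
    """
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("word does not fit in 18 bits")
    slots = []
    shift = WORD_BITS
    for width in SLOT_WIDTHS:
        shift -= width
        slots.append((word >> shift) & ((1 << width) - 1))
    return slots

# Number of distinct values each slot can encode.
encodable = [1 << width for width in SLOT_WIDTHS]   # 32, 32, 32, 8

example = split_slots(0b00001_00010_00011_101)
```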

Depending on how you interpret things, the F18(A) has 20 18-bit registers, arranged as two 8-register cyclic stacks, plus two operand registers which form the top of one of the stacks, a loop register which forms the top of the other, and a read-write register that can be used for memory addressing. (I'm not counting the program counter, write-only B register, etc.)

Each of the 144 F18A cores on the GA144 chip has its own tiny RAM of 64 18-bit words. That, plus its 64-word ROM, holds up to 512 instructions, which isn't big enough to compile a decent-sized C program into; nearly anything you do on it will involve distributing your program across several cores. This means that no existing software or hardware development toolchain can easily be retargeted to it. You can program the 6502 in C, although the performance of the results will often make you sad; you can't really program the GA144 in C, or VHDL, or Verilog.

The GreenArrays team was even smaller than the lean 6502 team. Chuck Moore did pretty much the entire hardware design by himself while he was living in a cabin in the woods, heated by wood he chopped himself, using a CAD system he wrote himself, on an operating system he wrote himself, in a programming language he wrote himself. An awesome feat.

I don't think anybody else in the world is trying to do a practical CPU design that's under 100 000 transistors at this point. DRAM was fast enough to keep up with the 6502, but it isn't fast enough to keep up with modern CPUs, so you need SRAM to hold your working set, at least as cache. That means you need on the order of 10 000 transistors of RAM associated with each CPU core, and probably considerably more if you aren't going to suffer the apparent inconveniences of the F18A's programming model. (Even the "cacheless" Tera MTA had 128 sets of 32 64-bit registers, which works out to 262144 bits of registers, over two orders of magnitude more than the 1152 bits of RAM per F18A core.)
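
The memory figures quoted in this comment can be checked with a little arithmetic. The Python below only uses numbers from the comment itself (64-word RAM and ROM, 18-bit words, four instruction slots per word, 128 sets of 32 64-bit registers); the variable names are my own. The 512-instruction figure is an upper bound, since the last slot of each word can hold only a subset of the instructions.

```python
# F18A per-core memory, from the figures in the comment.
WORD_BITS = 18
RAM_WORDS = 64
ROM_WORDS = 64
SLOTS_PER_WORD = 4

f18a_ram_bits = RAM_WORDS * WORD_BITS                              # 1152
f18a_max_instructions = (RAM_WORDS + ROM_WORDS) * SLOTS_PER_WORD   # 512

# Tera MTA register file: 128 sets of 32 registers of 64 bits each.
mta_register_bits = 128 * 32 * 64                                  # 262144

ratio = mta_register_bits / f18a_ram_bits
print(f"MTA registers are about {ratio:.0f}x the F18A's per-core RAM")
```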

So, if you devote nearly all your transistors to SRAM because you want to be able to recompile existing C code for your CPU, but your CPU is well under 100k transistors like the F18A or the 6502, you're going to end up with an unbalanced design. You're going to wish you'd spent some of those SRAM transistors on multipliers, more registers, wider registers, maybe some pipelining, branch prediction, that kind of thing.

---

	A 32nm 1000-Processor Array

http://vcl.ece.ucdavis.edu/pubs/2016.06.vlsi.symp.kiloCore/2016.vlsi.symp.kiloCore.pdf

"128 x 40-bit local instruction memory ... Processor data memories are implemented as two 128 x 16-bit banks ... Each of the 12 independent memory modules contains a 64KB SRAM, services two neighboring processors"

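To put the quoted memory sizes in bytes, a quick sketch in Python. It assumes 1000 processors (taken from the title), 8-bit bytes, and KB = 1024 bytes; none of the totals below are stated in the excerpt, they are just the products of the quoted figures.

```python
# Memory sizes from the quoted excerpt, converted to bytes.
NUM_PROCESSORS = 1000            # ASSUMPTION: from the title "1000-Processor Array"
INSTR_WORDS, INSTR_BITS = 128, 40
DATA_BANKS, DATA_WORDS, DATA_BITS = 2, 128, 16
NUM_MEMORY_MODULES = 12
MODULE_BYTES = 64 * 1024         # ASSUMPTION: 64KB means 64 * 1024 bytes

instr_bytes_per_proc = INSTR_WORDS * INSTR_BITS // 8              # 640
data_bytes_per_proc = DATA_BANKS * DATA_WORDS * DATA_BITS // 8    # 512

local_bytes_total = NUM_PROCESSORS * (instr_bytes_per_proc + data_bytes_per_proc)
shared_bytes_total = NUM_MEMORY_MODULES * MODULE_BYTES

print(f"per processor: {instr_bytes_per_proc} B instructions, {data_bytes_per_proc} B data")
print(f"whole array: {local_bytes_total} B local, {shared_bytes_total} B shared")
```
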
jacquesm 70 days ago [-]

Another big difference is that most GPU architectures are multi-lane SIMD (so single instructions acting on multiple data but multiple sets of those) whereas the linked architecture is MIMD.

---

Wow, so the ESP32 has much more ROM/Flash and RAM than the previous ESP8266 chip:

"Embedded Memory – 448 KB Internal ROM – 520 KB Internal SRAM – 8 KB RTC FAST Memory – 8 KB RTC SLOW Memory"

StavrosK