proj-oot-lowEndTargets-lowEndTargetsUnsorted2

---

http://www.excamera.com/sphinx/fpga-j1.html

J1 is a small (200 lines of Verilog) stack-based CPU, intended for FPGAs. A complete J1 with 16Kbytes of RAM fits easily on a small Xilinx FPGA. Some highlights:

        Extremely high code density. A complete system including the TCP/IP stack fits in under 8K bytes.
        Single cycle call, zero cycle return
        Instruction set maps trivially to Forth
        Cross compiler runs on Windows, Mac and Unix
        Basic software includes a sizeable subset of ANS Forth and a portable TCP/IP networking stack.

... The J1 is a simple 16-bit CPU. It has some RAM, a program counter (PC), a data stack and a call/return stack. It has a small set of built-in arithmetic instructions. Fields in the J1 instructions control the arithmetic function, and write the results back to the data stacks. There are more details on instruction coding in the paper. ... The CPU was designed to run Forth programs very efficiently: the machine’s instructions are so close to Forth that there is little benefit to writing code in assembler. Effectively Forth is the assembly language. J1 runs at about 100 Forth MIPS on a typical FPGA. This compares with about 0.1 Forth MIPS for a traditional threaded Forth running on an embedded 8-bit CPU. ... The code that defines the basic Forth operations as J1 instructions is in basewords.fs
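To make the "Forth is effectively the assembly language" point concrete, here is a minimal Python sketch of a dual-stack machine in which each Forth word corresponds directly to one primitive operation. This is only an illustration; the op names and the run() helper are mine, and the real J1 packs ALU function, stack deltas, and return bits into 16-bit instruction fields rather than dispatching on strings.

    # Hypothetical illustration of a J1-style dual-stack machine (not the real J1 encoding).
    def run(program, data=None):
        ds, rs, pc = [], [], 0          # data stack, call/return stack, program counter
        mem = data or {}
        while pc < len(program):
            op, arg = program[pc]
            pc += 1
            if op == 'lit':   ds.append(arg)                            # push literal
            elif op == 'dup': ds.append(ds[-1])
            elif op == '+':   ds.append(ds.pop() + ds.pop())
            elif op == '@':   ds.append(mem[ds.pop()])                  # fetch
            elif op == '!':   a, v = ds.pop(), ds.pop(); mem[a] = v     # store ( value addr -- )
            elif op == 'call': rs.append(pc); pc = arg                  # single-cycle call on the J1
            elif op == 'ret':  pc = rs.pop()                            # folded into the previous op ("zero cycle return")
        return ds, mem

    # "2 dup +" -> 4
    print(run([('lit', 2), ('dup', None), ('+', None)]))   # ([4], {})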

The next layer up defines basic operations in terms of these simple words. These include many of the CORE words from the DPANS94 Forth standard. Some of the general facilities provided by nuc.fs

        byte memory access
        string handling
        double precision (i.e. 32 bit) math
        one’s complement addition
        memory copy and fill
        multiplication and division, fractional arithmetic
        pictured numeric output
        debug words: memory and stack dump, assert

The above files - about 2K of code - bring the J1 to the point where it can start to define application-specific code. "

"operates reliably at 80 MHz in a Xilinx Spartan-3E FPGA"

---

kragen 104 days ago [-]

I think the GreenArrays F18A cores are similar in transistor count to the 6502, but the instruction set is arguably better, and the logic is asynchronous, leading to lower power consumption and no need for low-skew clock distribution. In 180nm fabrication technology, supposedly, it needs an eighth of a square millimeter (http://www.greenarraychips.com/home/documents/greg/PB003-110...), which makes it almost 4 million square lambdas. If we figure that a transistor is about 30 square lambdas and that wires occupy, say, 75% of the chip, that's about 32000 transistors per core, the majority of which is the RAM and ROM, not the CPU itself; the CPU is probably between 5000 and 10 000 transistors. The 6502 was 4528 transistors: http://www.righto.com/2013/09/intel-x86-documentation-has-mo...
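A quick back-of-the-envelope check of that estimate (assuming "lambda" here means the 0.18 µm feature size, which is what makes the numbers come out; the variable names are mine):

    # kragen's estimate, redone: 1/8 mm^2 in a 180nm process,
    # ~30 square lambdas per transistor, ~75% of area spent on wiring.
    area_um2 = 0.125 * 1000 * 1000          # 0.125 mm^2 in square microns
    lam = 0.18                              # lambda taken as the feature size, in microns
    sq_lambdas = area_um2 / (lam * lam)     # ~3.9 million square lambdas
    transistors = sq_lambdas * 0.25 / 30    # ~32,000 if wires take 75% of the area
    print(round(sq_lambdas / 1e6, 1), round(transistors))   # 3.9 32150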

The F18A is a very eccentric design, though: it has 18-bit words (and an 18-bit-wide ALU, compared to the 6502's 8, which is a huge benefit for multiplies in particular), with four five-bit instructions per word. You'll note that this means that there are only 32 possible instructions, which take no operands; that is correct. Also you'll note that two bits are missing; only 8 of the 32 instructions are possible in the last instruction slot in a word.

Depending on how you interpret things, the F18(A) has 20 18-bit registers, arranged as two 8-register cyclic stacks, plus two operand registers which form the top of one of the stacks, a loop register which forms the top of the other, and a read-write register that can be used for memory addressing. (I'm not counting the program counter, write-only B register, etc.)

Each of the 144 F18A cores on the GA144 chip has its own tiny RAM of 64 18-bit words. That, plus its 64-word ROM, holds up to 512 instructions, which isn't big enough to compile a decent-sized C program into; nearly anything you do on it will involve distributing your program across several cores. This means that no existing software or hardware development toolchain can easily be retargeted to it. You can program the 6502 in C, although the performance of the results will often make you sad; you can't really program the GA144 in C, or VHDL, or Verilog.

The GreenArrays team was even smaller than the lean 6502 team. Chuck Moore did pretty much the entire hardware design by himself while he was living in a cabin in the woods, heated by wood he chopped himself, using a CAD system he wrote himself, on an operating system he wrote himself, in a programming language he wrote himself. An awesome feat.

I don't think anybody else in the world is trying to do a practical CPU design that's under 100 000 transistors at this point. DRAM was fast enough to keep up with the 6502, but it isn't fast enough to keep up with modern CPUs, so you need SRAM to hold your working set, at least as cache. That means you need on the order of 10 000 transistors of RAM associated with each CPU core, and probably considerably more if you aren't going to suffer the apparent inconveniences of the F18A's programming model. (Even the "cacheless" Tera MTA had 128 sets of 32 64-bit registers, which works out to 262144 bits of registers, over two orders of magnitude more than the 1152 bits of RAM per F18A core.)

So, if you devote nearly all your transistors to SRAM because you want to be able to recompile existing C code for your CPU, but your CPU is well under 100k transistors like the F18A or the 6502, you're going to end up with an unbalanced design. You're going to wish you'd spent some of those SRAM transistors on multipliers, more registers, wider registers, maybe some pipelining, branch prediction, that kind of thing.

---

	A 32nm 1000-Processor Array

http://vcl.ece.ucdavis.edu/pubs/2016.06.vlsi.symp.kiloCore/2016.vlsi.symp.kiloCore.pdf

" 128 x 40 -bit local instruction memory ... Processor data memor ies are implemented as two 128 x 16 -bit banks ... Each of the 12 independent memory module s contain s a 64KB SRAM , service s two neighboring processors "

jacquesm 70 days ago [-]

Another big difference is that most GPU architectures are multi-lane SIMD (single instructions acting on multiple data, but multiple sets of those), whereas the linked architecture is MIMD.

---

Wow, so ESP32 has much more ROM/Flash and RAM memory than previous ESP8266 chip:

"Embedded Memory – 448 KB Internal ROM – 520 KB Internal SRAM – 8 KB RTC FAST Memory – 8 KB RTC SLOW Memory"

reply

StavrosK 17 hours ago [-]

Yeah, ten times more, which is great, but I didn't practically run into limitations of the ESP8266 either. Then again, I'm not representative, but I'm saying that even 40K is plenty for most things.

reply

michaelt 14 hours ago [-]

It turns out 40K isn't much if you want a TLS stack that supports modern standards and keysizes, and interoperates with most servers out there [1]. Once you have 16K transmit and receive buffers (the default fragment length) and enough space to do the math needed for a 4096-bit RSA key, you're pretty much out of RAM.

And that's just the system library for a single TLS connection - the programmer will probably want to use some memory too :)
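A rough budget, just to see why ~40K evaporates (buffer sizes from the comment above; the bignum working-space figure is a guess for illustration, and the variable names are mine):

    # Rough ESP8266 TLS RAM budget (illustrative numbers only).
    heap          = 40 * 1024   # ~40K of usable RAM, per the comment above
    tls_rx_buffer = 16 * 1024   # default TLS record/fragment size
    tls_tx_buffer = 16 * 1024
    rsa_scratch   = 4 * 1024    # guess: temporaries for 4096-bit (512-byte) RSA math
    print(heap - tls_rx_buffer - tls_tx_buffer - rsa_scratch)  # 4096 -> ~4K left for the application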

[1] https://github.com/esp8266/Arduino/issues/1375#issuecomment-... https://github.com/esp8266/Arduino/issues/43#issuecomment-16...

reply

---

mRISCV RISC-V MCU 4k memory "the equivalent of commercial microcontrollers implemented with an ARM M0 core." [1]

---

https://www.parallella.org/2016/10/05/epiphany-v-a-1024-core-64-bit-risc-processor/?

        1024 64-bit RISC processors
        64-bit memory architecture
        64-bit and 32-bit IEEE floating point support
        64 MB of distributed on-chip SRAM
        1024 programmable I/O signals
        Three 136-bit wide 2D mesh NOCs

https://news.ycombinator.com/item?id=12645661

---

https://nextthing.co/pages/chippro#undefined4

1GHz ARMv7-A 256MB/512MB DDR3/SLC NAND $16 ($6 for just the R8 SOC and 256MB DDR3)

---

actually high end but relevant for this list

Rex Neo

observation: "Existing processor architectures were designed in a time where the amount of energy to move data was roughly equal to the amount of energy required to do useful computation with that data. ((but now)) Moving 64 bits from memory takes over 40x more energy than the actual double precision floating point operation being performed with that data." [2]

goal: high performance per watt (they achieved 10x the typical value)

main idea: dispense with caches (cache hierarchy) "The #1 inefficiency in processors today is the standard hardware managed cache hierarchy, which can be removed and save over 50% of die area. " [3]

"Excluding hardware-managed cache hierarchy enables 5x higher memory energy efficiency" [4]

" A short history of caches “You(can’t(fake(memory(bandwidth(that(isn’t(there”(–Seymour(Cray( • Caches are great... • Hide latency due to slow main memory • CDC 6600 (1964) with instruc9on stack through the ini9al Motorola 68k (1982) instruc9on caches. It only started to get bad when chip designers had a bunch of “free” transistors.

• Except kills power efficiency when implemen9ng virtual memory • IBM System 360 Model 67 (1967) was the first to implement Virtual Memory with what we would now call a MMU (at that point a “Dynamic Transla9on Box”) • Addi9onal logic for TLBs and other logic on a modern chip uses 30%K50% of the die area. • Virtual(Memory(transla%on(and(paging(are(two(of(the(worst(decisions(in(compu%ng(history( • Adds latency and power usage for making things a bit easier for a programmer... It may be “Beger” programming, but it is not “Faster” or “Cheaper”.

Chip designers got lazy with VLSI • Intel i286 (1982) was one of the first commercial chips to implement memory protec9on, and expanded upon the memory segmenta9on of the 8086. • Intel i386 (1986) was the first to implement an external cache (16 to 64K), but it was not replica9ng anything higher up in the hierarchy. • The 486 (1989) the first chip to implement an 8K on die cache. • The Pen9um and Pen9um Pro con9nued this trend, implemen9ng a managed cache hierarchy followed by a second level of cache. • Introduc9on of hardware managed caching is what I consider “The beginning of the end”, in which addi9onal hardware complexity was acceptable due to not caring how many transistors were used, as they were get faster, lower power, and cheaper every year. • This does not work well once the power wall is reached, and completely breaks down with the end of Moore’s law. " [5]

manycore MIMD; supports many programming models, including Actor, CSP, PGAS, SHMEM, and Systolic Array/Dataflow.

ISA: 98 RISC-like fixed-length 64-bit instructions per functional unit, packed into a 256-bit VLIW.

btw they used Chisel [6]

---

http://www.kalrayinc.com/kalray/products/ manycore

---

HTTP cookie must be 4k or less

---

" Ajay Sekar ... 128-bit

    ...
    Universally Unique Identifiers (UUID) consist of a 128-bit value.
    IPv6 routes computer network traffic amongst a 128-bit range of addresses.
    ZFS is a 128-bit file system.
    GPU chips commonly move data across a 128-bit bus.[1]
    128 bits is a common key size for symmetric ciphers and a common block size for block ciphers in cryptography.
    128-bit processors could be used for addressing directly up to 2^128 (over 3.40 × 10^38) bytes, which would greatly exceed the total data stored on Earth as of 2010, which has been estimated to be around 1.2 zettabytes (1.42 × 10^21 bytes).[2]
    ...
    The AS/400 virtual instruction set defines all pointers as 128-bit. This gets translated to the hardware's real instruction set as required, allowing the underlying hardware to change without needing to recompile the software. Past hardware was 48-bit CISC, while current hardware is 64-bit PowerPC. Because pointers are defined to be 128-bit, future hardware may be 128-bit without software incompatibility.
    ...."

---

list of CPUs in spacecraft:

http://www.cpushack.com/space-craft-cpu.html

some common or notable ones seem to be:

    https://en.wikipedia.org/wiki/MIL-STD-1750A ("a MIL-STD 16 bit non-RISC CPU." "Before the RAD family of 32 bit CPUs were used in space missions, the MIL-STD-1750A (a CPU that could run modern applications) saw substantial use." spec: https://stellar.cleanscape.net/stdprod/xtc1750a/resources/research.html )
    https://en.wikipedia.org/wiki/Intel_MCS-51
    https://en.wikipedia.org/wiki/Intel_80386
    https://en.wikipedia.org/wiki/NSSC-1 ("NASA Standard Spacecraft Computer" "Since the arrival of the IBM RAD6000 in the 2000s and the RAD750 in the 2010s, using the NSSC-1 has become unthinkable.")
    https://en.wikipedia.org/wiki/IBM_RAD6000 (32-bit; "In addition to the CPU itself, the RAD6000 has 128 MB of ECC RAM." (POWER ISA (POWER1?)); POWER1 has an 8k icache and a 32k dcache)
    in the past: https://en.wikipedia.org/wiki/RCA_1802 ("In the 1980s the RCA 1802 was used for many missions—like Galileo." http://wayback.archive.org/web/20080804225228/http://www.baesystems.com/BAEProd/groups/public/documents/bae_publication/bae_pdf_eis_sfrwre.pdf "8k internal cache"; in that pamphlet, for "Dual/Triple Redundant Computer Subsystem", 4MB RAM)

"As you can see a WIDE variety of chips are used in space. Today most are progressing towards 32-bit CPUs with memory management"

https://en.wikipedia.org/wiki/RAD750 powerPC, L1 cache 32 KB instruction + 32 KB data

---

in one table, https://www.lume.ufrgs.br/bitstream/handle/10183/114607/000955637.pdf?sequence=1 speaks of "icache and dcache sizes (32KB and 64KB, respectively)"

---

random interesting article: "The Transactional HW/SW Stack for Fault Tolerant Embedded Computing" https://www.lume.ufrgs.br/bitstream/handle/10183/114607/000955637.pdf?sequence=1

---

a comparison of RISC-V, ARM, OpenRISC, and ForwardCom:

http://www.forwardcom.info/comparison.html

---

jnwatson 6 hours ago [-]

If I had to guess, this looks to be a strong candidate for the embedded OS market. There are still lots of folks running VxWorks, QNX, ThreadX, Mentor Graphics' Nucleus, Green Hills' Integrity.

In fact, it looks a lot like the same general design as Integrity, a microkernel capability-based architecture with as much as possible in user space.

reply

---

http://www.clash-lang.org/

chisel

---

Lattice iCE40 LP1K low-power FPGA is in Samsung Galaxy S5 phone: http://www.latticesemi.com/Products/FPGAandCPLD/iCE40.aspx

" ...programmable logic cells... ...Available in three series with LUTs ranging from 384 to 7680: Low power (LP)...(so we probably have 384 or 1000 logic cells here? see below)...Up to 128 Kbits sysMEM™ Embedded Block RAM (128kbits = 16k bytes) "

http://www.latticesemi.com/~/media/LatticeSemi/Documents/DataSheets/iCE/iCE40LPHXFamilyDataSheet.pdf : LP1k has

" Each Logic Cell includes three primary logic elements shown in Figure 2-2. • A four-input Look-Up Table (LUT4) builds any combinational logic function, of any complexity, requiring up to four inputs. Similarly, the LUT4 element behaves as a 16x1 Read-Only Memory (ROM). Combine and cascade multiple LUT4s to create wider logic functions. • A ‘D’-style Flip-Flop (DFF), with an optional clock-enable and reset control input, builds sequential logic functions. Each DFF also connects to a global reset signal that is automatically asserted immediately following device configuration. • Carry Logic boosts the logic efficiency and performance of arithmetic functions, including adders, subtractors, comparators, binary counters and some wide, cascaded logic functions. "

" iCE65 and iCE40 devices are constructed as an array of programmable logic blocks (PLBs), where a PLB is a block of eight logic cells. Each logic cell consists of a four-input lookup table (sometimes called a 4-LUT or LUT4) with the output connected to a D flip-flop (a 1-bit storage element). Within a PLB, each logic cell is connected to the following and preceding cell by carry logic, intended to improve the performance of constructs such as adders and subtractors. Interspersed with PLBs are blocks of RAM, each four kilobits in size. The number of RAM blocks varies depending on the device.[

Compared to LUT6-based architectures (such as Xilinx 7-series devices and Altera Stratix devices), a LUT4-based device is unable to implement as-complex logic functions with the same number of logic cells. For example, a logic function with seven inputs could be implemented in eight LUT4s or two LUT6s.
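A LUT4 really is just a 16-entry truth table addressed by its four inputs. A minimal Python sketch of the "LUT as a 16x1 ROM" view (the make_lut4 helper and the full-adder example are mine, purely illustrative):

    # A 4-input LUT is a 16x1 ROM: the four inputs form the address,
    # the configuration bitstream supplies the 16 stored bits.
    def make_lut4(fn):
        table = [fn((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) & 1
                 for i in range(16)]
        return lambda a, b, c, d: table[(a << 3) | (b << 2) | (c << 1) | d]

    # Example: configure one LUT4 as a 1-bit full-adder sum (a XOR b XOR cin), ignoring input d.
    sum_lut = make_lut4(lambda a, b, c, d: a ^ b ^ c)
    print(sum_lut(1, 0, 0, 0))   # 1
    print(sum_lut(1, 1, 1, 0))   # 1

    # A 7-input truth table has 2**7 = 128 entries, far more than one LUT4's 16,
    # hence the eight-LUT4 vs. two-LUT6 comparison quoted above.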

In December 2015, at 32C3,[29] a toolchain consisting of Yosys (Verilog synthesis frontend), Arachne-pnr (place and route and bitstream generation), and icepack (plain text-to-binary bitstream conversion) tools was presented by Clifford Wolf, one of the two developers (along with Mathias Lasser) of the toolchain. The toolchain is notable for being one of, if not the only, fully open-source toolchains for FPGA development. At the same December 2015 presentation, Wolf also demonstrated a RISC-V SoC design built using the open-source toolchain and running on an iCE40 HX8K device. As of April 2016, the toolchain supports iCE40 LP1K, LP4K, LP8K, and HX devices.[30] " -- https://en.wikipedia.org/wiki/ICE_(FPGA)

these things are about $4 [7]

---

thesz 18 hours ago [-]

Tell me more about leakage.

NVidia managed to get it right about year and half ago. Before that their gates leaked power all over the place.

The LUTs on Stratix are 6-to-2, with specialized adders, they aren't at all that 4-LUTs you are describing here.

All in all, there are places where FPGAs can beat ASICs. One example is complex algorithms like, say, ticker correlations. These are done using dedicated memory (thus aren't all that CPU friendly - caches aren't enough) and logic and change often enough to make use of ASIC moot.

Another example is parsing network traffic (deep packet inspection). The algorithms in this field utilize memory in interesting ways (compute lot of different statistics for a packet and then compute KL divergence between reference model and your result to see the actual packet type - histograms created in random manner and then scanned linearly, all in parallel). GPUs and/or CPUs just do not have that functionality.

reply

---

www.pynq.io

thesz 18 hours ago [-]

Exactly.

There is an algorithm for detecting bursts of unusual activity which is extremely well suited for FPGAs: http://www.cs.nyu.edu/cs/faculty/shasha/papers/burst.d/burst...

Basically it is a tree of registers with checks. It can compute burst position and duration in O(log(window size)) time (clocks). You're looking at 2.5..5ns (200MHz..400MHz) multiplied by log2(window size) - 25..50 ns for a window with 1024 samples. You just cannot get that kind of connectivity with a CPU/GPU. Processing these samples on a CPU will get you into several hundreds of ns, if not more.
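A much-simplified software sketch of the idea: a pyramid of dyadic window sums, checked level by level. The burst_pyramid helper and the threshold scheme are mine; the paper's actual aggregation-tree structure and the FPGA register tree are more refined, so treat this only as an illustration of where the O(log window) depth comes from.

    import numpy as np

    def burst_pyramid(samples, thresholds):
        """Level k holds sums over windows of 2**k samples; in hardware each level
        would be one row of registers + adders, so all levels update in parallel."""
        levels = [np.asarray(samples, dtype=float)]
        while len(levels[-1]) > 1:
            prev = levels[-1]
            n = len(prev) // 2
            levels.append(prev[:2 * n].reshape(n, 2).sum(axis=1))   # pairwise sums
        bursts = []
        for k, level in enumerate(levels):
            thr = thresholds.get(2 ** k)
            if thr is None:
                continue
            for i, s in enumerate(level):
                if s > thr:   # window [i*2**k, (i+1)*2**k) is unusually active
                    bursts.append((2 ** k, i * 2 ** k, float(s)))
        return bursts

    x = np.zeros(1024); x[96:104] = 5.0            # a short burst of activity
    print(burst_pyramid(x, {8: 20.0, 64: 100.0}))  # [(8, 96, 40.0)]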

reply

 alain94040 18 hours ago [-]

They are better than GPUs at machine learning inference, so there's that. Ask those guys for some benchmark results, you'll be impressed: http://mipsology.com

reply

vidarh 9 hours ago [-]

Heat (and as a result: size), power generation and cost.

There are plenty of problems where, sure, you could get an x86 or a powerful GPU and do it faster, but you'll be paying for it with power usage.

E.g. take [1] - they're making a modern M68k compatible CPU in an FPGA. They're getting performance that's beating ASIC ColdFire CPUs (Motorola/Freescale's M68k descendant) and ASIC PPCs with several times the clock, and beating the fastest "real" M68k systems by a factor of 3x-10x.

You could beat that with a software emulator on a powerful enough x86, sure. Easily. Especially if investing enough time in dynamic translation and the like.

But this thing instead fits on a far less power-hungry FPGA that gives off far less heat, and fits on a board that'll fit in one of the really small-case Amigas - try to do that with an x86 with a heat sink.

[1] http://www.apollo-core.com/

reply

makomk 11 minutes ago [-]

The Apollo core only has a market because Amiga fans are willing to pay a premium in price, complexity, heat and reduced performance to use something other than ARM or x86. I wouldn't be surprised if software emulation on a low-power, cheap ARM core could beat it; JIT on a powerful x86 is apparently 5-7x faster. It's also the exact opposite of open source, being a proprietary core tied to single-source commercial boards.

reply

rubenfiszel 17 hours ago [-]

For anyone interested in developing applications for FPGAs in a high-level DSL embedded in Scala, this project (https://github.com/stanford-ppl/spatial-lang) from a Stanford lab might interest you.

Disclaimer: I am part of the lab.

reply

spamizbad 16 hours ago [-]

I noticed there are a lot of RISC-V cores today implemented in Berkeley's Chisel (https://chisel.eecs.berkeley.edu/). How would you say they compare?

reply

rubenfiszel 15 hours ago [-]

Actually, Chisel is one of our main codegen targets. We aim to be more high-level than Chisel.

See here for a quick and very incomplete tour: http://spatial-lang.readthedocs.io/en/latest/tutorial.html

reply

---

((in a discussion about FPGAs))

bsder 11 hours ago [-]

Personally, I'd rather have the ability to create a chip for $5K.

It's too stupidly expensive for the CAD tools (>$100K) when a wafer run is less than $20K for a very old process nowadays.

reply

smaddox 2 hours ago [-]

$5K wouldn't cover a single photomask, unless you're talking VERY OLD process.

Maybe there are direct-write (laser or ebeam) litho foundries... I don't know.

reply

neurotech1 1 hour ago [-]

Multi-Project Wafer [0] services like MOSIS [1] reduce the costs significantly. Not sure what the lowest practical budget for a project is, but $5k would be in the ballpark. Access to student licenses for design software is possible, too.

[0] https://en.wikipedia.org/wiki/Multi-project_wafer_service

[1] https://www.mosis.com

reply

petra 6 hours ago [-]

What about structured ASIC companies like eASIC and BaySand? Don't they have a flow from FPGA to a 65nm/45nm structured ASIC, starting at $70K?

reply

---

http://papilio.cc/

---

BCM4339 wifi chipset: ARM Cortex R4, 640KB of ROM and 768KB of RAM

---

" uses a matrix as a primitive instead of a vector or scalar ... The DRAM on the TPU is operated as one unit in parallel because of the need to fetch so many weights to feed to the matrix multiplication unit (on the order of 64,000 for a sense of throughput). We ... There are two memories for the TPU; an external DRAM that is used for parameters in the model. Those come in, are loaded into the matrix multiply unit from the top. And at the same time, it is possible to load activations (or output from the “neurons”) from the left. Those go into the matrix unit in a systolic manner to generate the matrix multiplies—and it can do 64,000 of these accumulates per cycle.” ... systolic data flow engine, which is a 256×256 array. When the activations (weights) come in as seen here, there is what is best described as two-dimensional pipeline where everything shifts by a single step, gets multiplied by the weights in the cell, then those weights move down one cell every cycle. Jouppi admits this image doesn’t highlight this step by step clearly, but this systolic step is important. “Neural network models consist of matrix multiplies of various sizes—that’s what forms a fully connected layer, or in a CNN, it tends to be smaller matrix multiplies. This architecture is about doing those things—when you’ve accumulated all the partial sums and are outputting from the accumulators, everything goes through this activation pipeline. The non-linearity is what makes it neural network even if it’s mostly linear algebra,” ... The TPU, by comparison, used 8-bit integer math ... 28 MiB? on-chip memory " [8]

" The TPU’s deterministic execution model is a better match to the 99th-percentile

response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs

(caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more

than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big

memory, the TPU is relatively small and low power. ... Neural networks (NN) target brain-like functionality and are based on a simple artificial neuron: a nonlinear function

(such as max(0, value)) of a weighted sum of the inputs. These artificial neurons are collected into layers, with the

outputs of one layer becoming the inputs of the next one in the sequence. The “deep” part of DNN comes from going beyond

a few layers, as the large data sets in the cloud allowed more accurate models to be built by using extra and larger layers to

capture higher levels of patterns or concepts, and GPUs provided enough computing to develop them. ... Virtually all training today is in floating point, which is one reason GPUs have been so popular. A step called quantization

transforms floating-point numbers into narrow integers—often just 8 bits—which are usually good enough for inference.

Eight-bit integer multiplies can be 6X less energy and 6X less area than IEEE 754 16-bit floating-point multiplies, and the

advantage for integer addition is 13X in energy and 38X in area [Dal16]. ... Three kinds of NNs are popular today:

1. Multi-Layer Perceptrons (MLP): Each new layer is nonlinear functions of weighted sum of all outputs (fully

connected) from a prior one, which reuses the weights.

2. Convolutional Neural Networks (CNN): Each ensuing layer is a set of of nonlinear functions of weighted sums of

spatially nearby subsets of outputs from the prior layer, which also reuses the weights.

3. Recurrent Neural Networks (RNN): Each subsequent layer is a collection of nonlinear functions of weighted sums of

outputs and the previous state. The most popular RNN is Long Short-Term Memory (LSTM). The art of the LSTM is

in deciding what to forget and what to pass on as state to the next layer. The weights are reused across time steps. ... Table 1 shows two examples of each of the three types of NNs—which represent 95% of NN inference workload in our

datacenters—that we use as benchmarks. Typically written in TensorFlow? [Aba16], they are surprisingly short: just 100 to

1500 lines of code. Our benchmarks are small pieces of larger applications that run on the host server, which can be thousands

to millions of lines of C++ code. The applications are typically user-facing, which leads to rigid response-time limits.

Each model needs between 5M and 100M weights (9th column of Table 1), which can take a lot of time and energy to

access. To amortize the access costs, the same weights are reused across a batch of independent examples during inference or

training, which improves performance. ... While most architects have been accelerating CNNs, they represent just 5% of our datacenter workload. ... Name LOC Layers(FC Conv Vector Pool Total) (Nonlinear function) Weights (TPU Ops /Weight Byte) (TPU Batch Size) (% of Deployed TPUs in July 2016)

61%: MLP0 100 (FC: 5 total:5) ReLU? 20M 200 200 MLP1 1000 (FC:4 total:4) ReLU? 5M 168 168

29%: LSTM0 1000 (FC:24 vector:34 total:58) sigmoid, tanh 52M 64 64 LSTM1 1500 (FC:37 vector:19 total:56) sigmoid, tanh 34M 96 96

5: CNN0 1000 (conv:16 total:16) ReLU? 8M 2888 8 CNN1 1000 (fc:4 conv:72 pool:13 total:89) ReLU? 100M 1750 32

Table 1. ​Six NN applications (two per NN type) that represent 95% of the TPU’s workload. The columns are the NN name; the number of

lines of code; the types and number of layers in the NN (FC is fully connected, Conv is convolution, Vector is self-explanatory, Pool is

pooling, which does nonlinear downsizing on the TPU; and TPU application popularity in July 2016. One DNN is RankBrain? [Cla15]; one

LSTM is a subset of GNM Translate [Wu16]; one CNN is Inception; and the other CNN is DeepMind? AlphaGo? [Sil16][Jou15].

...

unified buffer for local activations: 96k x 256 x 8bits = 24 MiB? (29% of chip area) matrix multipy unit: 256 x 256 x 8bits = 64k MAC (24% of chip area) accumulators: 4k x 256 x 32bit = 4 MiB? (6% of chip area)

... the Matrix Multiply Unit is the heart of the TPU. It contains 256x256 MACs that can perform 8-bit multiply-and-adds on signed or unsigned integers. The 16-bit products are collected in the 4 MiB? of 32-bit Accumulators below the matrix unit. The 4MiB? represents 4096, 256-element, 32-bit accumulators. The matrix unit produces one 256-element partial sum per clock cycle. ... When using a mix of 8-bit weights and 16-bit activations (or vice versa), the Matrix Unit computes at half-speed, and it

computes at a quarter-speed when both are 16 bits. It reads and writes 256 values per clock cycle and can perform either a

matrix multiply or a convolution. The matrix unit holds one 64KiB? tile of weights plus one for double-buffering (to hide the

256 cycles it takes to shift a tile in). This unit is designed for dense matrices. Sparse architectural support was omitted for

time-to-deploy reasons. Sparsity will have high priority in future designs. ... The weights for the matrix unit are staged through an on-chip Weight FIFO that reads from an off-chip 8 GiB? DRAM

called Weight Memory (for inference, weights are read-only; 8 GiB? supports many simultaneously active models). The weight

FIFO is four tiles deep. The intermediate results are held in the 24 MiB? on-chip Unified Buf er, which can serve as inputs to

the Matrix Unit. A programmable DMA controller transfers data to or from CPU Host memory and the Unified Buffer. ... ​Systolic data flow of the Matrix Multiply Unit. Software has the illusion that each 256B input is read at once, and they instantly update one location of each of 256 accumulator RAMs.

" [9]

the ISA has only a few instructions and no control instructions (the host sends the instructions one by one). The only notable instruction is the Generalized Matrix Multiply. They say that it can also do convolutions, but i assume this is just via a special case of matrix multiplication(?)

" CNAPS chips

contained a 64 SIMD array of 16-bit by 8-bit multipliers, and several CNAPS chips could be connected together with a

sequencer [Ham90]. The Synapse-1 system was based on a custom systolic multiply-accumulate chip called the MA-16,

which performed sixteen 16-bit multiplies at a time [Ram91]. The system concatenated several MA-16 chips together and had

custom hardware to do activation functions.

Twenty-five SPERT-II workstations, accelerated by the T0 custom ASIC, were deployed starting in 1995 to do both NN

training and inference for speech recognition [Asa98]. The 40-Mhz T0 added vector instructions to the MIPS instruction set

architecture. The eight-lane vector unit could produce up to sixteen 32-bit arithmetic results per clock cycle based on 8-bit and

16-bit inputs, making it 25 times faster at inference and 20 times faster at training than a SPARC-20 workstation. They found

that 16 bits were insufficient for training, so they used two 16-bit words instead, which doubled training time. ... The more recent DianNao? family...use 16-bit integer operations...The original

DianNao? uses an array of 64 16-bit integer multiply-accumulate units with 44 KB of on-chip memory...one successor DaDianNao? (“big computer”) includes eDRAM to keep 36 MiB? of weights on chip [Che14b]. The goal was to

have enough memory in a multichip system to avoid external DRAM accesses. The follow-on PuDianNao? (“general

computer”) is aimed at more traditional machine learning algorithms beyond DNNs, such as support vector machines [Liu15].

Another offshoot is ShiDianNao? (“vision computer”) aimed at CNNs, which avoids DRAM accesses by connecting the

accelerator directly to the sensor ... The Convolution Engine is also focused on CNNs for image processing [Qad13]. This design deploys 64 10-bit

multiply-accumulator units and customizes a Tensilica processor estimated to run at 800 MHz ... Catapult is a TPU contemporary since it deployed 28-nm Stratix V FPGAs into

datacenters concurrently with the TPU in 2015. Catapult has a 200 MHz clock, 3,926 18-bit MACs, 5 MiB? of on-chip

memory,...The TPU has a 700 MHz clock, 65,536 8-bit MACs, 28 MiB?,...Catapult V1 runs CNNs—using a systolic matrix multiplier—2.3X. ... Cnvlutin [Alb16] avoids multiplications when an activation input is

zero—which it is 44% of the time, presumably in part due to ReLU? nonlinear function that transforms negative values to

zero—to improve performance by an average 1.4 times. ... By tailoring an instruction set to DNNs,

Cambricon reduces code size [Liu16]. ...

The TPU leverages the order-of-magnitude reduction in energy and area of 8-bit integer systolic matrix multipliers over

32-bit floating-point datapaths of a K80 GPU to pack 25 times as many MACs (65,536 8-bit vs. 2,496 32-bit) and 3.5 times

the on-chip memory (28 MiB? vs. 8 MiB?) while using less than half the power of the K80 in a relatively small die. ... In summary, the TPU succeeded because of the large—but not too large—matrix multipy unit; the substantial software- controlled on-chip memory; the ability to run whole inference models to reduce dependence on host CPU; a single-threaded,

deterministic execution model that proved to be a good match to 99th-percentile response time limits; enough flexibility to

match the NNs of 2017 as well as of 2013; the omission of general-purpose features that enabled a small and low power die

despite the larger datapath and memory; the use of 8-bit integers by the quantized applications; and that applications were

written using TensorFlow?, which made it easy to port them to the TPU at high-performance rather than them having to be

rewritten to run well on the very different TPU hardware.

"

sounds 2 days ago

on: An In-Depth Look at Google's Tensor Processing Uni...

I think something that "goes without saying" is that their first-rev design has some essential simplicity to it:

This allows them to get through validation and tapeout very quickly.

cr0sh 2 days ago

on: An In-Depth Look at Google's Tensor Processing Uni...

This appears to be a "scaled up" (as in number of cells in the array) and "scaled down" (as in die size) as the old systolic array processors (going back quite a ways - 1980s and probably further).

As an example, the ALVINN self-driving vehicle used several such arrays for its on-board processing.

I'm not absolutely certain that this is the same, but it has the "smell" of it.

---

https://github.com/google/gemmlowp "gemmlowp: a small self-contained low-precision GEMM library" "This is not a full linear algebra library, only a GEMM library: it only does general matrix multiplication ("GEMM")."

throwaway71958 1 day ago [-]

Note however, that on Intel it's actually slower than a run-of-the-mill float32 linear algebra library like Eigen or OpenBLAS. Its main forte seems to be ARM.

reply

mtgx 2 days ago

on: An In-Depth Look at Google's Tensor Processing Uni...

Nvidia has shipped a couple of INT8 inference cards as well:

http://www.anandtech.com/show/10675/nvidia-announces-tesla-p40-tesla-p4

pklausler 2 days ago

on: An In-Depth Look at Google's Tensor Processing Uni...

Indeed, any 8-bit x 8-bit function with an 8-bit result is just a 64 KiB look-up table.
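That point, spelled out: 256 x 256 one-byte entries = 65536 bytes = 64 KiB. The saturating-multiply function below is an arbitrary example of my own choosing, just to have something to tabulate.

    import numpy as np

    # Precompute an arbitrary 8-bit x 8-bit -> 8-bit function as a 64 KiB table.
    def saturating_mul_u8(a, b):
        return min(a * b, 255)

    table = np.array([[saturating_mul_u8(a, b) for b in range(256)]
                      for a in range(256)], dtype=np.uint8)
    print(table.nbytes)                   # 65536 bytes = 64 KiB
    print(table[100, 2], table[100, 3])   # 200 255 (the second one saturated)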

deepnotderp 2 days ago

on: An In-Depth Look at Google's Tensor Processing Uni...

Inference only, 8-bit integer won't work for training without bad accuracy degeneration.

shepardrtc 2 days ago

on: An In-Depth Look at Google's Tensor Processing Uni...

Here's a paper they published a little while ago about limited numerical precision and deep learning:

https://arxiv.org/abs/1502.02551

" Training of large-scale deep neural networks is often constrained by the available computational resources. We study the effect of limited precision data representation and computation on neural network training. Within the context of low-precision fixed-point computations, we observe the rounding scheme to play a crucial role in determining the network's behavior during training. Our results show that deep networks can be trained using only 16-bit wide fixed-point number representation when using stochastic rounding, and incur little to no degradation in the classification accuracy. We also demonstrate an energy-efficient hardware accelerator that implements low-precision fixed-point arithmetic with stochastic rounding. "

nickpsecurity 2 days ago

on: An In-Depth Look at Google's Tensor Processing Uni...

Regarding 8-bit numbers, here's a thread on why 8 bits are enough and an old product that used that to good effect:

https://news.ycombinator.com/item?id=10244398

http://www.eetimes.com/document.asp?doc_id=1140287

It's something that keeps getting rediscovered. I know the embedded industry shoehorns all kinds of problems into 8- and 16-bitters. Some even use 4-bit MCUs. Might be worthwhile if someone does a survey of all the things you can handle easily or without too much work in 8-16-bit cores. This might help for people building systems out of existing parts or people trying to design heterogeneous SoCs.

https://news.ycombinator.com/item?id=10244398 Why Are Eight Bits Enough for Deep Neural Networks? (petewarden.com) (see below for notes)

http://www.eetimes.com/document.asp?doc_id=1140287 Startup implements silicon neural net in Learning Processor "an array of 256 simple 8-bit RISC processors connected in parallel on a single chip."

" Axeon claims to have achieved that by dividing the generalized neural network into a hierarchy of modules. As a result, the computational and communication overheads are limited to the size of the neural-network module. At the same time, the neural network's performance can be scaled with the absolute number of modules, and therefore neurons, combined in the hierarchy.

Lightowler has further enhanced the system's parallelism and density by replacing the multiplier, which is expensive in terms of die area, with a simpler arithmetic shifter, providing processors with local memory, arranging them in a single-instruction multiple-data stream array and creating an external module controller. The module controller calculates and stores global information, passes data and instructions to the array, and passes I/O to the external world asynchronously.

Simulations show that Axeon's Learning Processor should be capable of 2.4 giga connections per second when running at 100 MHz with an average training time of 0.45 seconds.

"The prototype has a separate controller implemented as an FPGA, but in subsequent iterations that will be integrated into the ASIC," said Grant. "

---

zackmorris 2 days ago

on: An In-Depth Look at Google's Tensor Processing Uni...

While this is interesting for TensorFlow, I think that it will not result in more than an evolutionary step forward in AI. The reason being that the single greatest performance boost for computing in recent memory was the data locality metaphor used by MapReduce. It lets us get around CPU manufacturers sitting on their hands and the fact that memory just isn’t going to get substantially faster.

I'd much rather see a general purpose CPU that uses something like an array of many hundreds or thousands of fixed-point ALUs with local high speed ram for each core on-chip. Then program it in a parallel/matrix language like Octave or as a hybrid with the actor model from Erlang/Go. Basically give the developer full control over instructions and let the compiler and hardware perform those operations on many pieces of data at once. Like SIMD or VLIW without the pedantry and limitations of those instruction sets.

---

saosebastiao 2 days ago

on: An In-Depth Look at Google's Tensor Processing Uni...

Are people really using models so big and complex that the parameter space couldn't fit into an on-die cache? A fairly simple 8MB cache can give you 1,000,000 doubles for your parameter space, and it would allow you to get rid of an entire DRAM interface. It's a serious question, as I've never done any real deep learning...but coming from a world where I once scoffed at a random forest model with 80 parameters, it just seems absurd.

mattnewton 2 days ago [-]

Yes. Each layer can have millions of parameters if your data set is large enough.

Convolutional networks easily get up there, especially if you add a third dimension that the network can travel across (either space in 3D convnets for medical scans, or time for videos in some experimental architectures). Say you want to look at a heart in a 3D convnet, that could easily be 512x512x512 for the input alone.

In fully connected models, for training efficiency, many features are implemented as one-hot encoded parameters, which turns a single category like "state" into 50 parameters. I think there is some active research into sparse representations of this with the same efficiency but I've never seen those techniques, just people piling on more parameters.

reply

vincentchu 2 days ago | on: An In-Depth Look at Google's Tensor Processing Uni...

The latest deep learning models are indeed quite large. For comparison, Inception clocks in at "only" 5M parameters, itself a 12x reduction over AlexNet (60M) and VGGNet (180M)! (source: https://arxiv.org/abs/1512.00567)

A further point is that even if the model has relatively few parameters, there are advantages to having more memory--- namely, you can do inference on larger batch sizes in one go.

---

https://petewarden.com/2015/05/23/why-are-eight-bits-enough-for-deep-neural-networks/ https://news.ycombinator.com/item?id=10244398 Why Are Eight Bits Enough for Deep Neural Networks? (petewarden.com) (see below for notes)

LoSboccacc 566 days ago [-]

did my thesis on this topic (at that time we were searching the lower bound of ALU needed to have them running in zero power devices)

it's interesting: NNs degrade at about 6 bits, and that's mostly because the transfer function becomes stable and the training gets stuck more often in local minima.

we built a training methodology in two steps: first you train them in 16-bit precision, finding the absolute minimum, then retrain them with 6-bit precision, and the NN basically learns to cope with the precision loss on its own.

funny part is, the fewer bits you have, the more robust the network becomes, because error correction becomes a normal part of its transfer function.

we couldn't make the network converge at 4 bits, however. We tried using different transfer functions, but then ran out of time before getting meaningful results (each function needs its own back-propagation adjustment and things like that take time, I'm not a mathematician :D)

fgimenez 566 days ago [-]

I had similar empirical results on one of my PhD projects for medical image classification. With small data sets, we got better results on 8-bit data sets compared to 16-bit. We viewed it as a form of regularization that was extremely effective on smaller data sets with a lot of noise (x-rays in this case).

tachyonbeam 565 days ago [-]

When using 8-bit weights, what kind of mapping do you do? Do you map the 8-bit range into -10 to 10? Do you have more precision near zero or is it a linear mapping?

LoSboccacc 565 days ago [-]

Don't know about him, but I was working with -8..8 for input and -4..4 for weights; using the atan function for the transfer maps quite well, and there is no need to oversaturate the next layer.

Houshalter 566 days ago [-]

The problem with using digital calculations is that they are deterministic. If a result is really small, it is just rounded down to zero. So if you add a bunch of small numbers, you get zero. Even if the result should be large.

Stochastic rounding can fix this. You round each step with a probability such that its expected value is the same. Usually it will round down to 0, but sometimes it will round up to 1.

Relevant paper, using stochastic rounding. Without it the results get worse and worse before you even get to 8 bits. With stochastic rounding, there is no performance degradation. You could probably even reduce the bits even further. I think it may even be possible to get it down to 1 or 2 bits: http://arxiv.org/abs/1502.02551

The relevant graph: https://i.imgur.com/cOZ4fn3.jpg
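A minimal sketch of stochastic rounding as described above: round down or up at random with probability given by the fractional part, so the expected value of the rounded result equals the true value. The stochastic_round helper is mine; the Gupta et al. paper applies the same idea inside a fixed-point training pipeline.

    import numpy as np

    def stochastic_round(x, rng):
        """Round to an integer such that E[rounded value] == x."""
        floor = np.floor(x)
        frac = x - floor
        return floor + (rng.random(x.shape) < frac)

    rng = np.random.default_rng(0)
    tiny_updates = np.full(10000, 0.001)              # updates far below the rounding step

    print(np.round(tiny_updates).sum())               # 0.0 -- deterministic rounding loses them all
    print(stochastic_round(tiny_updates, rng).sum())  # ~10 on average -- the accumulated sum survives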

hyperion2010 566 days ago [-]

Point of interest: if you do the fundamental physics on neuronal membranes, the number of levels that are actually distinguishable given the noise in the system is only about 1000. So even in a biological system there are only 4x the number of discrete levels. I realize this isn't a good match to what is mentioned in the article but it does put some constraints on the maximum dynamic range that biological sensors have to work within.

tajen 566 days ago [-]

1024 levels = 10 bits. The article mentions 8 bits, which is 256 levels. Now I get what you mean with 4x.

rdlecler1 566 days ago [-]

These networks ought to be robust to minor changes in W. It's the topology that matters, and frankly most of the W_ij != 0 are spurious connections -- meaning perturbation analysis will show that they play no causal role in the computation. I wrote a paper on this which has >100 citations (Survival of The Sparsest: Robust Gene Networks are Parsimonious). I used gene networks, but this is just a special case of neural networks. In fact there have been a bunch of papers published on gene regulatory networks that show that topology is the main driver of function -- not surprising: if you show the circuit diagram of an 8-bit adder to an EE, they'll know exactly the function. Logically it has to be so. In fact you can model the gene network of the drosophila segmentation pattern with Boolean (1-bit) networks. The problem with ANN research is that few take the time to understand why things function as they do. We should be reverse engineering these from biology. Every time a major advancement is made in ANNs, neurobiologists say "yes, we could have told you that ten years ago"; deep learning is just the latest example. It will hit its asymptote soon, then people will say that AI failed to live up to its expectations, then someone will make a new discovery. It's very frustrating to sit on the sidelines and watch this happen again and again.

JuliaLang 566 days ago [-]

Care to do it yourself?

TD-Linux 566 days ago [-]

>On the general CPU side, modern SIMD instruction sets are often geared towards float, and so eight bit calculations don’t offer a massive computational advantage on recent x86 or ARM chips.

This isn't true, modern SIMD instruction sets have tons of operations for smaller fixed point numbers, as used heavily in video codecs. Unless the author meant some sort of weird 8 bit float?

afsina 566 days ago [-]

Funny that I just finished initial implementation of the code that uses the techniques from the paper (Vanhoucke et al.) mentioned in the post.

https://github.com/ahmetaa/fast-dnn

dnautics 566 days ago [-]

Agreed. I'm working on an 8-bit floating point format that is optimized for learning algos, and optimized to be easy to soft-emulate and also very efficient in hardware. One of the cool things about this float is that transfer functions (like the logistic) basically become a lookup table for really good performance.

Also, there is no strong need for "zero".

Animats 566 days ago [-]

That's fascinating, especially since very slow training, where the weights don't change much per cycle, is in fashion. One would think that would result in changes rounding down to zero and nothing happening, but apparently it doesn't.

jokoon 566 days ago [-]

Wouldn't that mean that using 8-bit cores would be enough to simulate neural networks? That might significantly reduce the number of transistors, thus increasing the number of cores and parallelism.

---

tangentially interesting (a 'high-end' target: https://devblogs.nvidia.com/parallelforall/inside-volta/ )

---

DonbunEf7 8 days ago [-]

I want to buy RISC-V, both to play with and to support the cause. What are my options like and should I buy something now or wait for the next generation?

reply

ktta 8 days ago [-]

Apart from buying an actual dev kit like the HiFive from SiFive, a great way to get into it is to buy SiFive's FPGA kits

https://dev.sifive.com/freedom-soc/evaluate/fpga/

The FPGA board can be used for other things too. If you are really adventurous, I'd suggest buying an FPGA board with a better chip so you can fit in larger IP blocks in the future. It will work perfectly fine as a replacement for the above FPGA kit I've linked to.

My suggestion would be this[1]. It has a pretty large LUT count so you can go nuts. The RAM and Ethernet will be pretty useful if you want to run linux[2] and test out stuff. It'll be a bit hard to run linux on it right now.

On the other hand, you can choose a Parallella board[3] which comes with an FPGA chip along with a new (soon to be retired) arch called Epiphany. Here[4] is a GSoC project which runs Linux on that board using the FPGA.

[1]: http://store.digilentinc.com/nexys-4-ddr-artix-7-fpga-trainer-board-recommended-for-ece-curriculum/ [2]: https://github.com/riscv/riscv-linux [3]: https://www.parallella.org/ [4]: https://github.com/eliaskousk/parallella-riscv

reply

e12e 8 days ago [-]

I see that what they recommend for the freedom platform

https://www.sifive.com/products/freedom/

today, is a dev board

https://dev.sifive.com/freedom-soc/evaluate/fpga/

that costs around 3500 USD:

https://www.avnet.com/shop/us/p/kits-and-tools/development-kits/xilinx/ek-v7-vc707-g-3074457345626227804/

Does anyone here know what kind of "classic pc" performance one is likely to get out of a board like that? Could you get on the order of a low-end pc (~500 dollar soc / netbook with real gigabit ethernet, sata6 and usb3) from something like that, if paired up with a reasonable cpu design?

reply

brucehoult 7 days ago [-]

The $3500 board is a lot, yes. There is also a $99 "Arty" board that is enough to run what is in the HiFive1, and I think even to add an MMU and run Linux. The $3500 one has enough space to do multiple cores, FPU and so forth as well.

Both the $99 and $3500 FPGA boards run a Rocket CPU at 65 MHz, which is pretty slow, though faster than you get with a software cycle-accurate simulator on a workstation.

reply

---

" Details on the Arduino Cinque are slim at the moment, but from what we’ve seen so far, the Cinque is an impressively powerful board featuring the RISC-V FE310 SoC? from SiFive?, an ESP32, and an STM32F103. The STM32 appears to be dedicated to providing the board with USB to UART translation, something the first RISC-V compatible Arduino solved with an FTDI chip. ... We’ve taken a look at SiFive’s? FE310 SoC?, and it is an extremely capable chip. It was released first at the HiFive?1, and our hands-on testing revealed this is a chip that outperforms the current performance champ of the Arduino world, the Teensy 3.6. ... 320MHz, so about on par with a cheap wireless router. ...With 3 processors, it’s probably not going to be dirt cheap. ... I’m pretty sure the FE310 has 16kb of ram, 16kb of instruction cache and uses off-chip QSPI flash. "

" Like the HiFive?1, the Arduino Cinque features SiFive’s? MCU-like Freedom E310, the first commercially available RISC-V SoC?. The 320MHz SoC? “is one of the fastest microcontrollers available in the market,” says SiFive?.

The Arduino Cinque appears to have the same footprint and pin configuration as the HiFive?1, as well as a similar layout of the processor, micro-USB port, power jack, and wake and reset buttons. Other components differ, however.

Aside from its fast, open source processor, the HiFive?1 is a fairly standard Arduino compatible. The 68 x 51mm board features 128Mbit off-chip SPI flash, 19x digital I/O pins, 9x PWM pins, an SPI controller, and 3x hardware CS pins. ... The HiFive?1 is further equipped with a wakeup pin and 19x interrupt pins. A micro-USB port can be used for programming, debug, and serial communications, in addition to providing 5V power. The board can also draw power from a 7-12 DC input jack.

The HiFive?1 can be programmed with the Arduino IDE, and ships with an open source Freedom E SDK that supports FreeRTOS?. The Freedom E SDK page on GitHub? mentions IDE support for Ubuntu.

The FE310 SoC? that drives the HiFive?1 and Cinque boards is equipped with 16KB L1, a 16KB Data SRAM scratchpad, and “hardware multiply/divide.” There’s also a debug module, “flexible clock generation with on-chip oscillators and PLLs,” and I/O support including UARTs, QSPI, PWMs, and timers.

SiFive? is also selling the FE310 outright under an open source license, letting customers download their own RTL (Register Transfer Logic) onto the chips. However, the company is primarily building a “chips-as-a-service” customization business.

By combining a RISC-V with Espressif’s ESP32, Arduino and SiFive? have served up two of the leading Cinderella stories in computing over the last few years. The ESP32 wireless SoC? is a higher-end sibling to the ESP8266, and appears to be every bit as popular. It similarly supports either standalone operation or use as a slave device, for example as a subsystem incorporated into an Arduino board.

Espressif ESP32

Unlike the ESP8266, the ESP32 provides dual-mode Bluetooth 4.2 with legacy classic and LE (low energy) support. The SoC? also offers faster, up to 150Mbps HT40 (40MHz channel width) 2.4GHz WiFi? compared to the previous HT20 WiFi?. "

" Also at Maker Faire Bay Area, Arduino showcased its new Arduino LoRa? Gateway and LoRa? Node shields that run on Arduino boards. Due to arrive later this year, the boards will be offered in a LoRa? Gateway Shield Kit for the Linino Linux-enabled Arduino Tian, and a LoRa? Node Shield Kit designed for the Arduino Primo or other Arduinos with at least 32KB of flash. "

---

https://en.wikipedia.org/wiki/Motorola_MC14500B

ISA:

Table 1. MC14500B Instruction Set

NOPO. No change in registers. RR -> RR, Flag O ->
LD. Load result register. Data -> RR
LDC. Load complement. Data -> RR
AND. Logical AND. RR Data -> RR
ANDC. Logical AND complement. RR Data -> RR
OR. Logical OR. RR + Data -> RR
ORC. Logical OR complement. RR + Data -> RR
XNOR. Exclusive NOR. If RR = Data, RR -> 1
STO. Store. RR -> Data Pin, Write ->
STOC. Store complement. RR -> Data Pin, Write ->
IEN. Input enable. Data -> IEN Register
OEN. Output enable. Data -> OEN Register
JMP. Jump. JMP Flag ->
RTN. Return. RTN Flag -> and skip next instruction
SKZ. Skip next instruction if RR = 0
NOPF. No change in registers. RR -> RR, Flag F ->
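A tiny Python sketch of how a 1-bit ICU like this executes, covering only the data path (RR plus the input/output enable latches) and a subset of the listed instructions; the JMP/RTN flags, the external program counter, and some edge-case behaviors of the real MC14500B are left out, and the run_icu helper and pin-name scheme are mine.

    # Minimal software model of an MC14500B-style 1-bit processor.
    # State: RR (1-bit result register), IEN/OEN (input/output enable latches).
    def run_icu(program, inputs, outputs):
        rr, ien, oen = 0, 1, 1
        for op, pin in program:
            raw = inputs.get(pin, 0)
            data = raw if ien else 0            # IEN=0 forces input data to 0 (simplified)
            if op == 'LD':     rr = data
            elif op == 'LDC':  rr = 1 - data
            elif op == 'AND':  rr &= data
            elif op == 'ANDC': rr &= 1 - data
            elif op == 'OR':   rr |= data
            elif op == 'ORC':  rr |= 1 - data
            elif op == 'XNOR': rr = 1 if rr == data else 0
            elif op == 'STO' and oen:  outputs[pin] = rr        # writes gated by OEN
            elif op == 'STOC' and oen: outputs[pin] = 1 - rr
            elif op == 'IEN':  ien = raw
            elif op == 'OEN':  oen = raw
        return rr, outputs

    # Ladder-logic style: lamp = switch_a AND (NOT switch_b)
    prog = [('LD', 'a'), ('ANDC', 'b'), ('STO', 'lamp')]
    print(run_icu(prog, {'a': 1, 'b': 0}, {}))   # (1, {'lamp': 1})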

---

http://www.cardwerk.com/smartcards/smartcard_technology.aspx

" The micromodule has eight metallic pads on its surface, each designed to international standards for VCC (power supply voltage), RST (used to reset the microprocessor of the smart card), CLK (clock signal), GND (ground), VPP (programming or write voltage), and I/O (serial input/output line). Two pads are reserved for future use (RFU). Only the I/O and GND contacts are mandatory on a card to meet international standards; the others are optional. ... Smart cards are always reset when they are inserted into a CAD. This action causes the smart card to respond by sending an "Answer-to-Reset " (ATR) message, which informs the CAD, what rules govern communication with the card and the processing of a transaction. ... Typically, older version smart cards are based on relatively slow, 8-bit embedded microcontrollers. The trend has been toward using customized controllers with a 32-bit Reduced Instruction Set Computing (RISC) processor running at 25 to 32 MHz. The I/O Controller manages the flow of data between the Card Acceptance Device (CAD) and the microprocessor. ... Error Correction: Current Chip Operating Systems (COS) perform their own error checking. The terminal operating system must check the two-byte status codes returned by the COS (as defined by both ISO 7816 Part 4 and the proprietary commands) after the command issued by the terminal to the card. The terminal then takes any necessary corrective action.

Storage Capacity: EEPROM: 8K - 128K bit. (Note that in smart card terminology, 1K means one thousand bits, not one thousand 8-bit characters. One thousand bits will normally store 128 characters - the rough equivalent of one sentence of text. However, with modern data compression techniques, the amount of data stored on the smart card can be significantly expanded beyond this base data translation.) ... First Time Read Rate: ISO 7816 limits contact cards to 9600 baud transmission rate; some Chip Operating Systems do allow a change in the baud rate after chip power up; a well designed application can often complete a card transaction in one or two seconds. Speed of Recognition: Smart cards are fast. Speed is only limited by the current ISO Input/Output speed standards. ... Processing Power: Older version cards use an 8-bit micro-controller clockable up to 16 MHz with or without co-processor for high-speed encryption. The current trend is toward customized controllers with a 32-bit RISC processor running at 25 to 32 MHz.

Power Source: 1.8, 3, and 5 volt DC power sources.

Support Equipment Required for Most Host-based Operations: Only a simple Card Acceptance Device (that is, a card reader/writer terminal) with an asynchronous clock, a serial interface, and a 5-volt power source is required. For low volume orders, the per unit cost of such terminals runs about $150. The ... COS Standards - Although smart cards conform to a set of international standards, there is currently no standard Chip Operating System (COS), or anything as common as Microsoft's Windows, or UNIX. Each smart card vendor provides the market with a distinct product. The key discriminator among smart card products is the proprietary operating system each offers to the customer. "

some examples of other negotiable baud rates are at: https://electronics.stackexchange.com/questions/282313/is-this-possible-to-transmit-higher-data-rate-with-9600-bdrate-clock-in-iso7816 . i don't understand it, but in the example, 111875 baud is reached
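
the arithmetic behind that: ISO 7816-3 derives the bit rate as f * Di / Fi, where f is the clock the reader supplies and Fi/Di are negotiated adjustment factors; the default Fi = 372, Di = 1 gives the familiar ~9600 baud at a ~3.58 MHz clock. The values below (f = 3.58 MHz, Fi = 512, Di = 16) are one assumed combination that reproduces the 111875 figure:

    #include <stdio.h>

    /* ISO 7816-3 bit rate: baud = f * Di / Fi. The clock and the Fi/Di pair in
       the "negotiated" case are assumptions chosen to reproduce the 111875 baud
       figure mentioned in the linked answer. */
    int main(void)
    {
        double f = 3580000.0;                                   /* reader clock, Hz (assumed) */
        printf("default (Fi=372, Di=1):     %.0f baud\n", f * 1.0 / 372.0);   /* ~9624  */
        printf("negotiated (Fi=512, Di=16): %.0f baud\n", f * 16.0 / 512.0);  /* 111875 */
        return 0;
    }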

https://en.wikipedia.org/wiki/EMV#Commands gives the following subset of commands as part of ISO/IEC 7816-4, "interindustry commands used for many chip card applications such as GSM SIM cards":

    external authenticate (7816-4)
    get data (7816-4)
    internal authenticate (7816-4)
    read record (7816-4)
    select (7816-4)
    verify (7816-4)

these can be read about here: http://www.cardwerk.com/smartcards/smartcard_standard_ISO7816-4_6_basic_interindustry_commands.aspx
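
for a concrete sense of what these commands look like on the wire, here is a sketch of a SELECT command APDU (ISO 7816-4 INS 0xA4, P1 = 0x04 for select-by-AID); the AID bytes are a made-up placeholder, and the two-byte status word mentioned above (SW1 SW2, with 90 00 meaning success) comes back at the end of the card's response:

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of an ISO 7816-4 SELECT command APDU (select by AID). The AID
       below is a hypothetical placeholder, not a real application identifier. */
    int main(void)
    {
        const uint8_t aid[] = { 0xA0, 0x00, 0x00, 0x00, 0x00 };  /* placeholder AID */
        uint8_t apdu[5 + sizeof aid + 1];
        size_t n = 0;

        apdu[n++] = 0x00;                 /* CLA: interindustry class       */
        apdu[n++] = 0xA4;                 /* INS: SELECT                    */
        apdu[n++] = 0x04;                 /* P1 : select by DF name (AID)   */
        apdu[n++] = 0x00;                 /* P2 : first or only occurrence  */
        apdu[n++] = (uint8_t)sizeof aid;  /* Lc : length of command data    */
        for (size_t i = 0; i < sizeof aid; i++)
            apdu[n++] = aid[i];
        apdu[n++] = 0x00;                 /* Le : maximum expected response */

        for (size_t i = 0; i < n; i++)
            printf("%02X ", (unsigned)apdu[i]);
        printf("\n");
        /* The card answers with optional response data followed by the
           two-byte status word SW1 SW2; 90 00 indicates success. */
        return 0;
    }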

this page is from 2003: " Traditionally this is an 8-bit microcontroller but increasingly more powerful 16 and 32-bit chips are being used. However, none have multi-threading and other powerful features that are common in standard computers. Smart Card CPUs execute machine instructions at a speed of approximately 1 MIPS. A coprocessor is often included to improve the speed of encryption computations. ... There are three main types of memory on cards:

    RAM. 1K. This is needed for fast computation and response. Only a tiny amount is available.
    EEPROM (Electrically Erasable PROM). Between 1 and 24K. Unlike RAM, its contents are not lost when power is removed. Applications can run from it and write to it, but it is very slow and one can only write to it so many (100 000) times.
    ROM. Between 8 and 24K. The Operating System and other basic software like encryption algorithms are stored here.

Input/Output This is via a single I/O port that is controlled by the processor to ensure that communications are standardized, in the form of APDUs (A Protocol Data Unit). Interface Devices (IFDs) Smart Cards need power and a clock signal to run programs, but carry neither. Instead, these are supplied by the Interface Device - usually a Smart Card Reader - in contact with the card. This obviously means that a Smart Card is nothing more than a storage device while being warmed in your pocket.

In addition to providing the power and clock signals, the reader is responsible for opening a communication channel between application software on the computer and the operating system on the card. Nearly all Smart Card readers are actually reader/writers, that is, they allow an application to write to the card as well as read from it.

The communication channel to a Smart Card is half-duplex. This means that data can either flow from the IFD to the card or from the card to the IFD but data cannot flow in both directions at the same time. The receiver is required to sample the signal on the serial line at the same rate as the transmitter sends it in order for the correct data to be received. This rate is known as the bit rate or baud rate. Data received by and transmitted from a Smart Card is stored in a buffer in the Smart Card's RAM. As there isn't very much RAM, relatively small packets (10 - 100 bytes) of data are moved in each message.

Here is a selection of parameters from some of the smart cards on the market today. They are neither the biggest nor the fastest; that is reserved for Java cards. The reason for this is price --- smart cards like these are programmed in assembly language and do not need much in the way of resources. To keep down costs, they don't get resources.

    Infineon SLE 44C10S: 8-bit, 9K ROM, 1K EEPROM, 256b RAM, 2.7 - 5.5V, 5 MHz, 500 000 write/erase cycles, 9600 baud
    Orga ICC4: 8-bit, 6K ROM, 3K EEPROM, 128b RAM, 4.7 - 5.3V, 10 000 write/erase cycles
    GemCombi?: 8-bit, 5K memory, 4.5 - 5.5V, 13.6 MHz, 100 000 write/erase cycles, 106 kbaud
    DNP Risona: 8-bit, 1K memory, 5V, 3.5 MHz, 9600 baud
    AmaTech? Contactless: 8-bit, 1K memory, 5V, 13.6 MHz, 100 000 cycles
    Schlumberger Cyberflex: 8/16-bit, 8K ROM, 16K EEPROM, 256b RAM, 5V, 1-5 MHz, 100 000 cycles, 9600 baud

... Operating Systems: The operating system found on the majority of Smart Cards implements a standard set of commands (usually 20 - 30) to which the Smart Card responds. Smart Card standards such as ISO 7816 and CEN 726 describe a range of commands that Smart Cards can implement. Most Smart Card manufacturers offer cards with operating systems that implement some or all of these standard commands (and possibly extensions and additions). Most SmartCards? are currently programmed in low-level languages based on proprietary SmartCard? operating systems. Some of the programming has been done in the chip's native instruction set (generally Motorola 6805, Intel 8051, or Hitachi H8). Not many programmers are capable of this.

In 1998- 2000, a new type of card has shown up, sometimes called a re-configurable card. These have a more robust operating system that permits the addition or deletion of application code after the card is issued. Such cards are generally programmed in Java and are therefore called Java Cards. Other relatively popular languages relate to Windows for SmartCards? or MEL (the Multos programming language) or even Basic. Although memory-efficient programming will still be essential, this greatly increases the pool of programmers capable of creating software for Smart Cards. " -- [10]

" Currently, the most notable operating systems on the market are:

    JavaCard for program development using Java.
    MULTOS is the first, open, high security, multi-application operating system for smart cards. MULTOS allows you to dynamically load, update, or delete any application during the life of the card.

" -- [11]

from 2007:

" 8-bit CISC / RISC: 6805 / 8051 / Z80 / H8 / AVR ... ROM capacity 8 - 512 KB ... EEPROM capacity 1 - 512 KB ... RAM capacity 128 - 16384 bytes and increasing

" [12]

"The YubiKey? NEO is a two-chip design. There is one “non-secure” USB interface controller and one secure crypto processor, which runs Java Card (JCOP 2.4.2 R1). ... The YubiKey? 4 is a single-chip design without a Java Card/Global Platform environment, featuring RSA with key lengths up to 4096 bits and ECC up to 521 bits. Yubico has developed the firmware from the ground up. These devices are loaded by Yubico and cannot be updated. ... As we began to produce the NEO in larger volumes, we had to make some tough choices:...Given that the NXP toolchain and extended libraries for JCOP are not free and available, applet development becomes more a theoretical possibility than a practical one. "

-- https://www.yubico.com/2016/05/secure-hardware-vs-open-source/

---

" Basically, the HiFive? 1 is the SiFive? FE310 microcontroller packaged in an Arduino Uno form factor. The pin spacing is just as stupid as it’s always been, and there is support for a few Adafruit shields sitting around in the SDK.

There are no analog pins, but there are two more PWM pins compared to the standard Arduino chip. The Arduino Uno and Leonardo have 32 kilobytes of Flash, while the HiFive? 1 has sixteen Megabytes of Flash on an external SOIC chip. "

" the HiFive? 1 is fast. Really, really fast. Right now, if you want to build a huge RGB LED display, you have one good option: the Teensy 3.6. If you need a microcontroller to pump a lot of data out, the Teensy has the power, the memory, and the libraries to do it easily. In this small but very demanding use case, the HiFive? 1 might be better. The HiFive? 1 has more Flash (although it’s an SPI Flash), it has DMA, and it has roughly twice the processing power as the Teensy 3.6. "

" Arduino Cinque board equipped with SiFive? Freedom E310 processor, ESP32 for WiFi? and Bluetooth, and an STM32 ARM MCU to handle programming. "

" MCU – SiFive? Freedom E310 (FE310) 32-bit RV32IMAC processor @ up to 320+ MHz (1.61 DMIPS/MHz) WiSoC? – Espressif ESP32 for WiFi? and Bluetooth 4.2 LE Storage – 32-Mbit SPI flash "

" the big story here is the Openness of the HiFive? 1. Is it completely open? No. the HiFive? 1 itself uses an FTDI chip, and I’ve heard rumor and hearsay the FE310 chip has proprietary bits that are ultimately inconsequential to the function of the chip. ...Nevertheless, this is the best we have so far, and it is only the beginning. "

" The HiFive? 1 supports 3.3 and 5V I/O, thanks to three voltage level translators. The support for 5V logic is huge in my opinion — nearly every dev board manufacturer has already written off 5V I/O as a victim of technological progress. The HiFive? doesn’t, even though the FE310 microcontroller is itself only 3.3V tolerant. It should be noted the addition of the voltage level translators add at least a dollar or two to the BOM, and double that to the final cost of the board. It’s a nice touch, but there’s room for cost cutting here. "

" There’s no availability or price information yet, but considering that the HiFive?1 board now sells for $59, the Arduino Cinque may cost about the same or a little more once it launches, since it adds an ESP32 chip but has a smaller SPI flash. Hopefully, it will take less time than the one-year gap between the announcement and the release of the Arduino Due. "

"

May 22nd, 2017 at 14:55:

$59 is really expensive for a 300 MHz processor board.

willmore May 22nd, 2017 at 18:59:

Is that an STM32 just as the USB interface?

That board is crazy wasteful of resources. Any one of the chips on there would have been quite enough. "

" Details on the Arduino Cinque are slim at the moment, but from what we’ve seen so far, the Cinque is an impressively powerful board featuring the RISC-V FE310 SoC? from SiFive?, an ESP32, and an STM32F103. The STM32 appears to be dedicated to providing the board with USB to UART translation, something the first RISC-V compatible Arduino solved with an FTDI chip. Using an FTDI chip is, of course, a questionable design decision when building a capital ‘O’ Open microcontroller platform, and we’re glad SiFive? and Arduino found a better solution. It’s unknown if this STM32 can be used alongside the FE310 and ESP32 at this point. "

" STM32F103 devices use the Cortex-M3 core, with a maximum CPU speed of 72 MHz. The portfolio covers from 16 Kbytes to 1 Mbyte of Flash with motor control peripherals, USB full-speed interface and CAN. "

" The STM32F103xx family of microcontrollers consists of an ARM Cortex-M3 32-bit RISC core, high-speed embedded memories (up to 128 Kbytes of Flash and up to 20 Kbytes of SRAM), I/Os (Input/Output), and peripherals, all connected together over two APB (Advanced Peripheral Bus) buses. "

" Baltic Engineering Co. has decided to use the STM32F103 microcontroller in most of its projects because it has many high-level features compared with other microcontrollers. It is a best-in-class 32-bit MCU for control and connectivity in electronics projects, can handle DSP (Digital Signal Processing) workloads thanks to its high-frequency performance, has low-power modes to save system power, and offers increased peripheral speed for better performance, etc. "

" The heart of the HiFive?1 is SiFive’s? FE310 SoC?, a 32-bit RISC-V core running at 320+ MHz." "

---

" RISC-V ISA Overview The RISC-V ISA is defined as a base integer ISA, which must be present in any implementation, plus optional extensions to the base ISA. The base integer ISA is very similar to that of the early RISC processors except with no branch delay slots and with support for optional variable-length instruction encodings. The base is carefully restricted to a minimal set of instructions sufficient to provide a reasonable target for compilers, assemblers, linkers, and operating systems (with additional supervisor-level operations), and so provides a convenient ISA and software toolchain “skeleton” around which more customized processor ISAs can be built ... Each base integer instruction set is characterized by the width of the integer registers and the corresponding size of the user address space. There are two primary base integer variants, RV32I and RV64I, described in Chapters 2 and 4, which provide 32-bit or 64-bit user-level address spaces respectively. Hardware implementations and operating systems might provide only one or both of RV32I and RV64I for user programs. Chapter 3 describes the RV32E subset variant of the RV32I base instruction set, which has been added to support small microcontrollers ... The base RISC-V ISA has fixed-length 32-bit instructions that must be naturally aligned on 32-bit boundaries. However, the standard RISC-V encoding scheme is designed to support ISA extensions with variable-length instructions, where each instruction can be any number of 16-bit instruction parcels in length and parcels are naturally aligned on 16-bit boundaries. ... We chose little-endian byte ordering for the RISC-V memory system because little-endian systems are currently dominant commercially (all x86 systems; iOS, Android, and Windows for ARM). A minor point is that we have also found little-endian memory systems to be more natural for hardware designers ... Chapter 2 RV32I Base Integer Instruction Set, Version 2.0

This chapter describes version 2.0 of the RV32I base integer instruction set. Much of the commentary also applies to the RV64I variant. RV32I was designed to be sufficient to form a compiler target and to support modern operating system environments. The ISA was also designed to reduce the hardware required in a minimal implementation. RV32I contains 47 unique instructions, though a simple implementation might cover the eight SCALL/SBREAK/CSRR* instructions with a single SYSTEM hardware instruction that always traps and might be able to implement the FENCE and FENCE.I instructions as NOPs, reducing hardware instruction count to 38 total. RV32I can emulate almost any other ISA extension (except the A extension, which requires additional hardware support for atomicity).

...

Figure 2.1 shows the user-visible state for the base integer subset. There are 31 general-purpose registers x1 – x31, which hold integer values. Register x0 is hardwired to the constant 0. There is no hardwired subroutine return address link register, but the standard software calling convention uses register x1 to hold the return address on a call. For RV32, the x registers are 32 bits wide, and for RV64, they are 64 bits wide. This document uses the term XLEN to refer to the current width of an x register in bits (either 32 or 64). There is one additional user-visible register: the program counter pc holds the address of the current instruction. The number of available architectural registers can have large impacts on code size, performance, and energy consumption. Although 16 registers would arguably be sufficient for an integer ISA running compiled code, it is impossible to encode a complete ISA with 16 registers in 16-bit instructions using a 3-address format. Although a 2-address format would be possible, it would increase instruction count and lower efficiency. We wanted to avoid intermediate instruction sizes (such as Xtensa’s 24-bit instructions) to simplify base hardware implementations, and once a 32-bit instruction size was adopted, it was straightforward to support 32 integer registers. A larger number of integer registers also helps performance on high-performance code, where there can be extensive use of loop unrolling, software pipelining, and cache tiling. For these reasons, we chose a conventional size of 32 integer registers for the base ISA. Dynamic register usage tends to be dominated by a few frequently accessed registers, and regfile implementations can be optimized to reduce access energy for the frequently accessed registers [26]. The optional compressed 16-bit instruction format mostly only accesses 8 registers and hence can provide a dense instruction encoding, while additional instruction-set extensions could support a much larger register space (either flat or hierarchical) if desired. For resource-constrained embedded applications, we have defined the RV32E subset, which only has 16 registers (Chapter 3). " -- [13]
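
since the base 32-bit instructions keep the register specifiers in fixed positions, decoding is just bit slicing; here is a small illustrative C decoder for the standard R-type fields (opcode [6:0], rd [11:7], funct3 [14:12], rs1 [19:15], rs2 [24:20], funct7 [31:25]), using an example word that encodes add x1, x2, x3:

    #include <stdint.h>
    #include <stdio.h>

    /* Field extraction for a base 32-bit RISC-V instruction word (R-type). */
    typedef struct {
        unsigned opcode, rd, funct3, rs1, rs2, funct7;
    } rtype;

    static rtype decode_rtype(uint32_t insn)
    {
        rtype r;
        r.opcode = insn & 0x7F;          /* bits  6:0  */
        r.rd     = (insn >> 7)  & 0x1F;  /* bits 11:7  */
        r.funct3 = (insn >> 12) & 0x07;  /* bits 14:12 */
        r.rs1    = (insn >> 15) & 0x1F;  /* bits 19:15 */
        r.rs2    = (insn >> 20) & 0x1F;  /* bits 24:20 */
        r.funct7 = (insn >> 25) & 0x7F;  /* bits 31:25 */
        return r;
    }

    int main(void)
    {
        uint32_t insn = 0x003100B3;      /* add x1, x2, x3 (illustrative) */
        rtype r = decode_rtype(insn);
        printf("opcode=0x%02X rd=x%u rs1=x%u rs2=x%u\n",
               r.opcode, r.rd, r.rs1, r.rs2);
        return 0;
    }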

---

https://www.qualcomm.com/news/onq/2017/04/13/artificial-intelligence-tech-snapdragon-835-personalized-experiences-created

---

http://www.st.com/en/secure-mcus/st31-arm-sc000.html?querycriteria=productId=SC1617

secure MCU for smartcards. Cortex-M0 ARM 32-bit CPU.

RAM from 8-12 KB, ROM or flash from 192-480 KB, EEPROM from 16-40 KB

---

"With the Attitude Adjustment (12.09) release of OpenWrt?, all hardware devices with 16 MB or less RAM are no longer supported as they can run out of memory easily. " [14]

---

" CGRAs consist of an array of a large number of function units (FUs) interconnected by a mesh style network. Register files are distributed throughout the CGRAs to hold temporary values and are accessible only by a subset of FUs. The FUs can execute common word-level operations, including addition, subtraction, and multiplication. In contrast to FPGAs, CGRAs have short reconfiguration times, low delay characteristics, and low power consumption as they are constructed from standard cell implementations. Thus, gate-level reconfigurability is sacrificed, but the result is a large increase in hardware efficiency.

A good compiler is essential for exploiting the abundance of computing resources available on a CGRA. However, sparse connectivity and distributed register files present difficult challenges to the scheduling phase of a compiler. "

---

Intel discontinues Joule, Galileo, and Edison product lines (hackaday.com)

edmundhuber 10 hours ago [-]

Why did Edison fail:

The new hotness are the Espressif (ESP32) and MediaTek? (mt7697) SoCs?.

---

" The baseband market is currently going through a major shift: where several years ago Qualcomm was the unchallenged market leader, today the market has split up among several competitors. Samsung’s Shannon modems are prevalent in most of the newer Samsungs; Intel’s Infineon chips have taken over from Qualcomm as the baseband for iPhone 7 and above; and MediaTek’s? chips are a popular choice for lower cost Androids. And to top it off, Qualcomm is still dominant in higher end non-Samsung Androids. "

---

" All the BCM chips that we’ve observed run an ARM Cortex-R4 microcontroller. One of the system’s main quirks is that a large part of the code runs on the ROM, whose size is 900k. Patches, and additional functionality, are added to the RAM, also 900k in size. In order to facilitate patching, an extensive thunk table is used in RAM, and calls are made into that table at specific points during execution. Should a bug fix be issued, the thunk table could be changed to redirect to the newer code.

In terms of architecture, it would be correct to look at the BCM43xx as a WiFi? SoC?, since two different chips handle packet processing. While the main processor, the Cortex-R4, handles the MAC and MLME layers before handing the received packets to the Linux kernel, a separate chip, using a proprietary Broadcom processor architecture, handles the 802.11 PHY layer. Another component of the SoC? is the interface to the application processor: Older BCM chips used the slower SDIO connection, while BCM4358 and above use PCIe. "
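
the ROM-patching arrangement described above boils down to calling through a table of function pointers kept in RAM; a toy C sketch of the idea (the names, table size, and "patch" are invented for illustration, and the real firmware's table layout is Broadcom-specific):

    #include <stdio.h>

    /* ROM code calls through a RAM-resident table of function pointers, so a
       firmware patch can retarget individual entries without touching ROM. */
    static int rom_sum(int a, int b)     { return a + b; }       /* baked into ROM       */
    static int patched_sum(int a, int b) { return a + b + 1; }   /* hypothetical RAM fix */

    typedef int (*thunk_fn)(int, int);

    static thunk_fn thunk_table[16] = { rom_sum /* , ... */ };   /* lives in RAM */

    #define CALL_SUM(a, b) (thunk_table[0])((a), (b))            /* ROM calls via the table */

    int main(void)
    {
        printf("before patch: %d\n", CALL_SUM(2, 2));
        thunk_table[0] = patched_sum;    /* a "bug fix" redirects the thunk entry */
        printf("after patch:  %d\n", CALL_SUM(2, 2));
        return 0;
    }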

---

https://github.com/Dolu1990/FreeRTOS-RISCV

---

http://hackaday.com/2017/07/21/vexriscv-a-modular-risc-v-implementation-for-fpga/

---

"gcc is made to be portable as long as your architecture fits some predefined notions (for example, at least 32 bit integers and a flat address space)."

---

" We see increasing interest in Intel ME internals from researchers all over the world. One of the reasons is the transition of this subsystem to new hardware (x86) and software (modified MINIX as an operating system). "

" The x86 platform allows researchers to make use of the full power of binary code analysis tools. Previously, firmware analysis was difficult because earlier versions of ME were based on an ARCompact microcontroller with an unfamiliar set of instructions. "

" Intel ME 11 architecture overview Starting with the PCH 100 Series, Intel has completely redesigned the PCH chip. The architecture of embedded microcontrollers was switched from ARCompact by ARC to x86. The Minute IA (MIA) 32-bit microcontroller was chosen as the basis; it is used in Intel Edison microcomputers and Quark SoCs?, and is based on a rather old scalar Intel 486 microprocessor with the addition of a set of instructions (ISA) from the Pentium processor. "

" Such an overhaul required changing ME software as well. In particular, MINIX was chosen as the basis for the operating system (previously, ThreadX? RTOS had been used). Now ME firmware includes a full-fledged operating system with processes, threads, memory manager, hardware bus driver, file system, and many other components. A hardware cryptoprocessor supporting SHA256, AES, RSA, and HMAC is now integrated into ME. User processes access hardware via a local descriptor table (LDT). The address space of a process is also organized through an LDT—it is just part of the global address space of the kernel space whose boundaries are specified in a local descriptor. Therefore, the kernel does not need to switch between the memory of different processes (changing page directories), as compared to Microsoft Windows or Linux, for instance. "

---

http://lists.dragonflybsd.org/pipermail/users/2017-August/313558.html

"The allocator has a 16KB granularity (on HAMMER1 it was 2MB)...Allocations down to 1KB are supported. The freemap has a 16KB granularity with a linear counter (one counter per 512KB) for packing smaller allocations. INodes are 1KB and can directly embed 512 bytes of file data for files...The blockrefs are 'fat' at 128 bytes but enormously powerful. That will allow us to ultimately support up to a 512-bit crypto hash and blind dedup using said hash...Filenames up to 64 bytes long can be accommodated in the blockref using the check-code area of the blockref. Longer filenames will use an additional data reference hanging off the blockref to accommodate up to 255 char filenames. Of course, a minimum of 1KB will have to be allocated in that case, but filenames are <= 64 bytes in the vast majority of use cases so it just isn't an issue. ...

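purely to illustrate the sizes quoted above (128-byte blockrefs whose check-code area can hold a 512-bit hash or a filename of up to 64 bytes), here is a hypothetical C layout; the field names and everything outside the 64-byte check/filename union are invented and do not reflect the actual HAMMER2 on-disk format:

    #include <stdint.h>

    /* Hypothetical 128-byte block reference, sketched only to match the sizes
       quoted above. NOT the real HAMMER2 blockref layout. */
    struct blockref_sketch {
        uint8_t  type;             /* kind of block referenced                */
        uint8_t  check_method;     /* which check code / hash is in use       */
        uint16_t flags;
        uint32_t reserved;
        uint64_t data_offset;      /* media address of the referenced block   */
        uint64_t key;              /* e.g. directory hash key or file offset  */
        uint64_t aux[5];           /* filler so the whole struct is 128 bytes */
        union {
            uint8_t check[64];     /* room for up to a 512-bit crypto hash    */
            char    filename[64];  /* or a short (<= 64 byte) filename        */
        } u;
    };

    /* C11 compile-time check that the sketch matches the quoted size. */
    _Static_assert(sizeof(struct blockref_sketch) == 128,
                   "blockref sketch should be 128 bytes");
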
---

" A STM32F303VCT6 microcontroller. This microcontroller has

    A single core ARM Cortex-M4F processor with hardware support for single precision floating point operations and a maximum clock frequency of 72 MHz.
    256 KiB of "Flash" memory. (1 KiB = 1024 bytes)
    48 KiB of RAM.
    many "peripherals": timers, GPIO, I2C, SPI, USART, etc.
    lots of "pins" that are exposed in the two lateral "headers".
    IMPORTANT This microcontroller operates at (around) 3.3V." -- Discover the world of microcontrollers through Rust!

---

" SiFive? has taped out and started licensing its U54-MC Coreplex, its first RISC-V IP designed to run Linux. The design lags the performance of a comparable ARM Cortex-A53 but shows progress creating a commercial market for the open-source instruction set architecture.

A single 64-bit U54 core delivers 1.7 DMIPS/MHz or 2.75 CoreMark?/MHz at 1.5 GHz. It measures 0.234 mm2 including its integrated 32+32KB L1 cache "

---

Zigurd 22 hours ago [-]

JVMs can be simple and tiny. Pre-smartphone handsets ran Java apps on 20 MHz 32-bit ARM application processors. Some of the JVMs are so simple there is no thread preemption - just round-robin. A 64kb jar is big for that kind of device.

Fnoord 21 hours ago [-]

Yubikey Neo seems to run Java as well [1] [2]. I guess there was a good reason why Schwartz renamed SUNW to JAVA.

[1] https://www.reddit.com/r/yubikey/comments/6ji8ag/updating_yu...

[2] https://en.wikipedia.org/wiki/Java_Card

kuschku 21 hours ago [-]

Every modern credit/debit/EC card, and every modern ID card also runs Java – in fact, that’s how the smart functionality in ID cards, drivers licenses and insurance cards is implemented: http://www.personalausweisportal.de/EN/Citizens/German_ID_Ca...

---

" I mean, even outside of Shenzhen, if you want bluetooth the NRF52832 has fantastic hardware layout/design documentation and a...Cortex-M0+ processor for core logic and communicating with the BTLE stack. The ESP32 has wifi and bluetooth and has dealt with all that FCC ... (but uses a weird MCU core.) Etc, etc. "

---

the Apple A11 smartphone processor has 32k L1 icache and 32k L1 dcache https://en.m.wikipedia.org/wiki/Apple_A11

---

in response to a RISC-V question:

" Freedom E310[1] chip is available for purchase by normal consumers. It's a microcontroller and only has 16kB of RAM so it's probably not really what you are looking for. "

---

on an MCU for teaching assembly, in place of an RPi:

" And I guess application processors also probably introduce their own extra complexities, compared to a Cortex-M line which is more comparable to an Arduino than a RasPi? or Chromebook. But hey, there's also less hardware available with the sort of support that a Raspberry Pi has; the closest you'd probably get is a cheap generic $3-5 STM32F103C8 dev board. "

STM32F103C8 is an ARM Cortex-M3 MCU with 64 Kbytes Flash, 20k RAM, 72 MHz CPU, motor control, USB and CAN

---

" How much RAM does FreeRTOS? use?

This depends on your application. Below is a guide based on:

    IAR STR71x ARM7 port.
    Full optimisation.
    Minimum configuration
    Four priorities.

    Scheduler itself: 236 bytes (can easily be reduced by using smaller data types).
    For each queue you create, add: 76 bytes + the queue storage area (see the FAQ "Why do queues use that much RAM?").
    For each task you create, add: 64 bytes (includes 4 characters for the task name) + the task stack size. "
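
a rough worked example with those figures (the 128-byte stacks and the 10-entry queue of 4-byte items are made-up, and the numbers are specific to that IAR STR71x ARM7 configuration): a scheduler plus two such tasks plus one such queue comes to 236 + 2 * (64 + 128) + (76 + 40) = 736 bytes, e.g.:

    /* Worked sizing example using the figures quoted above. The per-task stack
       and the queue dimensions are invented for illustration. */
    enum {
        SCHEDULER_BYTES = 236,
        TASK_OVERHEAD   = 64,
        TASK_STACK      = 128,        /* assumed per-task stack depth */
        QUEUE_OVERHEAD  = 76,
        QUEUE_STORAGE   = 10 * 4,     /* ten 4-byte items             */
        TOTAL_RAM       = SCHEDULER_BYTES
                        + 2 * (TASK_OVERHEAD + TASK_STACK)
                        + (QUEUE_OVERHEAD + QUEUE_STORAGE)   /* = 736 bytes */
    };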

---

" Another factor – and probably a much more important factor – is that the machines were getting exponentially more powerful. In 1958 computers were slow, fragile, had 16K or less memory, and cost many millions of dollars. By 1972 you could buy a fast 64K PDP-7 for ~$100,000.

C held sway for a good long time. Not because computers weren’t getting more powerful; but because they were shrinking. They were shrinking in both size and cost. In 1980 you could buy an 8085 microcomputer with 64K of RAM for a few thousand dollars. And so even though the top end computers were getting more and more powerful; the bottom end machines remained the perfect size for C. "

---

" Putting these together in a product: Esperanto’s AI supercomputer on a chip. 16 64-bit ET-Maxion RISC-V cores with private L1 and L2 caches, 4096 64-bit ET-Minion RISC-V cores each with their own vector floating point unit, hardware accelerators, Network on Chip to allow processors to reside in the same address space, multiple levels of cache, etc. "