proj-oot-lowEndTargets-lowEndTargetsUnsorted6

https://blog.cloudflare.com/branch-predictor/

" Some time ago I was looking at a hot section in our code and I saw this:

    if (debug) {
        log("...");
    }

This got me thinking. This code is in a performance critical loop and it looks like a waste - we never run with the "debug" flag enabled[1]. Is it ok to have if clauses that will basically never be run? Surely, there must be some performance cost to that... ...

Top tip 1. On this CPU a branch instruction that is taken but not predicted, costs ~7 cycles more than one that is taken and predicted. Even if the branch was unconditional. ...

Top tip 2: conditional branches never-taken are basically free - at least on this CPU. ...

Top tip 3. In the hot code you want to have less than 2K function calls - on this CPU. ...

Top tip 4. On this CPU it's possible to get <1 cycle per predicted jmp when the hot loop fits in ~32KiB. ...

Top tip 5. On Intel avoid placing your jmp/call/ret instructions at regular 64-byte intervals. ...

Top tip 6. on M1 the predicted-taken branch generally takes 3 cycles and unpredicted but taken has varying cost, depending on jmp length. BTB is likely linked with L1 cache. ...

On x86 the hot code needs to split the BTB budget between function calls and taken branches. The BTB has only a size of 4096 entries. There are strong benefits in keeping the hot code under 16KiB.

On the other hand on M1 the BTB seems to be limited by L1 instruction cache. If you're writing super hot code, ideally it should fit 4KiB.

Finally, can you add this one more if statement? If it's never-taken, it's probably ok. I found no evidence that such branches incur any extra cost. But do avoid always-taken branches and function calls. "

so, the numbers that come up are 4 KiB, 16 KiB and 32 KiB of hot code, a 4096-entry BTB, and ~2K function calls.

the conclusion is also interesting for unrelated reasons: "Finally, can you add this one more if statement? If it's never-taken, it's probably ok. I found no evidence that such branches incur any extra cost. But do avoid always-taken branches and function calls."
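A minimal C sketch of the pattern being measured: a hot loop containing a `debug` branch that is never taken. The `debug` flag, `sum_array` and the `UNLIKELY` macro are hypothetical; `UNLIKELY` wraps the GCC/Clang builtin `__builtin_expect`, which the article does not use. That builtin is only a compiler hint (mostly about code layout), and going by the measurements above the never-taken branch should be close to free even without it. `fprintf` stands in for the quoted `log("...")`, since `log` would clash with `<math.h>`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flag standing in for "debug" in the quoted code. */
static bool debug = false;

/* GCC/Clang hint that a condition is usually false; a plain test elsewhere. */
#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)
#endif

/* Hot loop with a branch that is never taken while debug == false. */
long sum_array(const int *xs, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++) {
        if (UNLIKELY(debug)) {
            /* stands in for log("...") from the quote */
            fprintf(stderr, "i=%d x=%d\n", i, xs[i]);
        }
        total += xs[i];
    }
    return total;
}
```

The branch stays in the loop body; the point of the article is that, as long as it is never taken, it costs essentially nothing on the CPUs tested.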

---

"Winterbloom's primary microcontroller is the Microchip SAM D series, specifically the SAM D21, SAM D51, and SAM D11.

...

             SAM D11     SAM D21     SAM D51
CPU          Cortex-M0+  Cortex-M0+  Cortex-M4F
Clock speed  48 MHz      48 MHz      120 MHz
Max flash    16 kB       256 kB      1024 kB
Max RAM      4 kB        32 kB       256 kB
"

---

http://www.mynor.org/ http://www.mynor.org/tranor " MyNOR is a single-board computer that does not have a CPU. Instead, the CPU consists of discrete logic gates from the 74HC series. This computer also has no ALU. Only a single NOR gate is used to perform all computations such as addition, subtraction, AND, OR and XOR. This computer is not fast, it is rather slow. MyNOR can only perform 2600 8-bit additions per second, although it is clocked at 4 MHz. This is because everything is done in software. MyNOR has only a 32 kB ROM for program storage, but this is more than enough. The very slim microcode occupies only 9 kB, the remaining 23 kB are used for the application program. "
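To make "one NOR gate for everything" concrete, here is a minimal C sketch that derives NOT, OR, AND, XOR, addition and subtraction from a single 8-bit NOR primitive. It is not MyNOR's microcode (the quotes do not show it); the function names are hypothetical, and bit extraction and shifting use ordinary C operators, only the logic itself goes through `nor8`.

```c
#include <assert.h>
#include <stdint.h>

/* The single primitive: bitwise NOR on 8-bit values. */
static uint8_t nor8(uint8_t a, uint8_t b) { return (uint8_t)~(a | b); }

/* Everything else is built only from nor8. */
static uint8_t not8(uint8_t a)            { return nor8(a, a); }
static uint8_t or8(uint8_t a, uint8_t b)  { return not8(nor8(a, b)); }
static uint8_t and8(uint8_t a, uint8_t b) { return nor8(not8(a), not8(b)); }

/* XOR = (a OR b) AND NOT (a AND b) = NOR(NOR(a, b), AND(a, b)) */
static uint8_t xor8(uint8_t a, uint8_t b) { return nor8(nor8(a, b), and8(a, b)); }

/* 8-bit ripple-carry addition. Bits are picked out with C shifts; the
   full-adder logic uses only the NOR-derived gates.
   *carry is the carry-in on entry and the carry-out on return. */
uint8_t add8(uint8_t a, uint8_t b, uint8_t *carry)
{
    uint8_t sum = 0;
    uint8_t c = *carry & 1;
    for (int bit = 0; bit < 8; bit++) {
        uint8_t x = (a >> bit) & 1;
        uint8_t y = (b >> bit) & 1;
        uint8_t half = xor8(x, y);
        sum |= (uint8_t)(xor8(half, c) << bit);  /* sum bit */
        c = or8(and8(x, y), and8(c, half));      /* carry into next bit */
    }
    *carry = c;
    return sum;
}

/* Subtraction as a + NOT(b) + 1 (two's complement). */
uint8_t sub8(uint8_t a, uint8_t b)
{
    uint8_t carry = 1;
    return add8(a, not8(b), &carry);
}

/* Exhaustively compare the derived operations with C's own operators. */
void nor_selftest(void)
{
    for (int a = 0; a < 256; a++) {
        for (int b = 0; b < 256; b++) {
            uint8_t c = 0;
            assert(or8(a, b) == (a | b));
            assert(and8(a, b) == (a & b));
            assert(xor8(a, b) == (a ^ b));
            assert(add8(a, b, &c) == ((a + b) & 0xFF));
            assert(c == ((a + b) >> 8));
            assert(sub8(a, b) == ((a - b) & 0xFF));
        }
    }
}
```

With `carry = 0`, `add8(200, 100, &carry)` returns 44 and sets `carry` to 1 (200 + 100 = 300 = 256 + 44). As a rough check on the quoted speed: 4,000,000 cycles/s divided by 2600 additions/s is about 1538 clock cycles per 8-bit addition, consistent with everything being done in software rather than in an ALU.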

" TraNOR - A computer built with transistors

This is not a transistor computer. In fact, it is a computer with a CPU made up of discrete transistors, and the CPU instructions are composed by microcode stored in the same memory that also contains the application program.

This computer consists of 1897 transistors for the CPU, 598 transistors for the four 8-bit I/O ports and an additional 124 transistors for the LCD I/O board. Furthermore, three integrated memory chips were used. Of course it would be possible to build the SRAM chip with transistors and the ROM chip with transistors and diodes as well, but this would have required many more transistors.

The complexity of my design is somewhere between the Intel i4004 CPU (2250 transistors) and the 6502 (3218 transistors). The i8080 already has 4500 transistors and the Zilog Z80 even has 8500 transistors, so you get an idea how small my design still is. The most complex component that I use in the computer is the EEPROM memory chip. Even the LCD is more complex than the TraNOR CPU.

My design goal was to build a transistorized computer that is 100% compatible with MyNOR. I have reached this goal: software written for MyNOR runs on TraNOR at the same speed. And TraNOR doesn't need a special EPROM image either; it also works with MyNOR ROM v1.0 and later versions. "
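Adding up the quoted transistor counts shows where the "between the i4004 and the 6502" comparison comes from, assuming "my design" means the CPU plus the four I/O ports (the text does not say so explicitly):

```latex
\underbrace{1897}_{\text{CPU}} + \underbrace{598}_{\text{I/O ports}} = 2495,
\qquad 2250_{\,\text{i4004}} < 2495 < 3218_{\,\text{6502}}
```

Counting the 124 transistors of the LCD I/O board as well gives 2619, still in the same range, while the CPU alone (1897) would come in below the i4004.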

" Note: I could have built the computer without any memory chips. The 32 kB EPROM (ROM) could be replaced by a large diode matrix or a core rope memory, and the 8 kB SRAM could be replaced with many discrete flip-flops or discrete DRAM memory cells. And the 64 kB EEPROM is not needed at all. But for the sake of simplicity, I decided to use these chips anyway. "

" No CPU or MCU, no ALU, only one discrete NOR gate for computations
The 8-bit CPU is made of 15 CMOS logic chips, 2 transistors, a ROM and a RAM chip
Additional 4 CMOS logic chips are used to provide digital I/O
8 kB SRAM for CPU registers, program code and data
32 kB ROM (OTP EPROM) for microcode and program storage
64 kB EEPROM for 8 user programs, with auto-boot after power on
...
Hardware interrupts (except the non-maskable hardware reset) are not supported
A stack memory of 256 bytes enables nested subroutine calls
Up to 24 digital outputs and 8 digital inputs with integrated pull-ups
...
Slim microcode architecture with 28 instructions
The microcode occupies only 9 kB of the ROM, 23 kB are free for the OS
The Operating System provides lots of useful API functions
The OS contains a calculator program that can do floating point calculations
The OS contains a monitor program which allows directly programming MyNOR in assembly

"

"

Instruction Set

MyNOR is a CISC (complex instruction set) CPU with von Neumann architecture. Program code and data are stored together in the same RAM. Furthermore, the RAM is used to store the stack memory and also the CPU registers. Because CPU registers are stored in RAM, MyNOR is capable of dealing with up to 256 8-bit registers.

Instruction   Function
LD reg,#      Load register with immediate value
LD reg,reg    Load register with other register
LDA #         Load ACCU with immediate value
LDA reg       Load ACCU from register
STA reg       Store ACCU to register
LAP           Load ACCU through pointer
SAP           Store ACCU through pointer
ADD reg       Add register to ACCU (with carry)
AND reg       Perform AND operation on ACCU and register
DEC reg       Decrement register
INC reg       Increment register
OR reg        Perform OR operation on ACCU and register
ROL reg       Rotate register left (with carry)
ROR reg       Rotate register right (with carry)
SUB reg       Subtract register from ACCU (with carry)
XOR reg       Perform XOR operation on ACCU and register
CMP reg       Compare ACCU with register and set FLAG
CMP #         Compare ACCU with immediate value and set FLAG
TST reg       Test register for zero and set FLAG
JMP abs       Unconditional jump to absolute memory address
JNF abs       Jump to absolute memory address if FLAG = 0
JPF abs       Jump to absolute memory address if FLAG = 1
JSR abs       Call subroutine
RET           Return from subroutine
RST           Reset the CPU
IO port       Input or Output ACCU on port
PSH reg       Push register to stack
POP reg       Pull register from stack

I have optimized the instruction set a lot, so programming becomes convenient and efficient. The Cross Assembler "myca" provides some special macro instructions to make programming even more convenient:

Instruction   Function
ADD #         Add immediate value to ACCU
AND #         Perform AND operation on ACCU and immediate value
OR #          Perform OR operation on ACCU and immediate value
SUB #         Subtract immediate value from ACCU (with carry)
XOR #         Perform XOR operation on ACCU and immediate value
CLC           Clear (carry) FLAG
SEC           Set (carry) FLAG

If you are interested in a full description of the registers and the instruction set, please read the MyNOR-Instruction-Set documentation. "
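A rough C model of the machine state described above: an accumulator, a single FLAG that doubles as the carry (as the CLC/SEC macros suggest), 256 registers living in RAM, and a 256-byte stack with an 8-bit stack pointer. Only a handful of instructions are sketched. The names mirror the table, but the exact semantics (how ROL/ROR use FLAG, which way the stack grows) are assumptions, not taken from the MyNOR-Instruction-Set documentation, and `add16` is a hypothetical helper showing why ADD works "with carry".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a MyNOR-like machine; not the real encoding. */
typedef struct {
    uint8_t accu;        /* accumulator (ACCU) */
    uint8_t flag;        /* single FLAG, also the carry: always 0 or 1 */
    uint8_t reg[256];    /* 256 8-bit registers, held in RAM on MyNOR */
    uint8_t stack[256];  /* 256-byte stack memory */
    uint8_t sp;          /* 8-bit stack pointer, wraps within the stack */
} Cpu;

void op_lda(Cpu *c, uint8_t r) { c->accu = c->reg[r]; }  /* LDA reg */
void op_sta(Cpu *c, uint8_t r) { c->reg[r] = c->accu; }  /* STA reg */
void op_clc(Cpu *c) { c->flag = 0; }                     /* CLC macro */
void op_sec(Cpu *c) { c->flag = 1; }                     /* SEC macro */

/* ADD reg: ACCU = ACCU + reg + FLAG; the carry-out goes back into FLAG. */
void op_add(Cpu *c, uint8_t r)
{
    unsigned sum = (unsigned)c->accu + c->reg[r] + c->flag;
    c->accu = (uint8_t)sum;
    c->flag = (uint8_t)(sum >> 8);
}

/* ROL reg: rotate left through FLAG (assumed 6502-style semantics). */
void op_rol(Cpu *c, uint8_t r)
{
    uint8_t v = c->reg[r];
    c->reg[r] = (uint8_t)((v << 1) | c->flag);
    c->flag = (uint8_t)(v >> 7);
}

/* ROR reg: rotate right through FLAG (same assumption). */
void op_ror(Cpu *c, uint8_t r)
{
    uint8_t v = c->reg[r];
    c->reg[r] = (uint8_t)((v >> 1) | (c->flag << 7));
    c->flag = (uint8_t)(v & 1);
}

/* PSH/POP reg: stack assumed to grow downward; the 8-bit sp wraps. */
void op_psh(Cpu *c, uint8_t r) { c->stack[c->sp--] = c->reg[r]; }
void op_pop(Cpu *c, uint8_t r) { c->reg[r] = c->stack[++c->sp]; }

/* Hypothetical helper: 16-bit add of register pairs (x, x+1), low byte
   first; the result replaces the pair starting at a. Carry chains via FLAG. */
void add16(Cpu *c, uint8_t a, uint8_t b)
{
    op_clc(c);
    op_lda(c, a);     op_add(c, b);     op_sta(c, a);      /* low bytes */
    op_lda(c, a + 1); op_add(c, b + 1); op_sta(c, a + 1);  /* high bytes + carry */
}
```

Chaining two 8-bit ADDs through FLAG is the usual way an 8-bit machine adds 16-bit numbers: here `add16` on 0x01FF + 0x0001 leaves 0x0200 in the register pair.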

---

"A GC sweep on a phone with 128k of heap is a very different thing than a desktop with a multi-GB heap." [1]

Boris Chuprin @noop_dev Replying to @Simon_Fe1 I believe Java was a part of the early IoT vision - a language for creating safe downloadable code for cheap 32-bit single-CPU appliances and thin clients. https://en.m.wikipedia.org/wiki/Green_threads I personally believe they should have left 1 OS thread/VM limitation and made VMs communicate via MP
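The heap-size point can be illustrated with a toy mark-and-sweep collector in C: marking touches only reachable objects, but the sweep walks every slot, so its cost grows with the size of the heap rather than with the number of live objects. This is a minimal sketch under simplifying assumptions (a fixed array of `HEAP_SLOTS` objects, each holding at most one reference); `HEAP_SLOTS`, `gc_alloc` and `collect` are hypothetical names, not any real VM's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define HEAP_SLOTS 1024  /* hypothetical heap size, in objects */
#define NIL (-1)         /* "no reference" */

typedef struct {
    bool in_use;   /* slot currently holds an object */
    bool marked;   /* reached during the mark phase */
    int next;      /* the single object this one references, or NIL */
} Obj;

static Obj heap[HEAP_SLOTS];

/* Allocate by scanning for a free slot; returns NIL when the heap is full. */
int gc_alloc(int next)
{
    for (int i = 0; i < HEAP_SLOTS; i++) {
        if (!heap[i].in_use) {
            heap[i] = (Obj){ .in_use = true, .marked = false, .next = next };
            return i;
        }
    }
    return NIL;
}

/* Mark: follow the reference chain from one root; cost ~ reachable objects. */
static void mark(int idx)
{
    while (idx != NIL) {
        assert(idx >= 0 && idx < HEAP_SLOTS);
        if (heap[idx].marked)
            break;               /* already visited (handles cycles) */
        heap[idx].marked = true;
        idx = heap[idx].next;
    }
}

/* Sweep: visits every slot, so cost ~ heap size. Returns slots freed. */
static size_t sweep(void)
{
    size_t freed = 0;
    for (int i = 0; i < HEAP_SLOTS; i++) {
        if (heap[i].in_use && !heap[i].marked) {
            heap[i].in_use = false;
            freed++;
        }
        heap[i].marked = false;  /* reset for the next collection */
    }
    return freed;
}

size_t collect(int root)
{
    mark(root);
    return sweep();
}
```

Growing `HEAP_SLOTS` makes every `sweep()` proportionally longer even if only a few objects are live, which is the difference between a 128k phone heap and a multi-GB desktop heap.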

---

https://www.youtube.com/watch?v=7ybybf4tJWw Doom on an IKEA TRÅDFRI lamp! IKEA TRÅDFRI RGB GU10 lamp (IKEA model: LED1923R5). https://next-hack.com/index.php/2021/06/12/lets-port-doom-to-an-ikea-tradfri-lamp/

Cortex-M33 at 80 MHz, 108 kB RAM, 1 MB internal flash, 8 MB external dual-SPI flash, Silicon Labs MGM210L RF module

---

dbcurtis on Aug 13, 2019

These have their uses.

A friend who frequently does contract development in the toy space has (or at least used to have) a favorite go-to MCU that costs under $0.06 in bare die. It is essentially a 6502 with 100 bytes of RAM and a metric butt-load of mask-programmable ROM. It was originally designed for greeting cards. He has designed it into toys.

It is hard to use: you need a dev kit and a good relationship with the distributor to get the documentation. It only makes sense in high-volume products, since it comes as passivated bare die, so assembly requires a die-bonder and an epoxy encapsulation depositor.

Not for everyday use. But as my friend says: “You haven’t lived until you have spent an entire afternoon arguing over $0.05 on the BOM.”

jerryr on Aug 13, 2019

That sounds very similar to my experience with toy development. For a toy that played a bunch of pre-recorded sounds, we used a 4-bit Winbond MCU (their MCU division is now Nuvoton) that had a tiny bit of RAM and a ton of mask ROM. Firmware development was done in assembly and targeted a huge (physically large) emulator for test/debug. When we were satisfied with the firmware, we'd send it off to our CM, who would then order the parts with our FW in ROM. They'd get back bare die parts, which were wire bonded to the PCB and then epoxied over (that miserable "glop top" packaging, which is the bane of many teardowns). Development was a bit painful, but high volume production was extremely cheap.

Edit: Oops. I conflated projects. The toy project actually used a SunPlus MCU, not a Winbond MCU. It was an 8-bit RISC CPU running at 5 MHz with 128 bytes RAM and 256KB mask ROM. The ROM held both the program and audio samples. I don't recall what encoding was used for the audio.

nickpsecurity on Aug 13, 2019

Well, I'm curious which project used a 4-bitter. Jack Ganssle and Robert Cravatta did a survey a while back:

http://www.embeddedinsights.com/channels/2010/12/10/consider...

http://www.ganssle.com/rants/is4bitsdead.htm

The two examples were timepiece designs and the Gillette Fusion ProGlide. On top of getting yours, I'm curious whether any of the cheap MCUs in the article could today have met whatever your requirements were for a 4-bitter.

jerryr on Aug 13, 2019

It was also for a low-cost audio application, but it wasn't a toy. This was back in 2001 or so. The MCUs in this article all only have ~1KB ROM, which wouldn't have been enough for our audio samples. We needed >256KB. The "4-bitness" was just incidentally what Winbond offered with a large ROM at the time. However, the SunPlus that we later used in the toy also offered a large ROM with an 8-bit CPU for a similar cost. So, while I can't authoritatively say that 4-bit is dead, it does seem like there are a lot of alternatives in similar price ranges now.

andrehacker on Aug 13, 2019

Ah, yes, there was an article here a year back about the original Furby using that same configuration. The article actually had the annotated 6502 source code.

https://news.ycombinator.com/item?id=1775159

somesortofsystm on Aug 13, 2019

>6502

Without question, one of the nicest platforms to have in multitudes of thousands, at low energy and cost ..

kragen on Aug 14, 2019

Honestly I'd prefer an ARM or an 8086 or an AVR. I imagine I'd prefer a J1A too but haven't tried. The 6502 makes it a pain to do anything in any language higher-level than assembly, even C, and being 8-bit means you're constantly facing tradeoffs between making things fast or making them correct for more than 128 or 256 items.
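A small C sketch of the 8-bit tradeoff described above: an 8-bit counter is the cheap choice on an 8-bit CPU but silently wraps after 255, while a 16-bit counter is correct but needs extra instructions per increment to carry into the high byte. The function names are hypothetical. With a signed 8-bit type the safe range shrinks to 127, which is presumably where the "128" comes from (and in C, overflowing a signed type is undefined behavior, so the sketch sticks to unsigned types).

```c
#include <assert.h>
#include <stdint.h>

/* 8-bit counter: one register and a single increment on an 8-bit CPU,
   but the count wraps modulo 256 once more than 255 items match. */
uint8_t count_nonzero_u8(const uint8_t *items, uint16_t n)
{
    uint8_t count = 0;
    for (uint16_t i = 0; i < n; i++)
        if (items[i] != 0)
            count++;            /* 255 + 1 wraps to 0 */
    return count;
}

/* 16-bit counter: correct up to 65535 items, but on an 8-bit CPU each
   increment needs extra instructions to carry into the high byte. */
uint16_t count_nonzero_u16(const uint8_t *items, uint16_t n)
{
    uint16_t count = 0;
    for (uint16_t i = 0; i < n; i++)
        if (items[i] != 0)
            count++;
    return count;
}
```

With 300 nonzero items, the 8-bit version returns 44 (300 mod 256) while the 16-bit version returns 300.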

mastax on Aug 13, 2019

Some Philips Sonicare toothbrushes use(d) a 4-bit microcontroller from an obscure Swiss company. (From memory, since I can't find the EEVblog video teardown:) 52 bytes of RAM, custom-size mask ROM, ? kilohertz clock speed. It makes sense, they just needed a timer for the "2 minutes of brushing is up" feature, and maybe some battery management. It still surprised me that it was worth the hassle to save a few cents, even if they sell millions of the things. They must make insane margins: $80 for a vibration motor and $25 brush refills.

kens on Aug 14, 2019

I did a teardown of a Sonicare toothbrush that used an 8-bit PIC 16F1516 microcontroller. There's a lot more going on in the toothbrush than I expected. I expected a simple motor, but there's a mechanically-complex resonant coil mechanism, driven by an H-bridge. There's some expensive manufacturing in there. Another interesting thing was the toothbrush has a "pressure sensor" to tell if you're brushing too hard, but it's really a Hall-effect sensor.

http://www.righto.com/2016/09/sonicare-toothbrush-teardown.html

kken on Aug 13, 2019

Maybe one of these?

https://www.emmicroelectronic.com/catalog?title=&term_node_tid_depth=29

EM Microelectronic is actually not so obscure. They belong to the Swatch Group and specialize in ultra-low-power analog and mixed-signal circuits. Obviously, first for watches.

fpgaminer on Aug 13, 2019

Are you thinking of this Braun teardown? https://www.youtube.com/watch?v=JJgKfTW53uo That indeed uses a 4-bit micro from a Swiss company (The only sonicare teardown I found was a forum post)

mastax on Aug 13, 2019

Yep, that must be it. Thanks!

kragen on Aug 13, 2019

Designs like the GreenArrays