proj-jasper-jasperLowEndTargetsNotes2

---

https://www.spark.io/

    STM32F103 microcontroller
    ARM Cortex M3 architecture
    32-bit 72Mhz processor
    128KB of Flash, 20KB of RAM

i think that has either no instruction cache or a 4K or 8K one, but i'm not at all sure.

the take-home for us is probably the amounts of flash and RAM. again, it would be nice to fit the main interpreter in 16k or less, and the upper limit is about 64k.

---

The E64G401 Epiphany-IV 64-core 28nm Microprocessor has 32KB local (but shared) memory per core (so 32KB x 64 = 2MB total).

http://en.wikipedia.org/wiki/Adapteva

http://www.adapteva.com/wp-content/uploads/2013/06/e64g401_datasheet_4.13.6.14.pdf

---

woah these are cheap:

https://en.wikipedia.org/wiki/Odroid

i think the Exynos 4412 has 32KB/32KB L1 Cache -- http://malideveloper.arm.com/develop-for-mali/development-platforms/hardkernel-odroid-u2-development-platform/

---

http://linuxgizmos.com/intel-unveils-tiny-x86-minnowboard-max-open-sbc/

Raspberry Pi: $25/$35
BeagleBone Black: $45
MinnowBoard SBC: $99

" tdicola 13 hours ago

link

This looks neat for people that want a cheap board to hack on embedded Linux. However for serious control of signal generation, acquisition, PWM, servos, etc. you really don't want to be running a multitasking OS. Something like the Beaglebone Black, with its dedicated 200mhz programmable units in addition to embedded Linux, is much more interesting for hackers and makers IMHO.

reply "

" stonemetal 6 hours ago

link

PRU-> programmable real time unit

BBB-> BeagleBone? Black

The BBB has an extra dual core processor that runs at 200Mhz. It is interesting because it is like the processor they teach you about in your intro to computer architecture classes, every instruction is a single cycle instruction. Since it is a co-processor(not running an OS but controllable from the BBB's OS) and execution of instructions is deterministic, it is a good choice for running hard real time code. "

" ah- 13 hours ago

link

I wouldn't call the minnowboard a microcontroller, it's more similar to other single board computers like the Pandaboard and the odroid boards. And 2GB are already common for such boards, so 4GB are really not far off.

reply "

"

outside1234 6 hours ago

link

Does anyone know how the performance on something like this stacks up to something like the Raspberry Pi?

reply

wmf 5 hours ago

link

A 1.4 GHz Silvermont must be many times faster than a 700 MHz ARM11.

reply "

"

kqr2 14 hours ago

link

Intel also has the Galileo board which is hardware and software pin-compatible with shields designed for the Arduino Uno* R3.

http://www.intel.com/content/www/us/en/intelligent-systems/g...

reply

makomk 11 hours ago

link

The Galileo's one of those boards where it's very important to pay attention to the fine print. For example, the GPIO controller is hanging off a relatively slow I2C port, so access to GPIO is much, much slower than even the lowest-end Arduino. Also, it's a modified 486 which takes multiple clock cycles to carry out many instructions that are single-cycle on modern ARM, so it's not as fast at arithmetic as the clock speed would suggest.

reply

tdicola 14 hours ago

link

Be careful though, the Galileo emulates AVR code and is orders of magnitude slower than a real Arduino. Don't expect to pick up any shield and make it work, unfortunately.

reply

jpwright 3 hours ago

link

The Galileo actually only emulates a subset of the Arduino libraries. The AVR libraries themselves are, for the most part, not supported. This makes many popular libraries unusable even when hardware is not an issue.

reply "

" elnate 14 hours ago

link

How does this (note: the MinnowBoard? SBC) compare to a Raspberry Pi?

reply

vonmoltke 9 hours ago

link

Comparing the $99 version to the B ($35):

Overall, probably worth the extra cost if you need the power and features. The question is, who does? I'm considering this for no other reason than I want a board in this form factor and power class that has SATA and PCIe.

reply

nullc 6 hours ago

link

The RPI is really obscenely slow, far slower than the clock rate would suggest even for an arm. The RPI is pretty exciting as a microcontroller, though it's power usage is very high, but as a computer it's a real disappointment.

The real comparison should be with the odroid boards: http://hardkernel.com/main/products/prdt_info.php?g_code=G13... a quad arm (cortex-a9) at 1.7GHz with 2GB ram for ~$60.

reply "

--

" Another note: In high school or my first year of college I told my dad that someday I'd own a 4K Data General NOVA. He said it cost as much as a down payment on an expensive house. I was stunned and told him I'd live in an apartment.

Why 4KB?

Because that was the minimum needed to run a higher level language. To me a computer had to have more than switches and lights. It had to be able to run programs.

" -- http://gizmodo.com/how-steve-wozniak-wrote-basic-for-the-original-apple-fr-1570573636/all

---

about micropython:

chillingeffect 1 hour ago

It appears from [0] that the chip of choice is the STM32F045RGT (datasheet [1]). This is from the Cortex M4f series, which includes such wonderful things as a hardware floating-point unit. That is wonderful news, although, this board appears to have no external memory, so it would be limited to 128kB.

[0] https://raw.githubusercontent.com/micropython/pyboard/master... [1] http://www.alldatasheet.com/datasheet-pdf/pdf/510587/STMICRO...

--

" Micro Python has the following features:

More info at:

http://micropython.org/

You can follow the progress and contribute at github:

www.github.com/micropython/micropython www.github.com/micropython/micropython-lib "

--

according to "dec-11-ajpb-d pdp-11 basic programming manual", available from http://bitsavers.trailing-edge.com/pdf/dec/pdp11/basic/DEC-11-AJPB-D_PDP-11_BASIC_Programming_Manual_Dec70.pdf , or as text at https://archive.org/stream/bitsavers_decpdp11baASICProgrammingManualDec70_5936477/DEC-11-AJPB-D_PDP-11_BASIC_Programming_Manual_Dec70_djvu.txt ,

" A. 2 USER STORAGE REQUIREMENTS

BASIC can be run in the minimal 4K PDP-11/20 configuration. With the BASIC program in core, and deducting space reserved for the Bootstrap and Absolute Loaders, approximately 450 words are left for total user storage (program storage plus working storage) . "

i believe this 4k is 4k WORDS, and each PDP-11 word is two bytes, so BASIC takes up most of 8k, with about 900 bytes (450 words) to spare; that is to say, BASIC (plus the loaders) takes about 7292 bytes, or just over 7k.
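spelling out that arithmetic (treating BASIC plus the loaders as one blob, per the quote):

    (4096 - 450)\ \text{words} \times 2\ \text{bytes/word} = 3646 \times 2 = 7292\ \text{bytes} \approx 7.1\text{k}, \qquad 450\ \text{words} \times 2 = 900\ \text{bytes to spare}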

--

according to http://lua-users.org/lists/lua-l/2007-11/msg00248.html , Lua was under 100k on cell phones, and according to http://www.lua.org/about.html , "Under Linux, the Lua interpreter built with all standard Lua libraries takes 182K...", and according to http://www.schulze-mueller.de/download/lua-poster-090207.pdf , Lua fit into 128k ROM.

http://www.luafaq.org/#T1.33 says "Embedding Lua will only add about 150-200K to your project, depending on what extra libraries are chosen. It was designed to be an extension language and it is straightforward to ensure that any user scripts operate in a 'safe' environment (see Sandboxing.) You do not even have to embed the compiler front-end of Lua, and just use the core with pre-compiled scripts. This can get the memory footprint down to about 40K."

and as noted above, there's also the eLua project:

http://www.eluaproject.net/

" It's hard to give a precise answer to this, because this is not only dependable on the footprint of eLua or it's resource requirements but on the final user applications as well. As a general rule, for a 32-bit CPU, we recommend at least 256k of Flash and at least 64k of RAM. However, this isn't a strict requirement. A stripped down, integer-only version of eLua can definitely fit in 128k of Flash and depending on your type of application, 32k of RAM might prove just fine. We have built eLua for targets with less than 10K RAM but you can't do much than blinking an LED with them. It really largely depends on your needs. "

note that instruction sizes affect things somewhat here. if you measure program size in instructions instead of bytes, then we have x86's variable-length instructions (1 to 15 bytes), compared with the PDP-11's one-to-three 16-bit words per instruction, and the Apple II 6502's 1-byte opcodes (1-3 bytes per instruction). For example, loading a small constant is 2 bytes on the 6502 (LDA #5), 4 bytes on the PDP-11 (MOV #5, R0), and 5 bytes on 32-bit x86 (mov eax, 5). And newer machines require more bits per address, so presumably the same number of instructions may take up more room on newer machines.

--

contiki:

http://www.wired.com/2014/06/contiki

" While Linux requires one megabyte of RAM, Contiki needs just a few kilobytes to run. Its inventor, Adam Dunkels, has managed to fit an entire operating system, including a graphical user interface, networking software, and a web browser into less than 30 kilobytes of space. That makes it much easier to run on small, low powered chips–exactly the sort of things used for connected devices–but it’s also been ported to many older systems like the Apple IIe and the Commodore 64. "

--

interestingly, at the time Java was introduced, 'Java: an overview' says:

"The size of the basic interpreter and class support is about 30K bytes, adding the basic standard libraries and thread support (essentially a self-contained microkernel) brings it up to about 120K. "

--

http://www.erlang.org/faq/implementations.html

" 8.9 Is Erlang small enough for embedded systems?

..

 Rule of thumb: if the embedded system can run an operating system like linux, then it is possible to get current implementations of Erlang running on it with a reasonable amount of effort.

Getting Erlang to run on, say, an 8 bit CPU with 32kByte of RAM is not feasible.

People successfully run the Ericsson implementation of Erlang on systems with as little as 16MByte of RAM. It is reasonably straightforward to fit Erlang itself into 2MByte of persistant storage (e.g. a flash disk).

 A 2MByte stripped Erlang system can include the beam emulator and almost all of the stdlib, sasl, kernel, inets and runtime_tools libraries, provided the libraries are compiled without debugging information and are compressed: "

--

old Apple ][ 5.25inch floppy disks were apparently 140k per side.

Later there were Mac 3.5 inch "floppy" disks in hard plastic shells that were apparently 400k, 800k (double-sided media) or 1.44 MB (double-sided, high-density).

--

"tinyScheme, which is a BSD licensed, very small, very fast implementation of Scheme that can be compiled down into about a 20K executable if you know what you’re doing."

--

(i already read this): http://www.digikey.com/en/articles/techzone/2012/jun/low-power-16-bit-mcus-expand-the-application-space-between-8--and-32-bit-options

---

the CPU in my LG-D520 phone is a Krait, which apparently is similar to the ARM Cortex-A15: a 4Ki L0 icache and a 4Ki L0 dcache, a 16Ki L1 icache and a 16Ki L1 dcache (L0 and L1 per core, i assume), and a 1Mi L2 cache (http://en.wikipedia.org/wiki/Krait_%28CPU%29 , http://ixbtlabs.com/articles3/mobile/snapdragon-s4-p2.html)

my previous phone was a Motorola Atrix (was it the Atrix 4g? MB860? probably. the following assumes that). ARM A9 http://www.anandtech.com/show/4165/the-motorola-atrix-4g-preview/5 http://en.wikipedia.org/wiki/Tegra#Tegra_2 http://www.nvidia.com/object/tegra-superchip.html with 32Ki / 32Ki L1 cache per core, and 1Mi L2 cache

before that was a G1, ARM A11, MSM7201A . can't get good numbers on L1 cache but http://forum.beyond3d.com/showpost.php?p=1552966&postcount=47 says "The MSM7201A did not just lack L2 cache - it didn't even have a FPU! All FP operations were done in software as if we were still in the 1980s. I'm not sure how significant this was for most handheld applications which are very integer-centric, but it does make it hard to judge the real benefit of the 256KB L2. I'm also not sure how much L1 cache the MSM7201A had - since they were penny pinching on everything else, I wouldn't be surprised if it was only 16/16KB."

currently there is a new budget smartphone OS, Firefox OS. Looking thru https://www.mozilla.org/en-US/firefox/os/devices/ for devices where the CPU is listed, we have the Qualcomm Snapdragon MSM7227A, Qualcomm Snapdragon MSM7225A, Qualcomm Snapdragon 200, and MSM8210.

Out of these, the lowest end one listed on http://en.wikipedia.org/wiki/Snapdragon_%28system_on_chip%29 appears to be the MSM7225, with ARM11 (ARMv6; i think ARM11 is earlier/worse than Cortex A5). According to http://tomkanok.wordpress.com/2011/07/13/qualcomm-msm7x25-msm7225-msm7625-and-msm7x27-msm7227-msm7667-cpu-in-mobile-phones/ these have 16+16Ki L1 caches.

so, if 16KiB is the lowest we see here, what has/had an 8KiB L1 cache?

http://superuser.com/questions/72209/why-has-the-size-of-l1-cache-not-increased-very-much-over-the-last-20-years says an Intel i486 had 8 KiB. http://en.wikipedia.org/wiki/CPU_cache#Example confirms and adds that each cache block was 64 bytes.

In fact, the first mention in the History section of that Wikipedia page of an L1 cache with a specified size is the i486 8Ki cache: http://en.wikipedia.org/wiki/CPU_cache#In_x86_microprocessors . http://www.karbosguide.com/hardware/module3b2.htm thinks it was first.

so, in sum: everything smartphone-capable these days seems to have 16Ki or better L1 cache. In the past, the first L1 cache seemed to be 8 Ki, on the 486. Some things have a 4k "L0 cache", which apparently is an ill-defined term ( http://forum.beyond3d.com/showthread.php?t=54666 ).

the intel quark has a 16k "cache".

"The Cortex-M0, Cortex-M0+, Cortex-M1, Cortex-M3, and Cortex-M4 processors do not have any internal cache memory."

---

apparently a lot of systems have 4k pages.

http://stackoverflow.com/questions/11543748/why-is-the-page-size-of-linux-x86-4-kb-how-is-that-calcualted

---

a guy on the SENSORICA mailing list is using this chip for a soil moisture tracker:

http://www.atmel.com/devices/attiny85.aspx

512 bytes RAM, 8k flash

the chip costs about $1-$3

i'm only noting this because, although anecdotal, it's an example of a real-world very low end use-case that i randomly heard about (i think it's not the lowest-end chip in its series, either, suggesting that someone thinks a real world application needs either 512 bytes RAM or 8k flash (or some other spec that the manufacturer thought would go well with those amounts))

--

looks like consumer-priced massively parallel computers are still not available. Afaict the Parallella project is only contemplating a 64-core board for $100, and that's the only one out there. Similarly, http://en.wikipedia.org/wiki/Intel_MIC has 32 cores. Some http://en.wikipedia.org/wiki/Nvidia_Tesla models at least offer on the order of 2048 cores -- but for a price of $3000.

so we're not getting much below $1/core yet in any offering. To get 64k cores for about $2000 we'd need about $0.03/core; that's on the order of $0.01/core, so let's just say we need "a penny per core". Actually, that makes sense: i said $2000 because a whole computer can cost $2000, but the CPU in that computer is much cheaper, on the order of $200 (retail), so even $600 just for the processors is asking a lot.

If we have 64k processors at a penny each, that's $655.36. At that point, enough hobbyists will be able to purchase one for applications to start being discovered at a reasonable rate.
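the arithmetic, spelled out:

    \$2000 / 2^{16}\ \text{cores} \approx \$0.031/\text{core}, \qquad 2^{16} \times \$0.01 = \$655.36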

The Parallella 64-core is built off of this chip:

http://www.adapteva.com/epiphanyiv/

which has "2 MB On-Chip Distributed Shared Memory". They say:

" Memory System: The Epiphany memory architecture is based on a flat memory map in which each compute node has a small amount of local memory as a unique addressable slice of the total 32-bit address space. A processor can access its own local memory and other processors memory through regular load/store instructions, with the only difference being the latency and effective throughput of the transactions. The local memory system is comprised of 4 separate banks, allowing for simultaneous memory access by the instruction fetch engine, local load-store instructions, and by load/store transactions initiated by other processors within system. "

so that's 32k per core

the new parallax propeller looks more minimal:

http://forums.parallax.com/showthread.php/155132-The-New-16-Cog-512KB-64-analog-I-O-Propeller-Chip

(32k per processor)

so it's anyone's guess how much k/core we'll have when we have 64k cores for $600, but between 8k and 32k is a good guess; more likely we can assume 32k.

so the Jasper runtime, including the really core libraries, shouldn't take more than half of this, 16k. sheesh, that's small. Still, it's double what the old PDP Basic version had to work with (slightly more, 'cause i think user memory had to fit in 8k along with the interpreter on that one). https://www.google.com/search?q=+basic+8k shows various BASIC versions fit in 8k. There's even some 4k BASICs: https://www.google.com/search?q=basic+4k

this suggests that if a Jasper VM or Jasper Assembly has (or initially has) fixed pseudo-pointer sizes, 16-bit pseudo-pointers will be more than enough (especially since, if we make our unboxed primitive data elements a uniform size larger than 1 byte, 16k bytes of memory holds fewer than 16k objects; e.g. 16k 16-bit objects would take 32k bytes).

so, a 16-bit word size, and a corresponding 2^16 = 64k pseudo-memory size (ie limits such as no more than 64k local variables in a function, etc), seems reasonable for Jasper Assembly.

a parallax propeller cog is a 32-bit CPU, btw. So if our VM is 16 bits, we're undershooting that. Really, i just like 16 because 2^(2^2) = 16.

The Lua 5.1 VM's 3-operand instruction format has operands of only 9 and 8 bits, so this is already bigger than that (although the 2-operand format pairs an 8-bit operand with an 18-bit operand).
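to make the operand-width comparison concrete, here's a C sketch that decodes a real Lua 5.1 instruction word (layout from lopcodes.h: 6-bit opcode, A=8 bits, B=C=9 bits, Bx=18 bits) next to a purely hypothetical Jasper Assembly encoding with uniform 16-bit operands (the Jasper format and names are invented for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* Lua 5.1 iABC layout: [B:9][C:9][A:8][OP:6], opcode in the low bits */
    #define LUA_OP(i)  ((i) & 0x3F)
    #define LUA_A(i)   (((i) >> 6)  & 0xFF)
    #define LUA_C(i)   (((i) >> 14) & 0x1FF)
    #define LUA_B(i)   (((i) >> 23) & 0x1FF)
    #define LUA_BX(i)  (((i) >> 14) & 0x3FFFF)  /* iABx format */

    /* Hypothetical Jasper Assembly word: three 16-bit operands can each index
       all of a 64k pseudo-memory, at the cost of a 64-bit instruction word.
       (Invented for illustration, not a real Jasper format.) */
    typedef struct { uint16_t op, a, b, c; } jasper_insn;

    static jasper_insn jasper_decode(uint64_t w) {
        jasper_insn i = { (uint16_t)w, (uint16_t)(w >> 16),
                          (uint16_t)(w >> 32), (uint16_t)(w >> 48) };
        return i;
    }

    int main(void) {
        uint32_t lua = (5u << 23) | (7u << 14) | (3u << 6) | 12u; /* B=5 C=7 A=3 OP=12 */
        printf("lua:    op=%u a=%u b=%u c=%u\n", (unsigned)LUA_OP(lua),
               (unsigned)LUA_A(lua), (unsigned)LUA_B(lua), (unsigned)LUA_C(lua));
        jasper_insn j = jasper_decode(0x0004000300020001ULL);     /* op=1 a=2 b=3 c=4 */
        printf("jasper: op=%d a=%d b=%d c=%d\n", j.op, j.a, j.b, j.c);
        return 0;
    }

the price of uniform 16-bit operands is a 64-bit instruction word, twice Lua's 32 bits, so code density is worth weighing against the 16k budget above.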

--

" Spreadtrum Communications Inc. (Shanghai) announced it can supply its SC6821 baseband processor as a part of a reference design for a $25 smartphone that runs the Firefox operating system.

Spreadtrum and Mozilla have integrated the Firefox OS with several of Spreadtrum's WCDMA and EDGE smartphone chipsets, including the SC6821, which is thought to only support 2/2.5G.

In its press statement, Spreadtrum did not provide any technical details of the SC6821 or indicate how it differs from the previously announced SC6820 or SC6825. These are single- and dual-core Cortex-A5 based chips with Mali-400 GPUs, respectively. The SC6825 has 32-kbyte instruction and data caches and a 256-kbyte L2 cache.

The SC6821 is described as having a "low memory configuration" and a "high level of integration."

It will allow handset makers to create a phone with a 3.5-inch HVGA touchscreen, WiFi, Bluetooth, FM radio and camera functions all controlled and accessed via the Firefox OS but at prices similar to much more minimally featured budget feature phones. "

http://pdadb.net/index.php?m=cpu&id=a6821&c=spreadtrum_sc6821

it's a 32-bit ARM Cortex-A5 MPcore (ARMv7-A ISA) system

in other words, the icache and dcache each hold 8k words (ea. word is 32 bits, or 4 bytes). Somewhat unrelated: note that 64k bits is 8k bytes; but here we are talking about 8k words.

--

could also look at the resources available to each processor or even each 'thread' within the GPU in low-end GPGPU systems. Intel's integrated graphics (i.e. GPUs integrated into the package or die of the CPU) apparently started supporting the OpenCL standard with the Ivy Bridge system generation, of which a lower-end part was the HD 2500

NOTE: I DONT KNOW ANYTHING ABOUT GPUS OR GPGPU OR OPENCL YET, MY UNDERSTANDING OF THESE NUMBERS IS SEVERELY LACKING AND SO SOME OF THE FOLLOWING INTERPRETATIONS MAY BE WRONG!

https://compubench.com/device-info.jsp?config=12921360 :

CL_DEVICE_IMAGE2D_MAX_HEIGHT 16384
CL_DEVICE_IMAGE2D_MAX_WIDTH 16384
CL_DEVICE_IMAGE3D_MAX_DEPTH 2048
CL_DEVICE_IMAGE3D_MAX_HEIGHT 2048
CL_DEVICE_IMAGE3D_MAX_WIDTH 2048
CL_DEVICE_LOCAL_MEM_SIZE 65536
CL_DEVICE_MAX_COMPUTE_UNITS 6
CL_DEVICE_MAX_PARAMETER_SIZE 1024
CL_DEVICE_MAX_READ_IMAGE_ARGS 128
CL_DEVICE_MAX_SAMPLERS 16
CL_DEVICE_MAX_WORK_GROUP_SIZE 256
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS 3
CL_DEVICE_MAX_WORK_ITEM_SIZES (256,256,256)

And for the ARM Mali-T604, which is in the Google Nexus 10,

https://jogamp.org/bugzilla/attachment.cgi?id=581&action=edit

CL_DEVICE_LOCAL_MEM_SIZE: 32768
CL_DEVICE_MAX_COMPUTE_UNITS: 4
CL_DEVICE_IMAGE2D_MAX_HEIGHT: 65536
CL_DEVICE_IMAGE2D_MAX_WIDTH: 65536
CL_DEVICE_IMAGE3D_MAX_DEPTH: 65536
CL_DEVICE_IMAGE3D_MAX_HEIGHT: 65536
CL_DEVICE_IMAGE3D_MAX_WIDTH: 65536
CL_DEVICE_MAX_PARAMETER_SIZE: 1024
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_SAMPLERS: 16
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: [256, 256, 256]
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8

http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf :

" The maximum number of threads (256) stems from the fact that each thread can use 4 registers from a register bank that contains 1024 registers. The larger the number of registers used by the kernel, the fewer the concurrent threads. So if a kernel uses 8 registers, only a maximum of 128 threads can run in parallel. If there are enough threads to hide latency there should be no performance implication of using more registers. "

(so i guess if you wanted 128 registers, you could have 8 threads)
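the rule implied by that FAQ quote (a 1024-register bank divided evenly among threads):

    \text{max threads} = 1024 / \text{registers per thread}: \quad 1024/4 = 256, \quad 1024/8 = 128, \quad 1024/128 = 8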

according to http://www.realworldtech.com/ivy-bridge-gpu/4/ , Ivy Bridge has 32KB L1 icaches, 8 threads per core (is this comparable to the 10 EUs in Haswell, see below? probably not, see http://en.wikipedia.org/wiki/Intel_HD_and_Iris_Graphics#Ivy_Bridge ; maybe this is comparable to the SIMD parallelism, see below, in which case i'd hesitate to call it a 'thread'; or maybe this is the number of sub-slices per slice), and 1k general registers (32k of memory)

see also http://events-tce.technion.ac.il/files/2013/07/Michael.pdf

" Vertex shading commonly uses SIMD4x2, with 4 data elements from 2 vertices. Pixel shading is SIMD1x8 or SIMD1x16 (aka SIMD8 or SIMD16), operating on a single color from 8 or 16 pixels simultaneously. Media shaders are similar to pixel shaders, except they are packed even more densely with 8-bit data, rather than the 32-bit data used in graphics shaders. To support all these different execution modes, the GRF is incredibly versatile.

Registers are each 256-bits wide, which is perfectly suited for SIMD2x4 or SIMD8. In a 16B aligned mode, instructions operate on 4-component RGBA data, with source swizzling and destination masking. In a 1B aligned mode, instructions use region-based addressing to perform a 2-dimension gather from the register file and swizzling and destination masking are disabled. This is critical for good media performance, where 1B data is packed together for maximum density. Collectively, these two addressing modes also simplify converting from AOS to SOA data structures.

Each thread is allocated 128 general purpose registers, so the GRF has expanded to 32KB to handle 8 threads. The GRF has also been enhanced to handle larger 8B accesses that are necessary for double precision computation. "

in other words, each thread has 4k bytes worth of registers available to it; but this is in the form of only 128 256-bit registers.

http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/2

looks like the fundamental component is the EU, which is grouped into 'sub-slices' (Intel), GCN compute units (AMD), or Kepler SMXes (Nvidia). there is a table on http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/2 showing their capabilities in the Haswell system generation. EUs are SIMD units whose parallelism ranges from 8-wide (actually, 2x4 dual issue) to 32-wide; and each subslice/CU/SMX has from 4-10 EUs (4-10 SIMD units), for a total of 64-192 ALUs per subslice/CU/SMX

there are up to 4 sub-slices total, or 40 EUs in Haswell (for Intel HD 2000, it was 6 EUs, for Intel HD 3000, it was 12 EUs), with peak FP ops per core/EU at around 16.

note that EUs are also called 'cores'?

so interesting numbers relating to amount of memory available per core are:

so assuming a word size of 256-bits, there we see numbers from 128 (4k bytes) to 512 (16k bytes), and up to 2048 (CL_DEVICE_LOCAL_MEM_SIZE in Intel HD 2500, divided by 32 bytes per word)

and numbers relating to parallelism hardware are:

so there we see numbers on the order of 8

and numbers relating to parallelism image size are:

so there we see numbers greater than 2048

and other numbers relating to parallelism in the OpenCL? environment are:

so there we see one 8 and a bunch of things close to 256

in summary, we see the following critical values (approximate):

http://stackoverflow.com/questions/3957125/questions-about-global-and-local-work-size says that the actual software-exposed parallelism is the 'global work size', but i don't understand how to find the max global work size for a given GPU. Maybe you can't do that without actually creating the kernel, because it depends on exactly how much memory (even how many registers) the kernel is using, etc. http://www.khronos.org/message_boards/showthread.php/9207-Determine-global_work_size suggests this is on the order of 2^17. See also http://www.khronos.org/message_boards/showthread.php/6060-clEnqueueNDRangeKernel-max-global_work_size , https://devtalk.nvidia.com/default/topic/477978/questions-about-global-and-local-work-size/.
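to make the 'you need the kernel first' point concrete, here is a minimal C sketch (assuming you already hold a valid cl_device_id and a built cl_kernel; error checking omitted). OpenCL has no direct "max global work size" query; you get per-dimension bounds from the device, and the practical per-kernel group size only after compilation:

    #include <stdio.h>
    #include <CL/cl.h>

    /* Print device-level and kernel-level work-size limits. */
    void print_work_size_limits(cl_device_id dev, cl_kernel kernel)
    {
        size_t group_max, item_sizes[3], kernel_group_max;
        cl_ulong local_mem;
        cl_uint compute_units;

        clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof compute_units, &compute_units, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof local_mem, &local_mem, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof group_max, &group_max, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                        sizeof item_sizes, item_sizes, NULL);

        /* the kernel-specific limit exists only after the kernel is built,
           because it depends on the kernel's register/local-memory usage */
        clGetKernelWorkGroupInfo(kernel, dev, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof kernel_group_max, &kernel_group_max, NULL);

        printf("compute units:           %u\n", compute_units);
        printf("local mem (bytes):       %llu\n", (unsigned long long)local_mem);
        printf("device max group size:   %zu\n", group_max);
        printf("max work items per dim:  (%zu, %zu, %zu)\n",
               item_sizes[0], item_sizes[1], item_sizes[2]);
        printf("this kernel's max group: %zu\n", kernel_group_max);
    }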

http://people.maths.ox.ac.uk/gilesm/cuda/new_lectures/lec1.pdf says that (higher-end?) GPUs typically have on the order of 1024 (2^10) cores and 128-384 registers per core, 8192 threads (but i think it says these are 32-way SIMD, so really that's 32 independent threads?), 64k shared memory, and can support at least ~2^17 global work size.

On the number of 'cores', http://www.cs.nyu.edu/~lerner/spring12/Preso07-OpenCL.pdf cautions: "An important caveat to keep in mind is that the marketing numbers for core count for NVIDIA and ATI aren't always a good representation of the capabilities of the hardware. For instance, on NVIDIA's website a Quadro 2000 graphics card has 192 "Cuda Cores". However, we can query the lower-level hardware capabilities using the OpenCL API and what we find is that in reality there are actually 4 compute units, all consisting of 12 stream multiprocessors, and each stream multiprocessor is capable of 4-wide SIMD, 192 = 4*12*4. In the author's opinion this makes the marketing material confusing, since you wouldn't normally think of a hardware unit capable only of executing floating point operations as a "core". Similarly, the marketing documentation for a HD6970 (very high end GPU from ATI at time of writing) shows 1536 processing elements, while in reality the hardware has 24 compute units (SIMD engines), and 16 groups of 4-wide processing elements per compute unit. 1536 = 24*16*4 . "

http://www.cs.nyu.edu/~lerner/spring12/Preso07-OpenCL.pdf also says "Clearly, for small matrices the overhead of OpenCL execution dominates the performance benefits of massively concurrent execution. For our measurements, below a matrix dimension of roughly 150 x 150 the simple multithreaded CPU code outperforms OpenCL"

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0538e/BABGJBFI.html says "The optimal global work size must be large if you want to ensure high performance. Typically the number is several thousand but the ideal number depends on the number of shader cores in your device. To calculate the optimal global work size use the following equation: global work size = <maximum_work-group_size> * <number of shader cores> * <constant>, where constant is typically 4 or 8 for the Mali-T604 GPU"

The Mali-T604 seems to have 4-8 shader cores:

" ARM announces 8-way graphics core

www.eetimes.com/document.asp?doc_id=1270826 EE Times, Nov 10, 2011 - The Mali-T658 design supports up to 8 shader cores, compared with the Mali-T604's four shader cores, and ARM has also doubled the number ... "

so that's 256*4*4 = 4096, which is within a factor of two of the 2048 number we keep seeing

random note: OpenCL allows the specification of a dependency graph of 'tasks', but 'kernels' executing in different tasks, although they may actually be executing in parallel, cannot synchronize and therefore should not try to share memory (http://www.rtcmagazine.com/articles/view/102645)

so what about the maximum local work group size? From the CL_DEVICE_ parameters, both the Intel HD 2500 and the ARM Mali-T604 seem to allow 256. But http://malideveloper.arm.com/downloads/OpenCL_FAQ.pdf says that this holds only if just 4 registers are used per thread? So i guess that means that if we use 128 registers per thread, then we only have a max local work group size of 8? So let's put this at 'at least 8', ranging from 8 to 256.

Compare all this to the Cell's SPEs ( http://en.wikipedia.org/wiki/Cell_(microprocessor) ), each of which has 256KB of local store.

Interestingly, the OpenCL work group size is the limit on the number of processes that can synchronize and pass information between each other using the shared local memory (GPU-local memory, but not per-processor). This is roughly the same order of magnitude achieved in multiprocessor systems with a shared memory (e.g. dual-core (4 threads with hyperthreading), quad-core (8 threads), the 8 SPEs in the Cell processor, the 32-48 cores in the initial Larrabee ( http://en.wikipedia.org/wiki/Larrabee_%28microarchitecture%29 ) (maybe multiply that by 4 because each Larrabee core could run 4 threads, so 128-192 threads), etc.). Note that with larger numbers of registers used, at least in the ARM Mali-T604, this number drops below 256; in fact 256 holds only when the number of registers used is very small (4): if 8 registers are used, you have 128 threads; if 16 registers, 64 threads; if 32 registers, 32 threads; etc. So this suggests that, in general, these numbers may be the orders of magnitude to which coherent shared-memory multiprocessing scales: between 8 and 256 independent threads. Beyond this, perhaps supporting atomic, highly consistent (are we talking sequential consistency?) shared memory access becomes inefficient, and it becomes more efficient to resort to a more relaxed memory consistency model, or to alternative paradigms such as task-dependency graphs and message-passing for IPC, as well as non-communicating data-parallelism for operations which don't need further IPC ("kernels", i guess).

http://www.slideshare.net/mikeseven/imaging-on-embedded-gp-us-bamm-meetup-20131219 similarly says although there is a 256-thread limit, that's 64 in practice

http://www.slideshare.net/mikeseven/imaging-on-embedded-gp-us-bamm-meetup-20131219 says the Adreno 330 GPU on the Qualcomm MSM8974 has 128 bit registers, 8k local memory per core, 512 work items max, 1.5MB on-chip RAM, and says the ARM Mali T604 has 32k local memory per core.

parallax propeller 2: http://www.rayslogic.com/Propeller2/Propeller2.htm 8 cogs "P2 has 128 kB of HUB RAM, P1 has 32 kB. Both P2 and P1 have 512 longs of COG RAM, but P2 has an additional 256 longs of stack RAM in each cog."

so we can distinguish a few architectural levels here:

note: the number of data items in shared 'local' memory can be relied upon to be at least 512 ('local' meaning not local to each processor, but local to the GPU as opposed to ordinary main memory): that's the 16k minimum value for CL_DEVICE_LOCAL_MEM_SIZE in OpenCL 1.0, divided by the 256-bit (32-byte) register width from Intel HD. However, OpenCL 1.2 raises the minimum to 32k, and in fact the devices i looked at above all seem to have at least that much (and that's also a common cache size seen these days), and many devices have 128-bit registers instead of 256-bit; so each individual device seems to have space for at least 2048 items in local memory, which is another number we saw a lot in the Intel HD OpenCL environment. And even pretty old computers had 4k of memory, which is 2048 words if you have 16-bit words. So let's assume 2048.
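the item-count arithmetic behind the 512 and 2048 figures:

    16384\ \text{B} / 32\ \text{B/item} = 512, \qquad 32768\ \text{B} / 16\ \text{B/item} = 2048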

so we can rely on at least approximately:

now 128 registers seems excessive, often we'll want more synchronizable parallelism at the expense of less memory, so let's cut that down:

in summary, we can rely on at least approximately:

in bits:

rounding up (to get bit width for addressing for a fixed-length instruction set) and down to powers of 2 (to get the minimal numbers that we can rely upon), in bits:

so max bit width for fixed length addressing:

note: this argues for:

and min that we can rely upon:

in summary, we can rely on at least approximately:

CONCLUSION/NOTE: BASED ON THESE NUMBERS, IT LOOKS LIKE CONNECTION-MACHINE-STYLE ACTIVE-DATA PROGRAMMING COULD BE EMULATED BY PRESENT-DAY CONSUMER GPUs!! Even though there are only 2 CPUs in lower-end consumer computers and ~8 GPU units, there's no need to wait for computers with 64k CPUs: since in the paradigm we are targeting, (virtual) processors only need to synchronize with their immediate neighbors, all 64k of them don't need to share one synchronizable local memory, so we can emulate them using data-parallelism.

SUGGEST RE-TARGETING JASPER AT OPENCL
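a minimal sketch of that idea in OpenCL C (hypothetical kernel, invented names; it only illustrates that neighbor-only synchronization maps onto the data-parallel model):

    /* Each work-item plays one virtual processor of a 1-D active-data machine.
       Neighbors inside a work-group synchronize through __local memory; across
       group boundaries we read from the previous step's buffer, so the host
       just swaps `in` and `out` and re-enqueues the kernel once per step.
       The host must size `tile` to the work-group size via clSetKernelArg. */
    __kernel void step_virtual_processors(__global const int *in,
                                          __global int *out,
                                          __local  int *tile)
    {
        size_t g = get_global_id(0);
        size_t l = get_local_id(0);
        size_t n = get_global_size(0);

        tile[l] = in[g];              /* publish my cell to my work-group */
        barrier(CLK_LOCAL_MEM_FENCE); /* group-local synchronization      */

        /* neighbors: fast local memory within the group, read-only global
           memory across group edges (wrapping around at the ends) */
        int left  = (l > 0) ? tile[l - 1] : in[(g + n - 1) % n];
        int right = (l + 1 < get_local_size(0)) ? tile[l + 1] : in[(g + 1) % n];

        out[g] = (left + right) / 2;  /* placeholder neighbor-only update rule */
    }

no global synchronization beyond the kernel-launch boundary is needed, which is exactly why 64k virtual processors don't require 64k real cores.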

---

interesting history/economics on integrated graphics: http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested

---

a picture while discussing L4 cache in Crystalwell's eDRAM:

http://www.anandtech.com/show/6993/intel-iris-pro-5200-graphics-review-core-i74950hq-tested/3

so memory hierarchy jumps after 32k, 256k, 4M. Also, the text notes that both Intel and Microsoft found that 32M was a good amount of eDRAM to have.

---

the iPhone 1 had 128MB of RAM

one earlier iphone ("2g") had 16kb l1 cache (samsung S3C6400) (arm11 32-bit, so 4k items of 4 bytes apiece)

http://www.doc88.com/p-013708140603.html

an earlier iphone had no L2 cache, but a later one had 256k (so 64k items)

http://tooth2.blogspot.com/2010_03_01_archive.html

---

http://users.ece.cmu.edu/~koopman/stack_computers/sec4_1.html

" 4.1 CHARACTERISTICS OF 16-BIT DESIGNS

The systems discussed here are 16 bits wide because that is the smallest configuration that makes sense for most commercial stack processor applications.

4.1.1 Good fit with the traditional Forth model

The primary motivating factor for making Forth machines 16 bits wide is that the Forth programming model has traditionally been 16 bits. This is consistent with average Forth program sizes of less than 32K bytes and the implementation of most of the first Forth compilers on microprocessors with 64K byte address ranges.

4.1.2 Smallest interesting width

A major reason that Forth has historically been a 16-bit language is that 8 bits is too small for general purpose computations and addressing data structures. While 12 bits was tried in some of the earliest minicomputers, 16 bits seems to be the smallest integer size that is truly useful. Forth traditionally has not used more than a 16-bit computing model because it was developed before 32-bit microprocessors were available.

16-Bit machines are capable of addressing 64K of memory, which for a stack machine is a rather large program memory. 16-Bit machines have single precision integers in the range of -32 768 to +32 767 which is large enough for most computations. Using double precision (32-bit integers), a 16-bit machine can represent integers in the range of -2 147 483 648 to +2 147 483 647, which is large enough for all but the most demanding applications.

Of course, a machine with a 4-bit or 8-bit data path can be made to emulate a 16-bit machine. The result is generally unsatisfactory performance, because an 8-bit machine can be expected to be about half as fast as a 16-bit machine when manipulating 16-bit data. Since the machines discussed in this chapter are all designed for high speed processing, all have 16-bit internal data paths.

4.1.3 Small size allows integrated embedded system

The three Forth chips discussed in this chapter (the M17, NC4016, and RTX 2000) are all targeted at the embedded applications market. Embedded applications require a small processor with a small amount of program memory to satisfy demanding power, weight, size, and cost considerations. A 16-bit processor is often a good compromise that provides higher levels of performance than an 8-bit processor, which probably would need to spend a lot of time synthesizing 16-bit arithmetic operations, and a 32-bit processor, which is overkill for many applications. "

" 5.1 WHY 32-BIT SYSTEMS?

The 16-bit processors described in Chapter 4 are sufficiently powerful for a wide variety of applications, especially in an embedded control environment. But, there are some applications that require the added power of a 32-bit processor. These applications involve extensive use of 32-bit integer arithmetic, large memory address spaces, or floating point arithmetic.

One of the difficult technical challenges that arises when designing a 32-bit stack processor is the management of the stacks. A brute force approach is to have separate off-chip stack memories in the manner of the NC4016. Unfortunately, on a 32-bit design this requires having 64 extra pins for just the data bits, making the approach unpractical for cost-sensitive applications. The FRISC 3 solves this problem by maintaining two automatically managed top-of-stack buffers on-chip, and using the normal RAM data pins to spill individual stack elements to and from program memory. The RTX 32P simply allocates a large amount of chip space to on-chip stacks and performs block moves of stack elements to and from memory for stack spilling. Chapter 6 goes into more detail about the tradeoffs involved with these approaches. "

" 8.2 16-BIT VERSUS 32-BIT HARDWARE

8.2.1 16-Bit hardware often best

16-bit stack processors in general have lower costs than 32-bit processors. Their internal data paths are narrower, so they use fewer transistors and cost less to manufacture. They only need 16-bit paths to external memory, so they have half as many memory bus data pins as 32-bit processors. System costs are also lower, since a minimum configuration 16-bit processor only needs to have half the number of memory chips as a 32-bit processor for a single bank of memory. ... 16-Bit processors should always be evaluated for an application, then rejected in favor of 32-bit processors only if there is a clear benefit for the change.

8.2.2 32-Bit hardware is sometimes required

... 32-Bit stack processors should be used instead of 16-bit processors only in cases where the application requires high efficiency at one or more of the following: 32-bit integer calculations, access to large amounts of memory, or floating point arithmetic.

"

-- http://users.ece.cmu.edu/~koopman/stack_computers/sec8_2.html

---

https://en.wikipedia.org/wiki/Calxeda apparently had this manycore building-block product:

"In March 2011 Calxeda announced a 480-core server in development, consisting of 120 quad-core ARM Cortex-A9 CPUs.[3][4][5] .. EnergyCore? ECX-1000, featuring four 32-bit ARMv7 Cortex-A9 CPU cores operating at 1.1–1.4 GHz, 32 KB L1 I-cache and 32 KB L1 D-cache per core, 4 MB shared L2 cache, 1.5 W per processor, 5 W per server node including 4 GB of DDR3 DRAM, 0.5 W when idle.[8][9] Each chip included five 10 gigabit Ethernet ports. Four chips are carried on each EnergyCard?.[8] "

Tilera's TILE-Gx8072 with 72 processors has

" Seventy-two cores operating at frequencies up to 1.2 GHz • 64-bit architecture (datapath and address)

...

32 KB L1 instruction cache and 32 KB L1 data cache per core
• 256 KB L2 cache per core

"

---

http://www.realworldtech.com/haswell-cpu/2/ says the Haswell and Sandy Bridge front end includes a 1.5K "L0" uop cache, in front of the 32k L1 icache.

i guess that's roughly comparable to a 32k icache if you assume about 1 uop per 16 bytes of code (1536 uops x 16 bytes ~= 24k); but it's probably more like 1 uop per 6 bytes, which would make the uop cache cover only ~9k of code.

--

some 'matchbox pcs':

http://matchboxpc.thydzik.com/ uses a Geode GXLV x86 processor ( https://en.wikipedia.org/wiki/Geode_%28processor%29#Geode_GXLV ) with a 16k unified L1 cache

tiqit (pratt's): used one of these (they call it a '486sx'): http://www.cpu-world.com/CPUs/ElanSC400/ . They reference the now-expired page http://www.amd.com/products/lpd/techdocs/e86/21030.pdf , which is probably http://support.amd.com/TechDocs/21030.pdf . Section 3.4 of that reference manual says the system has an 8k unified L1 cache (and no L2 cache) and uses the 486 instruction set at 100 MHz with no floating point unit.

http://www.pcworld.com/article/2044279/16-small-but-powerful-matchbox-pcs.html "16 small but powerful matchbox PCs", by Serdar Yegulalp, Computerworld, Jul 13, 2013: tiny PCs that even have their own keyboards, like the Qi Ben NanoNote

..." education-oriented Raspberry Pi ($35)

hobbyist-and-manufacturing-oriented Gumstix Overo series (from $99-$229).

hacker-friendly BeagleBone Black ($44.95). These are three of the most popular devices in this category.

Other devices have surfaced in the wake of the success of the Raspberry Pi and its peers, each a variant on the theme.

Clockwise, starting at top left: The Gooseberry (about $62) is a repurposed printed circuit board assembly originally developed for tablets rather than an original design like the Pi, but no less useful for that.

The Rascal Micro ($199) eschews video connectivity in favor of networking, so it can be used as a miniature headless system for controlling other devices.

And the PandaBoard (and PandaBoard ES, its successor), at around $175, is pricier than the Pi; it sports a few more connectors and slightly more expandability

(by '..."' i mean 'contains mostly quotes, but with much ellipsis and perhaps even paraphrasing')

$89 Korean-made Odroid U2 (left) packs in an Exynos4412 Prime ARM Cortex-A9 quad-core processor, much faster than the Pi's ARM-powered Broadcom SoC.

Another board that's used widely in automation projects is the Arduino (right), now available in a whole cornucopia of editions.

The emphasis here isn't on power or speed, though: The Arduino Uno ($55 for the bare board, $60 for a retail box version), shown here, sports an 8-bit RISC processor running at a mere 16MHz (an Intel Core i7 runs around 3GHz).

Boxed up and ready to go

Many matchbox systems come as a bare board, for which you have to supply your own case. These units, on the other hand, come packaged in a case of some kind, courtesy of the manufacturer. They are often used as mini-media centers.

Clockwise, starting at top left: The Cotton Candy ($199) and Rikomagic (about $86) both run Android, while the CuBox ($119) has additional hobbyist-friendly features, such as a recovery mode that prevents it from being bricked by mistake.

Almost a full PC

These built-up matchbox systems offer a little more breathing room.

Clockwise from top left: The Trim-Slice H packs not only an ARM Cortex-A9 processor and an NVIDIA Tegra 2 chipset but a 2.5-inch SATA hard disk into a fanless case. Prices start at $279, with developer kits available at $175.

The folks at Cappuccino PC build full-blown Intel systems (Atom or Core, your choice); the fanless SlimPro SP675FP, shown here, measures 10 in. on its longest side and sells for $685.

CompuLab's fit-PC3, which starts at $275 with minimal configuration, uses a dual-core 64-bit AMD processor with a 2.5-in. hard disk and a Radeon HD 6250 or 6320 GPU.

Keyboard included

Some even come with a keyboard.

Clockwise, from top left: The Ben NanoNote runs its own custom build of OpenWrt, the Jlime distribution or anything else you can get to run on its 336MHz MIPS processor. Only 1,500 pre-manufactured units were made, but the hardware design is available as an open project.

Next up in size, the OpenPandora (starting at $479), is billed as a mixture of PC and gaming console and is only a little larger than the Nintendo DS.

The Gecko Surfboard ($119) packs an Intel-powered system into a standard-sized keyboard but only uses 5 watts -- hearkening back to the everything-in-the-keyboard design of the Commodore 64/128.


education-oriented Raspberry Pi ($35): 32k of L1 cache total (16k icache + 16k dcache)

hobbyist-and-manufacturing-oriented Gumstix Overo series (from $99-$229). 16k unified cache? ( https://pixhawk.ethz.ch/omap/start )

hacker-friendly BeagleBone Black ($44.95): 32K/32K L1 cache

The Gooseberry (about $62) Allwinner A10 ARM Cortex-A8 (32+32 L1 cache, 512k L2 cache), Mali 400 graphics

The Rascal Micro ($199): AT91SAM9G20B-CU, a 400 MHz ARM (ARM926EJ-S), 32+32 l1 cache

PandaBoard ($175): TI OMAP4430 dual-core ARM Cortex-A9 CPU, with two ARM Cortex-M3 cores, 32+32k l1 cache

$89 Odroid U2 (left) packs in a Exynos4412 Prime ARM Cortex-A9 quad-core processor 32KB/32KB L1 Cache

Arduino Uno ($55): "The Arduino UNO has only 32K bytes of Flash memory and 2K bytes of SRAM" (and no cache?) https://learn.adafruit.com/memories-of-an-arduino/arduino-memory-architecture

Cotton Candy ($199) 1.2 GHz Exynos 4210 ( ARM Cortex-A9 (32+32k l1 cache) with 1MB L2 cache), Mali 400 graphics

Rikomagic (about $86) 32k+32k ( http://complete-concrete-concise.com/blog/raspberry-pi-and-the-mk802-a-side-by-side-comparison )

CuBox ($119): Marvell Armada 510 (88AP510) SoC with ARM v6/v7 (32/32 l1 cache?)

Gecko Surfboard ($119) https://en.wikipedia.org/wiki/Vortex86 16+16k l1 cache

intel galileo: 16 KB L1 cache ( http://www.mouser.com/applications/open-source-hardware-galileo-pi/ ). "Arduino says it's a 400MHz 32-bit Intel® Pentium instruction set architecture (ISA)-compatible processor with 16 KBytes on-die L1 cache, which does not tell us much: the 80486 and Pentium have very little difference from an ISA POV, and later models of both had 16 KByte caches, thus 80486 looks plausible, too."

---

http://iqjar.com/jar/an-overview-and-comparison-of-todays-single-board-micro-computers/

---

discussion on musl:

https://news.ycombinator.com/item?id=4058663

vasco 830 days ago

Don't really get the theory behind changing the default stack size for threads. Feels like they did it just to be different which might get someone scratching their heads for a bit.


dalias 830 days ago

The glibc default thread stack size is unacceptable/broken for a couple reasons. It eats memory like crazy (usually 8-10 megs per thread), and by memory I mean commit charge, which is a highly finite resource on a well-configured system without overcommit. Even if you allow overcommit, on 32-bit systems you'll exhaust virtual memory quickly, putting a low cap on the number of threads you can create (just 300 threads will use all 3GB of address space).

With that said, musl's current default is way too low. It's caused problems with several major applications such as git. We're in the process of trying to establish a good value for the default, which will likely end up being somewhere between 32k and 256k. I'm thinking 80k right now (96k including guard page and POSIX thread-local storage) but I would welcome evidence/data that helps make a good choice.

---

I'm currently exploring the TI CC3200; it has a good price point ($30 via TI) and is quite capable (80MHz Cortex-M4, 256k RAM) - for me it's close to the sweet spot. If TI lowered the price even more, they'd own the market.

---

http://www.etalabs.net/compare_libcs.html

Bloat comparison:

                                        musl     uClibc    dietlibc   glibc
    Complete .a set                     412k     360k      120k       2.0M †
    Complete .so set                    516k     520k      185k       7.9M †
    Smallest static C program           1.8k     7k        0.2k       662k
    Static hello (using printf)         13k      51k       6k         662k
    Dynamic overhead (min. dirty)       20k      40k       40k        48k
    Static overhead (min. dirty)        8k       12k       8k         28k
    Static stdio overhead (min. dirty)  8k       20k       16k        36k
    Configurable featureset             no       yes       minimal    minimal
    License                             MIT      LGPL 2.1  GPL 2      LGPL 2.1+ w/exceptions

so we see from the above that even the smallest libc is probably too big for our 4k/8k/16k/128k dreams

---

random article on Lisp machines:

https://news.ycombinator.com/item?id=8340283

also mentioned the Reduceron which apparently ran Haskell or something like it

http://thorn.ws/reduceron/Reduceron/Practical_Reduceron.html

" Reduceron is a high performance FPGA soft-core for running lazy functional programs, complete with hardware garbage collection. Reduceron has been implemented on various FPGAs with clock frequency ranging from 60 to 150 MHz depending on the FPGA. A high degree of parallelism allows Reduceron to implement graph evaluation very efficiently. "

---