proj-plbook-plChLowLevelConstructs

Table of Contents for Programming Languages: a survey

chapter: low-level constructs

inlining

malloc

pointers

e.g. "direct memory addressing"

and pointer arithmetic

control of actual representation of data structures

sequence points

In C, code like "int a = 41; a = a++;" compiles, but leaves 'a' in an undefined state, because "you can only update a variable once between sequence points" or the behavior is undefined; on many compilers it nevertheless appears to work. A sequence point is "a point in the program's execution sequence where all previous side effects SHALL have taken place and all subsequent side-effects SHALL NOT have taken place". The preceding is my wording except for the quotes; the quotes are from http://www.slideshare.net/olvemaudal/deep-c

computed goto

If you are writing an interpreter, commonly the structure will involve a dispatcher which looks at the next instruction to be executed, executes a segment of code corresponding to that instruction, and then returns to the dispatcher. You might think that the best way to do this would be a switch statement on the next instruction. But actually the best way is to make a dispatch table, and then, at the end of each instruction-specific code segment, to look at the next instruction, look up the address to jump to in the dispatch table, and then jump to it. There are two reasons this is faster. First, a switch statement incurs some bounds-checking overhead; if you know beforehand that the next instruction is a valid instruction that appears somewhere in the dispatch table, then you don't need that check. Second, most hardware branch predictors will work better, because they can associate a separate prediction with the indirect jump at the end of each instruction's code segment, rather than a single prediction with the one jump in the switch statement.

Empirically, this is about 20% faster. Implementations of Python, Ruby, and Java (on Dalvik, the Android Java VM) use this. But many high level languages don't provide a computed goto instruction, so you can't do this in them. Writing an interpreter is a niche need, but an important one for platform metaprogrammability.

http://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables : computed goto in C can be looked at as one feature, computed goto, or as two features: taking the address of a label, and goto to an address held in a variable (an indirect jump).

Computed gotos can be implemented via tail calls.

benefits over a huge 'switch' or 'while': no bounds check on dispatch, and better branch prediction (see above)

links: http://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables/ http://gcc.gnu.org/onlinedocs/gcc-3.2/gcc/Labels-as-Values.html http://www.emulators.com/docs/nx25_nostradamus.htm

initializing things once

e.g. in C++, if you have:

class A {
public:
    A(int sz) { sz_ = sz; v = new B[sz_]; }
    ~A() { delete[] v; }
    ...
private:
    ...
    B * v;
    int sz_;
};

then sz_ is first default-initialized before the constructor body runs (for a built-in type like int this actually leaves it uninitialized, not 0; a class-type member would have its default constructor run here), and then it is set to sz in a second step.

But if you instead define the constructor as:

A(int sz) : v(new B[sz]), sz_(sz) {}

then sz_ (and v) will be initialized exactly once.

memory management

regions for memory management

Region-Based Memory Management in Cyclone defines "region-based" memory-management as "each object lives in one region and, with the exception that a distinguished heap region may be garbage collected, a region’s objects are all deallocated simultaneously", citing M. Tofte and J.-P. Talpin. Region-based memory management. Information and Computation, 132(2):109–176, 1997.

note: in Rust, regions are called 'lifetimes' [3]

API (eg a C API), calling from other languages, calling out to other languages, embedding

Generally, the term "FFI" (foreign function interface) means a mechanism for calling out from the language with the FFI to another language, and a "C API" means a mechanism to call functions in the language with the API from C. 'Embedding' means a mechanism for a C program to initialize and call code in the embedded language (this is distinct from an 'embedded system', which means "a computer system with a dedicated function within a larger mechanical or electrical system, often with real-time computing constraints").

A language implementation runs on a 'platform', which might be an operating system, or the runtime of another programming language, for example the JVM or CLR. Often languages that run on other languages' runtimes have facilities to call out to, and to be called by and embedded by, other programs running on the platform.

A language might be able to interoperate with other languages besides C (and to use non-C calling conventions in its API and/or FFI), but as of this writing C is the de-facto standard choice for interoperability conventions, outside language runtime platforms such as the JVM or CLR.

The Evolution of Lua by Ierusalimschy, de Figueiredo, and Celes, Section 6.9, 'C API', PDF page 17, is instructive. It notes that C APIs to dynamic, memory-managed languages must deal with two mismatches (static vs. dynamic typing, and manual memory management vs. garbage collection), and describes how Lua deals with this:

"an abstract stack to exchange data between Lua and C. Every C function called by Lua gets a new stack frame that initially contains the function arguments. If the C function wants to return values to Lua, it pushes those values onto the stack just before returning. Each stack slot can hold a Lua value of any type. ... To prevent the collection of Lua values in use by C code, the values in the stack are never collected. When a C function returns, its Lua stack frame vanishes, automatically releasing all Lua values that the C function was using. These values will eventually be collected if no further references to them exist....Support for multiple, independent Lua states was achieved by eliminating all global state...Lua 4.0 introduced explicit Lua states in the API...All C code that communicated with Lua (in particular, all C functions registered to Lua) had to...include an explicit state argument in calls to the C API"

The paper also describes other mechanisms that were not chosen, and why.

pcwalton 4 hours ago

Golang implements a custom scheduler in userspace too, and this is why its FFI is not as fast or as well integrated with the OS as the calling conventions of C or C++ are. Heck, Golang's FFI is even less straightforward than that of Java, since the JVM is typically 1:1, not M:N.

Once you've gone M:N you're effectively in your own little world. There's a reason Golang's runtime is compiled with the Golang toolchain, and why it invokes syscalls directly as opposed to going through libc.


 rilut 3 hours ago

Sorry, what is M:N?


gilgoomesh 2 hours ago

Any situation where you have M userspace jobs running on N system threads, i.e. the number of tasks is different from the number of system threads.

Normally this occurs because you're running a large number of "green" threads on your own scheduler, which schedules onto a thread pool underneath. This is good if all your threads are small/tiny, since userspace thread creation is cheaper than creating an OS thread; but if your jobs are long-lived, then your userspace job scheduler is really just adding additional scheduling overhead on top of the overhead the OS already has for thread scheduling, and you would have been better off with OS threads. If your M:N threading requires separate stack space for each job, there can be a sizeable overhead (this is why Rust abandoned M:N threading).


anon3_ 2 hours ago

Can you come up with some examples of when this would begin to be noticeable to an end-user?


pcwalton 1 hour ago

If you're crossing the FFI boundary a lot, any overhead adds up quick. For example, drawing a bunch of small objects using Skia, performing lots of OpenGL draw calls, allocating LLVM IR nodes, or calling a C memory allocator…


kjksf 3 hours ago

I did some cgo (Go's FFI) work. Go's FFI is vastly superior to JNI (which is in a league of its own when it comes to awfulness).

No FFI is really great. Go's isn't as good as C# or Luajit or Swift but better than Java or Python or Ruby or Lua FFI (where you have to write a wrapper for each function you want to expose to the language).

...

It's true that Go is M:N and that it does have a bearing on bridging some C code (in rare cases where C code only works when called on the main thread).

However, gccgo has Go runtime written in C and compiled with gcc, so go's runtime isn't tied to one toolchain.

I don't know why the Go designers chose to (mostly) avoid libc, but it certainly is great for portability and ease of cross-compilation. If you take a dependency on libc (which is different between Linux/Mac/Windows) you pretty much throw away the ability to cross-compile, and given Go's static compilation model it would require bundling a C compiler (unlike runtime-based languages like Java, Python or Ruby, where only the runtime has to be compiled on the proper OS, so that complexity is contained for the programs written in the language).

I don't see why you ding Go in particular for being "its own little world". Compared to C - of course. Compared to Java or Python or Elixir? Much less than them.


masklinn 13 hours ago

> better than Python or Ruby

Mmm? https://cffi.readthedocs.org/en/latest/overview.html (basically a port of luajit's ffi) http://ruby-doc.org/stdlib-2.0.0/libdoc/fiddle/rdoc/Fiddle.h... (ruby DSL to libffi)


pcwalton 58 minutes ago

cgo-style stack switching (which I assume BEAM also uses) adds a lot of overhead at runtime, which Java and Python don't need since they're 1:1.

The speed of the FFI really affects how a language ecosystem uses it; if it's a lot slower to call out to external libraries than to call code written in the same language, then there's a large incentive to rewrite all dependencies in the language as opposed to using what's already there. Sun's libraries are a bit of a special case in that Sun really tried to rewrite everything for strategic/political reasons, but look at Android; the heavy lifting in the Android stack is done by Skia, OpenGL, and Blink/WebKit (to name a few), a strategy which works because JNI is relatively fast. Python also heavily favors using C libraries where appropriate, again because Python C bindings are fast.

I don't understand the issue about cross-compilation. You don't need a cross-compiler to statically link against native libraries; you just need a binary library to link to and a cross-linker (which can be language-independent). And, of course, if you dynamically link, you don't even need that much.

I'm not really trying to ding Golang, in any case. M:N scheduling has benefits as well as drawbacks. FFI is one of the downsides. There are upsides, such as fast thread spawning. It's a tradeoff.


beagle3 14 hours ago

> It really forced people to write java all the way. This meant the JVMs could really evolve unlike the Python and Ruby ones.

Java's FFI is horrible; Python and Ruby are mediocre. LuaJIT 2's is fantastic. Not so surprisingly, Python ate Java's lunch in places like scientific computing, where it is much more beneficial to build on existing work.

Python is hard to dethrone from that spot right now because of momentum, mostly - but if the competition was started again, I'm sure LuaJIT 2 would take the crown (Torch7 is based on it, but that's the only one I know).

I think my bottom line is: if you want your VM environment to be self-sufficient, have a horrible FFI like Java's. If you want your VM environment to thrive with existing codebases, you have to have at least a mediocre one like Python's. But you can have the best of all worlds, like LuaJIT 2 - and that's who Oracle should be copying.


pjmlp 13 hours ago

I think Python will lose momentum as soon as Julia gets more adoption, likewise with languages like Go and ML derivatives. Unless PyPy gets more widespread, that is.

Upcoming Java's FFI is based on JNR, an evolution of JNA, used by JRuby for native FFI.

Nevertheless everyone seems to have forgotten about CNI, implemented on GCJ, which mapped C++ classes directly to Java ones.
