ideas-computer-jasper-jasperInteropNotes1

re: go: "In addition, interacting with popular libraries (such as libsdl or even OpenGL?) that use thread-local variables (TLS) means using ugly workarounds like this one:

http://code.google.com/p/go-wiki/wiki/LockOSThread "

" Some libraries, especially graphical frameworks/libraries like Cocoa, OpenGL?, libSDL all require it's called from the main OS thread or called from the same OS thread due to its use of thread local data structures. Go's runtime provides LockOSThread?() function for this, but it's notoriously difficult to use correctly. " -- see http://code.google.com/p/go-wiki/wiki/LockOSThread for solution code

--

random blog post, haven't skimmed:

http://www.knewton.com/tech/blog/2012/10/java-scala-interoperability/

--

for interop with low-level stuff:

note that fixed-length fields within a record can be represented by an annotation overlay (node labels of a certain type/a certain label-label) on top of an array of 'byte's

--

should make sure that:

(a) calling a Java method from Jasper, or calling a Jasper method from Java, is concise
(b) passing/converting basic Jasper data into a Java list, or passing/converting a Java list into a Jasper list, is concise

--

ability to "do things like "mmap this file and return me an array of ComplicatedObject?[]" instances" (from https://news.ycombinator.com/item?id=6425412)

--

" If you want to call Scala from Java and have it look nice your Scala can't rely on Scala-specific language features such as implicit conversions, implicit arguments, default arguments, symbolic method names, by-name parameters, etc. Generics should be kept relatively simple as well. If you're Scala objects are relatively simple then there shouldn't be any problem using them from Java. – Erik Engbrecht May 21 '11 at 13:09' "

--

need to support some sort of GOTO in order to allow fast interpreters without dropping to assembly, because you need to help the hardware's branch predictor. Under the structured-programming transformation there is a single 'switch' that maps between basic blocks, so the branch predictor tries to predict that switch's target by something like measuring the frequency distribution of its landing sites, which is of course all over the place. What you want instead is many branch/goto instructions at various places in the code, some of which have non-uniform frequency distributions over their targets, so that the predictor can learn the distribution for each one separately and speculatively branch while you are still in the code leading up to the goto. Apparently this branch misprediction accounts for a large proportion of the slowdown between actual assembly and software-emulated assembly written in higher-level structured-programming languages like C, according to the following blog post:

http://www.emulators.com/docs/nx25_nostradamus.htm

" the writing is now on the wall pointing the way toward simpler, scaled down CPU cores that bring back decades olds concepts such as in-order execution, and the use of binary translation to offload complex instructions from hardware into software.

For this anniversary posting, I am going to tackle the one giant gaping hole of my argument that I haven't touched on so far. Over the past few months I've demonstrated how straightforward it is to implement many aspects of a virtual machine in a portable and efficient manner - how to simulate guest conditional flags without explicit use of the host processor's flags register, how to handle byte-swapping and endianness differences without an explicit byte-swapping instruction on the host, how to perform safe guest-to-host memory access translation and security checks without need for a host MMU or hardware virtualization, and how to optimize away most of the branch mispredictions of a typical CPU interpreter such as to achieve purely-interpreted simulation speed levels of 100 guest MIPS or faster.

But as I hinted at last week, there is still one area I haven't explored with you, and that is the crucial indirection at the heart of any interpreter loop; the indirect call or indirect jump which directs the interpreter to the next guest instruction's handler. An indirection that by design is doomed to always mispredict and thus severely limit the maximum speed of any interpreter. As the data that Stanislav and I presented at ISCA shows, the speed ratio between purely interpreted Bochs and purely jitted QEMU is almost exactly due to the extra cost of a branch misprediction on every guest x86 instruction simulated. Eliminate or reduce the rate of that branch misprediction, and you can almost close the performance gap between an interpreter and a jitter, and thus debunk one of the greatest virtual machine myths of all - the blind faith in the use of jitting as a performance accelerator.

I will first show why this misprediction happens and why today's C/C++ compilers and microprocessors are missing one very obvious optimization that could make this misprediction go away. Then I will show you the evolution of what I call the Nostradamus Distributor, an interpreter dispatch mechanism that reduces most of the mispredictions of the inner CPU loop by helping the host CPU predict the address of the next guest instruction's handler. A form of this mechanism was already implemented in the Gemulator 9 Beta 4 release posted a few months ago, and what I will describe is the more general C/C++ based portable implementation that I plan to test out in Bochs and use in the portable C implementation of Gemulator.

...

The Common CPU Interpreter Loop Revisited

I introduced you to a basic CPU interpreter loop last October in Part 7, which I will now bore you with again for the last time:

void Emulate6502()
    {
    register short unsigned int PC, SP, addr;
    register unsigned char A, X, Y, P;
    unsigned char memory[65536];

    memset(memory, 0, 65536);
    load_rom();
    /* set initial power-on values */
    A = X = Y = P = 0;
    SP = 0x1FF;
    PC = peekw(0xFFFC);
    for(;;)
        {
        switch(peekb(PC++))
            {
        default: /* undefined opcode! treat as nop */
        case opNop:
            break;
        case opIncX:
            X++;
            break;
        case opLdaAbs16:
            addr = peekw(PC);
            PC += 2;
            A = peekb(addr);
            break;

        ...
            }
        }
    }

This is a hypothetical piece of sample code that could be used as a template to implement a 6502 interpreter.

...

What I have never seen a C or C++ compiler do for such interpreter loop code is make one further optimization. What if, instead of generating the jump instruction to the top of the "for" loop, the compiler were smart enough to simply compile the fetch and dispatch into each handler? In other words, what if there were some kind of funky "goto" keyword syntax that would allow you to write this in your source code to hint to the compiler to do that:

        switch(peekb(PC++))
            {
        default: /* undefined opcode! treat as nop */
        case opNop:
            goto case(peekb(PC++));
        case opIncX:
            X++;
            goto case(peekb(PC++));
        case opLdaAbs16:
            addr = peekw(PC);
            PC += 2;
            A = peekb(addr);
            goto case(peekb(PC++));

You would now have an interpreter loop that simply jumped from instruction handler to instruction handler without even looping. Unless I missed something obvious, C and C++ lack the syntax to specify this design pattern, and the optimizing compilers don't catch it. It is for this reason that for all of my virtual machine projects over the past 20+ years I have resorted to using assembly language to implement CPU interpreters. Because in assembly language you can have a calculated jump target that you branch to from the end of each handler, as in this x86 example code, which represents the typical instruction dispatch code used in most past versions of Gemulator and SoftMac and Fusion PC:

    movzx ebx,word ptr gs:[esi]
    add esi,2
    jmp fs:[ebx*4]

The 68040 guest program counter is in the ESI register, the 68040 opcode is loaded into EBX, the 68040 program counter is then incremented, and then the opcode is dispatched using an indirect jump. In Fusion PC the GS and FS segment registers are used to point to guest RAM and the dispatch table respectively, while in Gemulator and SoftMac there were explicit 32-bit address displacements used. But the mechanism is the same and you will see this pattern in many other interpreters.

The nice thing about handler chaining is that it has a beneficial side-effect! Not only does it eliminate a jump back to the top of a loop; by spreading out the indirect jumps from one central point into each of the handlers, the host CPU now has dozens if not hundreds of places that it dispatches from. You might say to yourself this is bad, I mean, this bloats the size of the interpreter's code and puts an extra strain on the host CPU's branch predictor, no?

Yes! But, here is the catch. Machine language opcodes tend to follow patterns. Stack pushes are usually followed by a call instruction. Pops are usually followed by a return instruction. A memory load instruction is usually followed by a memory store instruction. A compare is followed by a conditional jump (usually a Jump If Zero). Especially with compiled code, you will see patterns of instructions repeating over and over again. That means that if you are executing the handler for the compare instruction, chances are very good that the next guest instruction is a conditional jump. Patterns like this will no doubt make up a huge portion of the guest code being interpreted, and so what happens is that the host CPU's branch predictor will start to correctly predict the jump targets from one handler to another.

...

gcc's computed goto:

http://web.archive.org/web/20100130194117/http://blogs.sun.com/nike/entry/fast_interpreter_using_gcc_s

"

--

incidentally, the above blog is recommended by another random blog:

" vx32, an unconventional (but not exactly new) approach to virtualization: segmentation registers and instruction translation. An excellent introduction to this is provided by No Execute , and if you are interested in creating or working with virtual machines (from interpreters to emulators), I can’t recommend No Execute enough. " http://www.emulators.com/nx_toc.htm

--

" The Inferno shell is one of the most interesting features, however. It’s based on the rc shell by none other than Tom Duff of Duff’s Device fame. rc is a great shell, especially for scripting, runs almost everywhere, and is the default shell for Plan 9. Inferno’s version introduces an FFI to Limbo. Re-read that sentence if your jaw and the floor haven’t connected yet.

Through the Inferno shell builtin “load”, you can load modules that expand the shell’s builtins. For example, Inferno comes with a module to add regex support to the shell, one to add CSV support, and another to add support for Tk. I’ve not seen a feature quite like this in a shell, although I mostly stick to bash or rc, not having tried out the slightly more exotic ksh or zsh shells, which for all I know also have that feature. "

--

" Structures in Go are laid out in memory identically to the same in C, allowing for potential zero copy use on either side (made possible given that addresses of data are fixed in Go — it is not a compacting GC, and doesn’t need the normal pinning functionality or expensive boundary crossing often seen in GC VMs like Java or .NET). There are concerns regarding who frees memory, and the lifetime of passed objects that *are* bound to be GCs, but those are topics for another time. "

example: http://dennisforbes.ca/index.php/2013/07/31/demonstrating-gos-easy-c-interop/

http://golang.org/cmd/cgo/

pointers are available, however no pointer arithmetic: http://golang.org/doc/faq#no_pointer_arithmetic

--

" Actually, the hardest problem was getting the instrumentation agent to identify suspendable Clojure functions. This is quite easy with Java Quasar code as suspendable methods declare themselves as throwing a special checked exception. The Java compiler then helps ensure that any method calling a suspendable method must itself be declared suspendable. But Clojure doesn’t have checked exceptions. I though of using an annotation, but that didn’t work, and skimming through the Clojure compiler’s code proved that it’s not supported (though this feature could be added to the compiler very easily). In fact, it turns out you can’t mark the class generated by the Clojure compiler for each plain Clojure function in any sensible way that could be then detected by the instrumentation agent. Then I realized it wouldn’t have mattered because Clojure sometimes generates more than one class per function.

I ended up notifying the instrumentation agent after the function’s class has been defined, and then retransforming the class bytecode in memory. Also, because all Clojure function calls are done via an interface (IFn), there is no easy way to recognize calls to suspendable functions in order to inject stack management code at the call-site. An easy solution was just to assume that any call to a Clojure function from within a suspendable function is a call to a suspendable function (although it adversely affects performance; we might come up with a better solution in future releases). "

---

this clojure library looks like the sort of thing i was thinking about! :

https://github.com/ztellman/gloss

---

" Bitfields would be really handy when writing a device driver. A basic example of the difference they would make is "if (reg.field == VAL)" vs. either "if (reg & MASK == VAL)" or "if (GET_FIELD(reg) == MASK)". But you can't use them for that purpose because their layout is implementation defined.

There would be a noteworthy performance cost to pay for portable bitfields, but I think it would be worth it. Right now I have to write ugly code if I want any attempt at portability (always). I'd much rather have to write ugly code where I need speed (sometimes).

I'd love to hear if anyone else has any ideas on this matter (BTW it seems like D solves many of the problems I've been thinking about, but it's memory managed). "

---

http://pyjnius.readthedocs.org/en/latest/

---

go's C interop

"Go has a foreign function interface to C, but it receives only a cursory note on the home page. This is unfortunate, because the FFI works pretty darn well. You pass a C header to the "cgo" tool, and it generates Go code (types, functions, etc.) that reflects the C code (but only the code that's actually referenced). C constants get reflected into Go constants, and the generated Go functions are stubby and just call into the C functions.

The cgo tool failed to parse my system's ncurses headers, but it worked quite well for a different C library I tried, successfully exposing enums, variables, and functions. Impressive stuff.

Where it falls down is function pointers: it is difficult to use a C library that expects you to pass it a function pointer. I struggled with this for an entire afternoon before giving up. Ostsol got it to work through, by his own description, three levels of indirection. " -- http://ridiculousfish.com/blog/posts/go_bloviations.html#go_ccompatibility

--

" Instead, the OCaml created a pair of pipes and spawned a Python subprocess. You need to use two pipes, not a single socket, because Windows doesn't support Unix sockets. "