re: go: "In addition, interacting with popular libraries (such as libsdl or even OpenGL) that use thread-local variables (TLS) means using ugly workarounds like this one:
http://code.google.com/p/go-wiki/wiki/LockOSThread "
"Some libraries, especially graphical frameworks/libraries like Cocoa, OpenGL, and libSDL, require that they be called from the main OS thread, or always from the same OS thread, due to their use of thread-local data structures. Go's runtime provides the LockOSThread() function for this, but it's notoriously difficult to use correctly." -- see http://code.google.com/p/go-wiki/wiki/LockOSThread for solution code
--
random blog post, haven't skimmed it yet:
http://www.knewton.com/tech/blog/2012/10/java-scala-interoperability/
--
--
for interop with low-level stuff:
note that fixed-length fields within a record can be represented by an annotation overlay (node labels of a certain type/a certain label-label) on top of an array of 'byte's
--
should make sure that
(a) calling a Java method from Oot, or calling an Oot method from Java, is concise; (b) passing/converting basic Oot data into a Java list, or passing/converting a Java list into an Oot list, is concise
--
ability to "do things like 'mmap this file and return me an array of ComplicatedObject[] instances'" (from https://news.ycombinator.com/item?id=6425412)
--
" If you want to call Scala from Java and have it look nice your Scala can't rely on Scala-specific language features such as implicit conversions, implicit arguments, default arguments, symbolic method names, by-name parameters, etc. Generics should be kept relatively simple as well. If you're Scala objects are relatively simple then there shouldn't be any problem using them from Java. – Erik Engbrecht May 21 '11 at 13:09' "
--
need to support some sort of GOTO in order to allow fast interpreters without dropping to assembly, because you need to help the hardware's branch predictor. The structured-programming transformation, in which a single 'switch' dispatches between basic blocks, means the branch predictor tries to predict that one indirect jump from something like the frequency distribution of its targets, which is all over the place. What you want instead is many branch/goto instructions at various places in the code, some of which have non-uniform target distributions, so that the predictor can learn the distribution for each one separately and speculatively branch while you are still executing the code leading up to the goto. Apparently this branch misprediction accounts for a large proportion of the slowdown between actual assembly and software-emulated assembly written in higher-level structured-programming languages like C, according to the following blog post:
http://www.emulators.com/docs/nx25_nostradamus.htm
" the writing is now on the wall pointing the way toward simpler, scaled down CPU cores that bring back decades olds concepts such as in-order execution, and the use of binary translation to offload complex instructions from hardware into software.
For this anniversary posting, I am going to tackle the one giant gaping hole in my argument that I haven't touched on so far. Over the past few months I've demonstrated how straightforward it is to implement many aspects of a virtual machine in a portable and efficient manner - how to simulate guest conditional flags without explicit use of the host processor's flags register, how to handle byte-swapping and endianness differences without an explicit byte-swapping instruction on the host, how to perform safe guest-to-host memory access translation and security checks without the need for a host MMU or hardware virtualization, and how to optimize away most of the branch mispredictions of a typical CPU interpreter so as to achieve purely-interpreted simulation speed levels of 100 guest MIPS or faster.
But as I hinted at last week, there is still one area I haven't explored with you, and that is the crucial indirection at the heart of any interpreter loop; the indirect call or indirect jump which directs the interpreter to the next guest instruction's handler. An indirection that by design is doomed to always mispredict and thus severely limit the maximum speed of any interpreter. As the data that Stanislav and I presented at ISCA shows, the speed ratio between purely interpreted Bochs and purely jitted QEMU is almost exactly due to the extra cost of a branch misprediction on every guest x86 instruction simulated. Eliminate or reduce the rate of that branch misprediction, and you can almost close the performance gap between an interpreter and a jitter, and thus debunk one of the greatest virtual machine myths of all - the blind faith in the use of jitting as a performance accelerator.
I will first show why this misprediction happens and why today's C/C++ compilers and microprocessors are missing one very obvious optimization that could make this misprediction go away. Then I will show you the evolution of what I call the Nostradamus Distributor, an interpreter dispatch mechanism that reduces most of the mispredictions of the inner CPU loop by helping the host CPU predict the address of the next guest instruction's handler. A form of this mechanism was already implemented in the Gemulator 9 Beta 4 release posted a few months ago, and what I will describe is the more general C/C++ based portable implementation that I plan to test out in Bochs and use in the portable C implementation of Gemulator.
...
The Common CPU Interpreter Loop Revisited
I introduced you to a basic CPU interpreter loop last October in Part 7, which I will now bore you with again for the last time:
void Emulate6502()
{
    register short unsigned int PC, SP, addr;
    register unsigned char A, X, Y, P;
    unsigned char memory[65536];

    memset(memory, 0, 65536);
    load_rom();

    /* set initial power-on values */
    A = X = Y = P = 0;
    SP = 0x1FF;
    PC = peekw(0xFFFC);

    for(;;)
    {
        switch(peekb(PC++))
        {
        default:          /* undefined opcode! treat as nop */
        case opNop:
            break;
        case opIncX:
            X++;
            break;
        case opLdaAbs16:
            addr = peekw(PC);
            PC += 2;
            A = peekb(addr);
            break;
        ...
        }
    }
}
This is a hypothetical piece of sample code that could be used as a template to implement a 6502 interpreter.
...
What I have never seen a C or C++ compiler do for such interpreter loop code is make one further optimization. What if, instead of generating the jump instruction back to the top of the "for" loop, the compiler was smart enough to simply compile the fetch and dispatch into each handler? In other words, what if there were some kind of funky "goto" keyword syntax that would allow you to write this in your source code to hint to the compiler to do that:
switch(peekb(PC++))
{
default:          /* undefined opcode! treat as nop */
case opNop:
    goto case(peekb(PC++));
case opIncX:
    X++;
    goto case(peekb(PC++));
case opLdaAbs16:
    addr = peekw(PC);
    PC += 2;
    A = peekb(addr);
    goto case(peekb(PC++));

You would now have an interpreter loop that simply jumped from instruction handler to instruction handler without even looping. Unless I missed something obvious, C and C++ lack the syntax to specify this design pattern, and the optimizing compilers don't catch it. It is for this reason that for all of my virtual machine projects over the past 20+ years I have resorted to using assembly language to implement CPU interpreters. Because in assembly language you can have a calculated jump target that you branch to from the end of each handler, as in this x86 example code which represents the typical instruction dispatch code used in most past versions of Gemulator and SoftMac and Fusion PC:
movzx ebx, word ptr gs:[esi]
add   esi, 2
jmp   fs:[ebx*4]

The 68040 guest program counter is in the ESI register, the 68040 opcode is loaded into EBX, the 68040 program counter is then incremented, and then the opcode is dispatched using an indirect jump. In Fusion PC the GS and FS segment registers are used to point to guest RAM and the dispatch table respectively, while in Gemulator and SoftMac