proj-oot-old-150618-ootInteropNotes1

re: go: "In addition, interacting with popular libraries (such as libsdl or even OpenGL) that use thread-local variables (TLS) means using ugly workarounds like this one:

http://code.google.com/p/go-wiki/wiki/LockOSThread "

" Some libraries, especially graphical frameworks/libraries like Cocoa, OpenGL, libSDL, all require being called from the main OS thread, or always from the same OS thread, due to their use of thread-local data structures. Go's runtime provides the LockOSThread() function for this, but it's notoriously difficult to use correctly. " -- see http://code.google.com/p/go-wiki/wiki/LockOSThread for solution code

--

random blog post, haven't skimmed:

http://www.knewton.com/tech/blog/2012/10/java-scala-interoperability/

--

for interop with low-level stuff:

note that fixed-length fields within a record can be represented by an annotation overlay (node labels of a certain type/a certain label-label) on top of an array of 'byte's

--

should make sure that

(a) calling a Java method, or calling an Oot method from Java, is concise (b) passing/converting basic Oot data into a Java list, or passing/converting a Java list into an Oot list, is concise

--

ability to "do things like "mmap this file and return me an array of ComplicatedObject[]" instances" (from https://news.ycombinator.com/item?id=6425412)

--

" If you want to call Scala from Java and have it look nice your Scala can't rely on Scala-specific language features such as implicit conversions, implicit arguments, default arguments, symbolic method names, by-name parameters, etc. Generics should be kept relatively simple as well. If your Scala objects are relatively simple then there shouldn't be any problem using them from Java. – Erik Engbrecht May 21 '11 at 13:09 "

--

need to support some sort of GOTO in order to allow fast interpreters without dropping to assembly, because you need to help the hardware's branch predictor. The structured-programming transformation, in which a single 'switch' maps between basic blocks, means that the branch predictor tries to predict that switch essentially by measuring the frequency distribution of landings on its targets, which of course is all over the place. What you want instead is many branch/goto instructions at various places in the code, some of which have non-uniform frequency distributions over their targets, so that the branch predictor can learn the distribution for each one separately and speculatively branch while you are still in the code leading up to the goto. Apparently this branch misprediction accounts for a large proportion of the slowdown between actual assembly and software-emulated assembly written in higher-level structured-programming languages like C, according to the following blog post:

http://www.emulators.com/docs/nx25_nostradamus.htm

" the writing is now on the wall pointing the way toward simpler, scaled down CPU cores that bring back decades olds concepts such as in-order execution, and the use of binary translation to offload complex instructions from hardware into software.

For this anniversary posting, I am going to tackle the one giant gaping hole of my argument that I haven't touched on so far. Over the past few months I've demonstrated how straightforward it is to implement many aspects of a virtual machine in a portable and efficient manner - how to simulate guest conditional flags without explicit use of the host processor's flags register, how to handle byte-swapping and endianness differences without an explicit byte-swapping instruction on the host, how to perform safe guest-to-host memory access translation and security checks without need for a host MMU or hardware virtualization support, and how to optimize away most of the branch mispredictions of a typical CPU interpreter such as to achieve purely-interpreted simulation speed levels of 100 guest MIPS or faster.

But as I hinted at last week, there is still one area I haven't explored with you, and that is the crucial indirection at the heart of any interpreter loop; the indirect call or indirect jump which directs the interpreter to the next guest instruction's handler. An indirection that by design is doomed to always mispredict and thus severely limit the maximum speed of any interpreter. As the data that Stanislav and I presented at ISCA shows, the speed ratio between purely interpreted Bochs and purely jitted QEMU is almost exactly due to the extra cost of a branch misprediction on every guest x86 instruction simulated. Eliminate or reduce the rate of that branch misprediction, and you can almost close the performance gap between an interpreter and a jitter, and thus debunk one of the greatest virtual machine myths of all - the blind faith in the use of jitting as a performance accelerator.

I will first show why this misprediction happens and why today's C/C++ compilers and microprocessors are missing one very obvious optimization that could make this misprediction go away. Then I will show you the evolution of what I call the Nostradamus Distributor, an interpreter dispatch mechanism that reduces most of the mispredictions of the inner CPU loop by helping the host CPU predict the address of the next guest instruction's handler. A form of this mechanism was already implemented in the Gemulator 9 Beta 4 release posted a few months ago, and what I will describe is the more general C/C++ based portable implementation that I plan to test out in Bochs and use in the portable C implementation of Gemulator.

...

The Common CPU Interpreter Loop Revisited

I introduced you to a basic CPU interpreter loop last October in Part 7, which I will now bore you with again for the last time:

void Emulate6502()
{
    register short unsigned int PC, SP, addr;
    register unsigned char A, X, Y, P;
    unsigned char memory[65536];

    memset(memory, 0, 65536);
    load_rom();
    /* set initial power-on values */
    A = X = Y = P = 0;
    SP = 0x1FF;
    PC = peekw(0xFFFC);
    for(;;)
        {
        switch(peekb(PC++))
            {
        default: /* undefined opcode! treat as nop */
        case opNop:
            break;
        case opIncX:
            X++;
            break;
        case opLdaAbs16:
            addr = peekw(PC);
            PC += 2;
            A = peekb(addr);
            break;

        ...
            }
        }
}

This is a hypothetical piece of sample code that could be used as a template to implement a 6502 interpreter.

...

What I have never seen a C or C++ compiler do for such interpreter loop code is make one further optimization. What if, instead of generating the jump instruction to the top of the "for" loop, the compiler was smart enough to simply compile the fetch and dispatch into each handler? In other words, what if there was some kind of funky "goto" keyword syntax that would allow you to write this in your source code to hint to the compiler to do that:

        switch(peekb(PC++))
            {
        default: /* undefined opcode! treat as nop */
        case opNop:
            goto case(peekb(PC++));
        case opIncX:
            X++;
            goto case(peekb(PC++));
        case opLdaAbs16:
            addr = peekw(PC);
            PC += 2;
            A = peekb(addr);
            goto case(peekb(PC++));

You would now have an interpreter loop that simply jumped from instruction handler to instruction handler without even looping. Unless I missed something obvious, C and C++ lack the syntax to specify this design pattern, and the optimizing compilers don't catch it. It is for this reason that for all of my virtual machine projects over the past 20+ years I have resorted to using assembly language to implement CPU interpreters. Because in assembly language you can have a calculated jump target that you branch to from the end of each handler, as in this x86 example code, which represents the typical instruction dispatch code used in most past versions of Gemulator and SoftMac and Fusion PC:

    movzx ebx,word ptr gs:[esi]
    add esi,2
    jmp fs:[ebx*4]

The 68040 guest program counter is in the ESI register, the 68040 opcode is loaded into EBX, the 68040 program counter is then incremented, and then the opcode is dispatched using an indirect jump. In Fusion PC the GS and FS segment registers are used to point to guest RAM and the dispatch table respectively, while in Gemulator and SoftMac there were explicit 32-bit address displacements used. But the mechanism is the same and you will see this pattern in many other interpreters.

The nice thing about handler chaining is that it has a beneficial side-effect! Not only does it eliminate a jump back to the top of a loop, but by spreading out the indirect jumps from one central point into each of the handlers, the host CPU now has dozens if not hundreds of places that it dispatches from. You might say to yourself this is bad, I mean, this bloats the size of the interpreter's code and puts an extra strain on the host CPU's branch predictor, no?

Yes! But, here is the catch. Machine language opcodes tend to follow patterns. Stack pushes are usually followed by a call instruction. Pops are usually followed by a return instruction. A memory load instruction is usually followed by a memory store instruction. A compare is followed by a conditional jump (usually a Jump If Zero). Especially with compiled code, you will see patterns of instructions repeating over and over again. That means that if you are executing the handler for the compare instruction, chances are very good that the next guest instruction is a conditional jump. Patterns like this will no doubt make up a huge portion of the guest code being interpreted, and so what happens is that the host CPU's branch predictor will start to correctly predict the jump targets from one handler to another.

...

gcc's computed goto:

http://web.archive.org/web/20100130194117/http://blogs.sun.com/nike/entry/fast_interpreter_using_gcc_s

"

--

incidentally, the above blog is recommended by another random blog:

" vx32, an unconventional (but not exactly new) approach to virtualization: segmentation registers and instruction translation. An excellent introduction to this is provided by No Execute , and if you are interested in creating or working with virtual machines (from interpreters to emulators), I can’t recommend No Execute enough. " http://www.emulators.com/nx_toc.htm

--

" The Inferno shell is one of the most interesting features, however. It’s based on the rc shell by none other than Tom Duff of Duff’s Device fame. rc is a great shell, especially for scripting, runs almost everywhere, and is the default shell for Plan 9. Inferno’s version introduces an FFI to Limbo. Re-read that sentence if your jaw and the floor haven’t connected yet.

Through the Inferno shell builtin “load”, you can load modules that expand the shell’s builtins. For example, Inferno comes with a module to add regex support to the shell, one to add CSV support, and another to add support for Tk . I’ve not seen a feature quite like this in a shell, although I mostly stick to bash or rc, not having tried out the slightly more exotic ksh or zsh shells, which for all I know also have that feature. "

--

" Structures in Go are laid out in memory identically to the same in C, allowing for potential zero copy use on either side (made possible given that addresses of data are fixed in Go — it is not a compacting GC, and doesn’t need the normal pinning functionality or expensive boundary crossing often seen in GC VMs like Java or .NET). There are concerns regarding who frees memory, and the lifetime of passed objects that *are* bound to be GCs, but those are topics for another time. "

example: http://dennisforbes.ca/index.php/2013/07/31/demonstrating-gos-easy-c-interop/

http://golang.org/cmd/cgo/

pointers are available, but no pointer arithmetic: http://golang.org/doc/faq#no_pointer_arithmetic

--

" Actually, the hardest problem was getting the instrumentation agent to identify suspendable Clojure functions. This is quite easy with Java Quasar code as suspendable methods declare themselves as throwing a special checked exception. The Java compiler then helps ensure that any method calling a suspendable method must itself be declared suspendable. But Clojure doesn’t have checked exceptions. I thought of using an annotation, but that didn’t work, and skimming through the Clojure compiler’s code proved that it’s not supported (though this feature could be added to the compiler very easily). In fact, it turns out you can’t mark the class generated by the Clojure compiler for each plain Clojure function in any sensible way that could be then detected by the instrumentation agent. Then I realized it wouldn’t have mattered because Clojure sometimes generates more than one class per function.

I ended up notifying the instrumentation agent after the function’s class has been defined, and then retransforming the class bytecode in memory. Also, because all Clojure function calls are done via an interface (IFn), there is no easy way to recognize calls to suspendable functions in order to inject stack management code at the call-site. An easy solution was just to assume that any call to a Clojure function from within a suspendable function is a call to a suspendable function (although it adversely affects performance; we might come up with a better solution in future releases). "

---

this clojure library looks like the sort of thing i was thinking about! :

https://github.com/ztellman/gloss

---

" Bitfields would be really handy when writing a device driver. A basic example of the difference they would make is "if (reg.field == VAL)" vs. either "if (reg & MASK == VAL)" or "if (GET_FIELD(reg) == MASK)". But you can't use them for that purpose because their layout is implementation defined.

There would be a noteworthy performance cost to pay for portable bitfields, but I think it would be worth it. Right now I have to write ugly code if I want any attempt at portability (always). I'd much rather have to write ugly code where I need speed (sometimes).

I'd love to hear if anyone else has any ideas on this matter (BTW it seems like D solves many of the problems I've been thinking about, but it's memory managed). "

---

http://pyjnius.readthedocs.org/en/latest/

---

go's C interop

"Go has a foreign function interface to C, but it receives only a cursory note on the home page. This is unfortunate, because the FFI works pretty darn well. You pass a C header to the "cgo" tool, and it generates Go code (types, functions, etc.) that reflects the C code (but only the code that's actually referenced). C constants get reflected into Go constants, and the generated Go functions are stubby and just call into the C functions.

The cgo tool failed to parse my system's ncurses headers, but it worked quite well for a different C library I tried, successfully exposing enums, variables, and functions. Impressive stuff.

Where it falls down is function pointers: it is difficult to use a C library that expects you to pass it a function pointer. I struggled with this for an entire afternoon before giving up. Ostsol got it to work through, by his own description, three levels of indirection. " -- http://ridiculousfish.com/blog/posts/go_bloviations.html#go_ccompatibility

--

" Instead, the OCaml side created a pair of pipes and spawned a Python subprocess. You need to use two pipes, not a single socket, because Windows doesn't support Unix sockets. "

---

this thread compares Traceur and Babel for JavaScript ES6 -> ES5 compilation (which people call "transpilation" for some odd reason)

https://news.ycombinator.com/item?id=9090958

"issues are addressed within a day and the author is an overall great guy"

"Traceur isn't readable at all (at least not to me) which might not matter, but I think in some cases, Babel's output is closer to more traditional Javascript and more performant"

"Traceur scared people away both because it was a build tool and had a runtime dependency. Now that tools like Webpack are common, introducing a build step at any point in your workflow is trivial. Moreover, many of the JS Harmony improvements can be transpiled with a tool like Babel or jstransform without the need to introduce yet another library in your deployed code."

"

Bahamut 2 days ago

I like Traceur (at least with tools like System.js & jspm), especially with its support for optional typing and annotations, but for some apps, Babel makes a lot more sense.

I came across the difficulty of using jspm on the server and client for an isomorphic React app I am building for an online community I run, and I was advised to just use npm (& naturally Babel). Integrating browserify with babelify into the build process was a far easier task."

--

look at all the trouble I go through with image_singlegenes. It should be easy to make an array of images in Python and then transfer them into the Octave code being run. A good glue language would handle that sort of thing.

that is, the stuff in e.g. imsave_via_octave should be almost built in (contentful 'pipes' at least, but really, stacks and named variables should transfer, even mutable ones/references), and it should be easy to write stuff like image_singlegenes by calling it

--

" mooreds 17 hours ago

I haven't done much with the other JVM languages except play around with them (did work with a small jython project about 10 years ago...). Have you encountered any impedance mismatch? Or weird cross language bugs?

reply

infraruby 15 hours ago

JRuby converts values (Java primitives <-> ruby Numeric, java.lang.String <-> ruby String, etc.) sometimes with unexpected results.

JRuby does not wrap primitive values, or provide values that behave like primitives, but you can add that: https://rubygems.org/gems/infraruby-java

reply "

---

https://groups.google.com/forum/m/#!topic/golang-nuts/RwJaZh0nJA4 notes that the gold standard of embedding, "embedding "blindfolded" Go code in C, by compiling Go to object code, and then linking to that from C," won't work in Golang, because

" The biggest problem is the fact that Go has its own runtime, namely a garbage collector and a scheduler (for goroutines) and segmented stacks that do not play well with C stacks. It's much more complicated than linking the executable code and translating between calling conventions (e.g. what happens to a goroutine created in Go code you called from C after Go code returns to C?)

Now, it's not impossible to overcome this, e.g. V8 or Microsoft's .NET have embedding APIs that allow for a decent way of calling JavaScript/CLR code from C/C++ code, but that doesn't exist for any current implementation of Go (as far as I know) and doing it right is a significant amount of work.

"

---

https://news.ycombinator.com/item?id=9376793

---

http://ebb.org/bkuhn/blog/2014/06/09/do-not-need-cla.html

---

apl 16 hours ago

The Julia FFI for Python is absolutely excellent -- calling a particular Python library from Julia takes very, very little effort. As a language for scientific programming, Julia is way ahead of Python. So I'm not sure if this particular argument holds much water.

https://github.com/stevengj/PyCall.jl

reply

---

weak references, and a way to receive a notification sometime after the garbage-collection of a weak reference, are rather necessary for interop: see https://news.ycombinator.com/item?id=9735973


munificent 5 hours ago

WebAssembly still has a long way to go. They don't have a plan yet for:

I don't think any high level language will be able to compete with JS on an even playing field until that language can use the high performance GC that's already in every browser.

If your language has to either not use GC (huge productivity loss) or ship your own GC to the end user with your app (huge app size), it's an unfair fight.

reply

---

webassembly

---

spullara 4 hours ago

It is really too bad that at some point in the last 18 years of Java VMs being in browsers that they didn't formalize the connection between the DOM and Java so that you could write code that interacted directly with the DOM and vice/versa in a mature VM that was already included. Would have been way better than applets, way faster than Javascript and relatively easy to implement. The browsers actually have (had?) APIs for this but they were never really stabilized.

reply

hello_there 3 hours ago

I find it interesting that Java didn't become the standard for this as it seems like it has everything and is both fast and mature.

What might be the reason?

reply

titzer 1 hour ago

There are several important lessons to learn from the Java bytecode format, and members of the WebAssembly group (including myself) do have experience here. In particular, JVM class files would be a poor fit for WebAssembly because:

1. They impose Java's class and primitive type model.

2. They allow irreducible control flow.

3. They aren't very compact. Lots of redundancy in constant pools across classes and still a lot of possibility for compression.

4. Verification of JVM class files is an expensive operation requiring control and dataflow analysis (see stackmaps added in the Java 7 class file format for rationale).

5. No notion of low-level memory access. WebAssembly specifically addresses this, exposing the notion of a native heap that can be bit-banged directly by applications.

reply

BrendanEich 2 hours ago

See https://news.ycombinator.com/item?id=1894374 from @nix.

reagency 2 hours ago

Back when Java Applets were a thing, Sun wasn't friendly with browser makers. JavaScript was a gimmicky alternative that was created by a browser manufacturer. It had the foothold, and it grew.

Now Oracle isn't interested in the Web.

reply

nix 1679 days ago

My admittedly biased view: I spent two years of my life trying to make the JVM communicate gracefully with Javascript - there were plenty of us at Netscape who thought that bytecode was a better foundation for mobile code. But Sun made it very difficult, building their complete bloated software stack from scratch. They didn't want Java to cooperate with anything else, let alone make it embeddable into another piece of software. They wrote their string handling code in an interpreted language rather than taint themselves with C! As far as I can tell, Sun viewed Netscape - Java's only significant customer at the time - as a mere vector for their Windows replacement fantasies. Anybody who actually tried to use Java would just have to suffer.

Meanwhile Brendan was doing the work of ten engineers and three customer support people, and paying attention to things that mattered to web authors, like mixing JS code into HTML, instant loading, integration with the rest of the browser, and working with other browser vendors to make JS an open standard.

So now JS is the x86 assembler of the web - not as pretty as it might be, but it gets the job done (GWT is the most hilarious case in point). It would be a classic case of worse is better except that Java only looked better from the bottom up. Meanwhile JS turned out to be pretty awesome. Good luck trying to displace it.

SWF was the other interesting bytecode contender, but I don't know much about the history there. Microsoft's x86 virtualization tech was also pretty cool but they couldn't make it stick alone.