proj-oot-ootImplementationNotes3

(Simon Marlow on GHC:)

simonmar on Sept 4, 2010 [-]

Oh yes, immutability is crucial. Generational GC already makes you pay for mutation with a write barrier, and in our parallel GC we omit the locking when copying immutable objects, accepting that a few might get copied twice. In the new GC mutable objects become even more expensive. I don't think local GC is viable at all in a language with ubiquitous mutation.
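
(a minimal sketch of such a write barrier, with made-up names; the point is that every pointer store pays for the generation check, while immutable objects never execute it after construction:)

    /* hypothetical remembered-set write barrier for a generational GC */
    typedef struct Obj Obj;
    struct Obj {
        int gen;        /* 0 = young generation, 1 = old */
        Obj *field;
    };

    void remembered_set_add(Obj *o);   /* assumed runtime hook */

    /* every mutation of a pointer field must go through this: */
    void write_field(Obj *dst, Obj *val) {
        if (dst->gen > 0 && val != NULL && val->gen == 0)
            remembered_set_add(dst);   /* old-to-young edge: minor GC root */
        dst->field = val;
    }

and since immutable objects can never be written again, two threads racing to copy one can at worst waste space; the copies can't diverge, which is presumably why the locking can be omitted.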

---

Roboprog on Sept 3, 2010 [-]

An interesting approach: giving each thread its own "young generation" sub-heap, so transient objects can be disposed of without coordination from other threads / CPUs and their cache pages.

scott_s on Sept 3, 2010 [-]

I worked on a memory allocator (as in malloc, not garbage collection) that takes a similar approach: http://people.cs.vt.edu/~scschnei/streamflow/

A group at Intel independently came up with a similar approach as well: http://portal.acm.org/citation.cfm?id=1133967

Roboprog on Sept 6, 2010 [-]

Cool, the streamflow thing sounds interesting. Is there a top level example or test-driver somewhere in the github project showing what typical use-cases are?

E.g. - I have a test-driver here: http://github.com/roboprog/buzzard/blob/master/test/src/main... (although I have barely started the library I was tinkering on)

Any example client program for your allocator? I'd like to see what use cases you are handling.

scott_s on Sept 6, 2010 [-]

The Larson and Recycle benchmarks are on github. You can read about them in the paper. Email me if you'd like to see an unpublished paper which has some more detail on the allocator's design.

larsberg on Sept 3, 2010 [-]

We've been doing this in Manticore since 2008 or so. We couldn't really get speedups past 12 cores without it (we have a "typical" parallel GC implemented as well to test against). Hopefully we'll get the paper on the GC and associated language trickery - we don't allow pointers between young generations - in somewhere soon :)

simonmar on Sept 3, 2010 [-]

Yes, the GHC design has certainly been influenced by Manticore (that was one of the "other designs" I referred to). Though in GHC we do have some different problems to solve, the worst of which is that we have to support a bunch of programming abstractions that use mutation.

---

interesting (unimplemented?) PEP on Python bytecode verification: http://legacy.python.org/dev/peps/pep-0330/

---

some things we might want to use machine registers for when building an OVM interpreter in assembly:

---

http://www.dyncall.org/

"

dyncall library

The dyncall library encapsulates architecture-, OS- and compiler-specific function call semantics in a virtual "bind argument parameters from left to right and then call" interface, allowing programmers to call C functions in a completely dynamic manner. In other words, instead of calling a function directly, the dyncall library provides a mechanism to push the function parameters manually and to issue the call afterwards. This means that a program can determine at runtime what function to call, and what parameters to pass to it. The library is written in C and assembly and provides a very simple C interface to program against. "
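
a usage sketch pieced together from the dyncall manual (API names are from its docs; verify against the current headers):

    #include <dyncall.h>
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        DCCallVM *vm = dcNewCallVM(4096);   /* size of the argument buffer */
        dcMode(vm, DC_CALL_C_DEFAULT);      /* select the C calling convention */
        dcReset(vm);
        dcArgDouble(vm, 2.0);               /* bind parameters left to right... */
        double r = dcCallDouble(vm, (DCpointer)&sqrt);   /* ...then call */
        printf("sqrt(2.0) = %f\n", r);
        dcFree(vm);
        return 0;
    }

this is the shape of primitive an interpreter's FFI needs: both the callee and its arguments can be chosen at runtime.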

---

i think i already noted this somewhere but some of these comments are a good read:

https://news.ycombinator.com/item?id=10032295

---

" up vote 13 down vote accepted

There are a number of papers on different kinds of dispatch:

M. Anton Ertl and David Gregg, Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters, in Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI 03), pp. 278-288, San Diego, California, June 2003.

M. Anton Ertl and David Gregg, The behaviour of efficient virtual machine interpreters on modern architectures, in Proceedings of the 7th European Conference on Parallel Computing (Europar 2001), pp. 403-412, LNCS 2150, Manchester, August 2001.

An excellent summary is provided by Yunhe Shi in his PhD thesis.

Also, someone discovered a new technique a few years ago which is valid ANSI C.

"

---

Doug's C# parser framework:

https://github.com/drubino/Linq.Parsers

---

"...can use FASM, which is self-hosting, if I can't compile NASM or YASM. " [1]

---

[2] [3]

" Context-free grammars. What this really means is the code should be parsable without having to look things up in a symbol table. C++ is famously not a context-free grammar. A context-free grammar, besides making things a lot simpler, means that IDEs can do syntax highlighting without integrating most of a compiler front end. As a result, third-party tools become much more likely to exist. Redundancy. Yes, the grammar should be redundant. You've all heard people say that statement terminating ; are not necessary because the compiler can figure it out. That's true — but such non-redundancy makes for incomprehensible error messages. Consider a syntax with no redundancy: Any random sequence of characters would then be a valid program. No error messages are even possible. A good syntax needs redundancy in order to diagnose errors and give good error messages. "

"The first tool that beginning compiler writers often reach for is regex. Regex is just the wrong tool for lexing and parsing. Rob Pike explains why reasonably well."

" The philosophies of error message handling are:

    Print the first message and quit. This is, of course, the simplest approach, and it works surprisingly well. Most compilers' follow-on messages are so bad that the practical programmer ignores all but the first one anyway. The holy grail is to find all the actual errors in one compile pass, leading to:
    Guess what the programmer intended, repair the syntax trees, and continue. This is an ever-popular approach. I've tried it indefatigably for decades, and it's just been a miserable failure. The compiler seems to always guess wrong, and subsequent messages with the "fixed" syntax trees are just ludicrously wrong.
    The poisoning approach. This is much like how floating-point NaNs are handled. Any operation with a NaN operand silently results in a NaN. Applying this to error recovery, and any constructs that have a leaf for which an error occurred, is itself considered erroneous (but no additional error messages are emitted for it). Hence, the compiler is able to detect multiple errors as long as the errors are in sections of code with no dependency between them. This is the approach we've been using in the D compiler, and are very pleased with the results."
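
(the poisoning approach in miniature, on a hypothetical typechecker; one message at the first bad node, silence for everything built on top of it, but errors in independent subtrees still each get reported:)

    typedef enum { TY_INT, TY_STR, TY_ERROR } Type;   /* TY_ERROR ~ NaN */

    typedef struct Expr {
        struct Expr *lhs, *rhs;   /* NULL for leaves */
        Type type;
    } Expr;

    void report(const char *msg);   /* assumed diagnostics hook */

    Type check(Expr *e) {
        if (e->lhs == NULL)
            return e->type;                    /* leaf */
        Type l = check(e->lhs), r = check(e->rhs);
        if (l == TY_ERROR || r == TY_ERROR)
            return e->type = TY_ERROR;         /* poisoned: no new message */
        if (l != r) {
            report("operand type mismatch");   /* reported exactly once */
            return e->type = TY_ERROR;
        }
        return e->type = l;
    }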

" Runtime Library

Rarely mentioned, but critical, is the need to write a runtime library. This is a major project. It will serve as a demonstration of how the language features work, so it had better be good. Some critical things to get right include:

    I/O performance. Most programs spend a lot of time in I/O. Slow I/O will make the whole language look bad. The benchmark is C stdio. If the language has elegant, lovely I/O APIs, but runs at only half the speed of C I/O, then it just isn't going to be attractive.
    Memory allocation. A high percentage of time in most programs is spent doing mundane memory allocation. Get this wrong at your peril.
    Transcendental functions. OK, I lied. Nobody cares about the accuracy of transcendental functions, they only care about their speed. My proof comes from trying to port the D runtime library to different platforms, and discovering that the underlying C transcendental functions often fail the accuracy tests in the D library test suite. C library functions also often do a poor job handling the arcana of the IEEE floating-point bestiary — NaNs, infinities, subnormals, negative 0, etc. In D, we compensated by implementing the transcendental functions ourselves. Transcendental floating-point code is pretty tricky and arcane to write, so I'd recommend finding an existing library you can license and adapting that.
    A common trap people fall into with standard libraries is filling them up with trivia. Trivia is sand clogging the gears and just dead weight that has to be carried around forever. My general rule is if the explanation for what the function does is more lines than the implementation code, then the function is likely trivia and should be booted out."

---

[4]

" String I/O should be unicode-aware & support utf-8. Binary I/O should exist. Console I/O is nice, and you should support it if only for the sake of having a REPL with readline-like features. Basically all of this can be done by making your built-in functions wrappers around the appropriate safe I/O functions from whatever language you’re building on top of (even C, although I wouldn’t recommend it). It’s no longer acceptable to expect strings to be zero-terminated rather than length-prefixed. It’s no longer acceptable to have strings default to ascii encoding instead of unicode. In addition to supporting unicode strings, you should also probably support byte strings, something like a list or array (preferably with nesting), and dictionaries/associative arrays. It’s okay to make your list type do double-duty as your stack and queue types and to make dictionaries act as classes and objects. Good support for ranges/spans on lists and strings is very useful. If you expect your language to do string processing, built-in regex is important. If you provide support for parallelism that’s easier to manage than mutexes, your developers will thank you. While implicit parallelism can be hard to implement in imperative languages (much easier in functional or pure-OO languages), even providing support for thread pools, a parallel map/apply function, or piping data between independent threads (like in goroutines or the unix shell) would help lower the bar for parallelism support. Make sure you have good support for importing third party packages/modules, both in your language and in some other language. Compiled languages should make it easy to write extensions in C (and you’ll probably be writing most of your built-ins this way anyway). If you’re writing your interpreted language in another interpreted language (as I did with Mycroft) then make sure you expose some facility to add built-in functions in that language. For any interpreted language, a REPL with a good built-in online help system is a must. Users who can’t even try out your language without a lot of effort will be resistant to using it at all, whereas a simple built-in help system can turn exploration of a new language into an adventure. Any documentation you have written for core or built-in features (including documentation on internal behavior) should be assessible from the REPL. This is easy to implement (see Mycroft’s implementation of online help) and is at least as useful for the harried language developer as for the new user. "

---

ufo 7 days ago [-]

Deoptimization is actually really hard to implement if you have an ahead of time compiler like Cannoli. You need to get all the stuff that is living in machine registers or in the C stack and then convert them back to whatever representation your generic interpreter uses.

I think this is actually one of the things that most get in the way if you want to use traditional AOT compiler technology (like gcc or LLVM) to implement a JIT. In state of the art JIT compilers this part is always a nest of highly complex and nonportable assembly language.

reply

---

tathougies 7 days ago [-]

Compiling to machine code is not a panacea for optimization. An optimized JIT compiler is going to blow an AOT compiler out of the water. Being smart about the machine code generated is significantly more important than generating machine code. In particular, PyPy makes several optimizations over python code that a more direct implementation of CPython at the machine level probably wouldn't. For example, PyPy erases dictionary lookups for object member access if the object shape is statically known. Given how prevalent this kind of lookup is in Python code, it's possible that even an interpreter that made this optimization would be faster than a machine code version that used an actual hash table.

I think this compiler also makes this particular optimization, but this is just one of many many optimizations PyPy does. I imagine that with sufficient work, this compiler could be brought up to speed with PyPy, but as it stands right now, PyPy simply benefits from having years of optimization work that a new project doesn't.

reply
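
(note: the "object shape" trick above in miniature; hypothetical C, not PyPy's actual mechanism. objects sharing a shape store attributes at fixed offsets, so an access site that keeps seeing one shape turns the dictionary lookup into an indexed load:)

    #include <string.h>

    typedef struct Shape { const char *names[8]; int count; } Shape;
    typedef struct Obj   { Shape *shape; double slots[8]; } Obj;

    /* slow path: search the shape's name table */
    static int offset_of(Shape *s, const char *name) {
        for (int i = 0; i < s->count; i++)
            if (strcmp(s->names[i], name) == 0)
                return i;
        return -1;   /* missing attribute; error handling elided */
    }

    /* per-site inline cache: remember the last (shape, offset) pair */
    double get_attr(Obj *o, const char *name, Shape **cached, int *off) {
        if (o->shape != *cached) {     /* miss: redo lookup, refill cache */
            *cached = o->shape;
            *off = offset_of(o->shape, name);
        }
        return o->slots[*off];         /* hit: one compare + one load */
    }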

ori_b 7 days ago [-]

For most dynamic languages, the available speedups aren't in simple compilation, but in removing the runtime type checks, method lookups, and other slow operations. This needs the ability to guess at what the code is going to do based on past behavior, and generate specialized versions that get thrown away if the guesses are invalidated.

So, for example, you might see that the last 100 calls to a function were done with integers, so you can generate a variant of the function that only works for integers, and check if it's applicable when you enter the function. If that function stops getting used, you can throw it away.

Doing that well ahead of time requires an extremely good idea of how the program will behave at run time, and even with good information, is still very likely to bloat up your binary hugely. (Facebook used to compile their PHP codebase to a multi-gigabyte binary before moving to HHVM, for example).

reply
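
(note: ori_b's guess-and-guard idea in miniature; hand-written C standing in for what a JIT would emit as machine code:)

    typedef enum { TAG_INT, TAG_FLOAT } Tag;
    typedef struct { Tag tag; union { long i; double f; } u; } Value;

    Value add_generic(Value a, Value b);   /* full dynamic dispatch, elided */

    /* specialized variant, emitted after observing int+int 100 times */
    Value add_int_fast(Value a, Value b) {
        if (a.tag == TAG_INT && b.tag == TAG_INT) {       /* the guard */
            Value r = { TAG_INT, { .i = a.u.i + b.u.i } };
            return r;                                     /* fast path */
        }
        return add_generic(a, b);    /* guess invalidated: fall back */
    }

doing this ahead of time means emitting every variant you might ever need up front, which is where the binary bloat ori_b mentions comes from.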

collyw 7 days ago [-]

Actually I think you answered a question that I already asked about Rust being faster than C. If you don't need to carry out as many checks then I see how that will speed things up.

reply

xapata 7 days ago [-]

JITs get to analyze both code and data and optimize for each machine deployed to. A static compiler can only analyze code and the machine used for compilation. If dependencies were pre-compiled, the static compiler won't be able to optimize their relationship with the project. And if the machine is changed for deployment, the machine-specific optimizations made at compile time no longer match the hardware.

More information means better optimizations. JITs FTW.

reply

jcranmer 7 days ago [-]

JITs also tend to optimize for compilation time over final performance. Doing ahead-of-time superoptimization or polyhedral loop transformation isn't going to happen in a JIT.

reply

xapata 7 days ago [-]

There's no restriction on the kind of optimization a JIT can do. Perhaps there's a current implementation tendency. In contrast, ahead-of-time (data- and platform-ignorant) compilers are restricted.

reply

---

example of compilation to Rust:

https://github.com/joncatanio/cannoli https://github.com/joncatanio/cannoli/blob/master/resources/papers/cannoli-thesis-paper.pdf

---

" 4.1 Why Compile Python to Rust? We considered three different intermediate representations to compile Python code into. The first was LLVM [17], to leverage the many optimizations implemented by the LLVM compiler. Targeting LLVM, however, would require implementing a garbage collector (or simply ignoring memory management for this prototype). The implementation would be simplified by implementing a standard library to handle various elements of the language, but writing this library in LLVM was considered to not be ideal. Writing the library in C, compiled to LLVM code, would eliminate some of the complexity of writing the library directly in LLVM. Following this idea further we considered targeting C and compiling against a library written in C. Unfortunately, this does not address the issue of memory management. Targeting Rust and compiling the output Rust code against a library written in Rust was considered the best of both worlds. We get a large number of optimizations from the Rust compiler (and the LLVM compiler) as well as memory management provided by Rust’s ownership rules. Therefore, Cannoli is a Python compiler written in Rust that compiles Python code into Rust code. " -- [5]

---

chubot 23 days ago

(on: Python startup time: milliseconds matter)

This is disappointing to me too, but I think there are some problems baked in to the language that make it hard.

The import code in CPython was a mess, which was apparently cleaned up by importlib in Python 3, through tremendous effort. But unfortunately I think importlib made things slower?

I recall a PyCon talk where as of 3.6, essentially everything about Python 3 is now faster than Python 2, EXCEPT startup time!

This is a shame, because I would have switched to Python 3 for startup time ALONE. (As of now, most of my code and that of my former employer is Python 2.) That would have been the perfect time to address startup time, because getting a 2x-10x improvement (which is what's needed) requires breaking changes.

I don't think there's a lack of interest in the broader Python community, but there might be a lack of interest/manpower in the core team, which leads to the situation wonderfully summarized in the recent xkcd:

https://xkcd.com/1987/

FWIW I was the one who sent a patch to let Python run a .zip file back in 2007 or so, for Python 2.6 I think. This was roughly based on what we did at Google for self-contained applications. A core team member did a cleaner version of my patch, although this meant it was undocumented until Python 3.5 or so:

https://docs.python.org/3/library/zipapp.html

The .zip support at runtime was a start, but it's really the tooling that's a problem. And it's really the language that inhibits tooling.

Also, even if you distributed self-contained applications, the startup time is not great. It's improved a bit because you're "statting" a zip file rather than making syscalls, but it's still not great.

In other words, I have wondered about this "failure" for over a decade myself, and even tried to do something about it. I think the problem is that there are multiple parts to the solution, the responsibility for these parts is distributed. I hate to throw everything on the core team, but module systems and packaging are definitely a case where "distributed innovation" doesn't work. There has to be a central team setting standards that everyone else follows.

Also, it's not a trivial problem. Go is a static language and is doing better in this regard, but still people complain about packaging. (vgo is coming out after nearly a decade, etc.)

I should also add that while I think Python packaging is in the category of "barely works", I would say the same is true of Debian. And Debian is arguably the most popular Linux package manager. They're cases of "failure by success".

...

FWIW I think importing is heavily bottlenecked by I/O, in particular stat() of tons of "useless" files. In theory the C to Python change shouldn't have affected it much. But I haven't looked into it more deeply than that.

chubot 23 days ago [-]

EDIT: I should also add that the length of PYTHONPATH as constructed by many package managers is a huge problem. You're doing O(m*n) stat()s -- random disk access -- which is the slowest thing your computer can do.

m is the number of libraries you're importing, and n is the length of the PYTHONPATH.

So it gets really bad, and it's not just one person's "fault". It's a collusion between the Python interpreter's import logic and how package managers use it.

---

probably recursive descent with a shunting-yard subroutine for operator expression parsing (to handle precedence; pure recursive descent can't handle that [6], section 'The Operator Issue as of 1961')

instead of shunting yard, mb Pratt parsing (also called 'precedence climbing') [7]

alternately, parse operator expressions as lists, then add associativity in postprocessing.

" In addition, while pure recursive descent cannot parse operator expressions, it can recognize them. This means pure recursive descent may not be able to create the parse subtree for an operator expression itself, but it can recognize the expression and hand control over to a specialized operator expression parser. This seems to be what Lucas' 1961 algorithm did, and it is certainly what many other implementations did afterwards. Adding the operator expression subparser makes the implementation only quasi-Chomskyan, but this was a price the profession has been willing to pay.

Alternatively, a recursive descent implementation can parse operator expressions as lists, and add associativity in post-processing. This pushes some of the more important parsing out of the syntactic phase into the semantics but, once again, it seemed that Chomskyan purity had to be thrown overboard if the ship was to stay afloat. " [8]

see also the section https://jeffreykegler.github.io/personal/timeline_v3#h1-the_operator_issue_as_of_1968

of course, the author of that suggests his parser, MARPA.
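
(a minimal precedence-climbing sketch, toy code over single-digit operands and one-character operators, left-associative only; atom() is where the recursive-descent parser for primaries would be plugged in, which is exactly the hand-off described in the quote above:)

    #include <stdio.h>

    static const char *src;   /* cursor into the expression text */

    static int prec(char op) {            /* 0 = not a binary operator */
        switch (op) {
        case '+': case '-': return 1;
        case '*': case '/': return 2;
        default:            return 0;
        }
    }

    static long atom(void) { return *src++ - '0'; }   /* single digit */

    /* parse operators whose precedence is >= min_prec (min_prec >= 1) */
    static long expr(int min_prec) {
        long lhs = atom();
        while (prec(*src) != 0 && prec(*src) >= min_prec) {
            char op = *src++;
            long rhs = expr(prec(op) + 1);   /* +1 makes op left-associative */
            switch (op) {
            case '+': lhs += rhs; break;
            case '-': lhs -= rhs; break;
            case '*': lhs *= rhs; break;
            case '/': lhs /= rhs; break;
            }
        }
        return lhs;
    }

    int main(void) {
        src = "1+2*3-4";
        printf("%ld\n", expr(1));   /* prints 3 */
        return 0;
    }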

---

" Julia’s optimizer has gotten smarter in more ways than we can list here, but a few highlights are worth mentioning. The optimizer can now propagate constants through function calls, allowing much better dead-code elimination and static evaluation than before. The compiler is also much better at avoiding allocation of short-lived wrappers around long-lived objects, which frees programmers to use convenient high-level abstractions without performance costs. " [9]

---

"On some machines indirection is slower with displace- ment, so the most-used member of a structure or a record should be first. " -- Mike Morton Boston, Massachusetts

---

C compiler with support for structs written in Assembly (nongnu.org) https://bootstrapping.miraheze.org/wiki/Stage0 http://git.savannah.nongnu.org/cgit/stage0.git/tree/stage2/cc_x86.s http://git.savannah.nongnu.org/cgit/stage0.git/tree/ISA_HEX_Map.org https://news.ycombinator.com/item?id=17851311

mmastrac 1 day ago [-]

I started working on a similar VM-based bootstrap for getting from bare metal to C compiler. If anyone is interested in collaborating let me know.

The idea is that you implement a simple, ASCII-based VM for your platform and that's enough to bootstrap a number of different assemblers, to a basic C compiler, to a full-fledged C compiler and (very minimal) POSIX environment.

The goal is twofold: a base for trusted compilation like this one, and a way to guarantee long-term executability of various archiving programs (ie: guarantee that you can unrar a file in 20 years with minimal work).

EDIT: very rough repo is here - https://github.com/mmastrac/bootstrap

reply

giomasce 1 day ago [-]

I am also doing something like that here:

https://gitlab.com/giomasce/asmc

My philosophy is of having a very simple and custom language (which I called G) for which it is easier to write a compiler in (x86) Assembly, then write a C compiler in G that is good enough to compile tcc. Then tcc is known to be able to compile gcc 4.7.4 (the last version which does not require a C++ compiler).

My C compiler is starting to have a shape, and in theory (if I find the time to work on it) it should not be far from supporting all that tcc needs.

The README in the linked page contains more information.

I'll look at your code too!

reply

---

so the reference implementation will be self-hosted.

but perhaps we can also have a more performant implementation written in Rust?

---

recall that the current idea with Boot is:

make Boot the sorta-intersection of LLVM, WASM, RISCV -- but make it really easy to implement (eg no infinite registers, unlike LLVM). That way it should be dead simple to port Boot to any platform.

---

why not to target Rust, and why to target LLVM instead:

https://news.ycombinator.com/item?id=12148124

---

the CCL (a lisp) implementation can compile itself in about 30 seconds [10]. So i guess that's a good goal

---

why Rust abandoned segmented stacks: https://mail.mozilla.org/pipermail/rust-dev/2013-November/006314.html

this guy approves: http://dtrace.org/blogs/bmc/2018/09/18/falling-in-love-with-rust/

golang also abandoned segmented stacks (but still has resizable stacks):

" Golang first started with a segmented-stacks model where the stack would actually expand into a separate area of memory, using some clever bookkeeping to keep track. A later implementation improved performance in specific cases by instead using a contiguous stack where instead of splitting the stack, much like resizing a hashtable, a new large stack is allocated and through some very tricky pointer manipulation, all the contents are carefully copied into the new, larger, stack. [return] " [11]

---

Rust sort of abandoned M:N greenthreads:

https://github.com/rust-lang/rfcs/blob/0806be4f282144cfcd55b1d20284b43f87cbe1c6/text/0230-remove-runtime.md

this guy approves:

http://dtrace.org/blogs/bmc/2018/09/18/falling-in-love-with-rust/

note however that this limits concurrency, so in Oot, we probably do want greenthreads: https://rcoh.me/posts/why-you-can-have-a-million-go-routines-but-only-1000-java-threads/

---

some things that many languages (eg oot) probably should have eventually:

https://hacks.mozilla.org/2018/10/webassemblys-post-mvp-future/

---

" Unknown said...

    "offset + x < length" is not a safe form for a bounds check, if x can be a value decoded from the network data. If x can have a value close to the maximum positive value for its type, then "offset + x" may overflow and compare less than length, even though x is not within bounds. Instead, compare x to length - offset, knowing that offset <= length is an invariant.

"

---

Ericson2314 1 day ago [-]

IMO most discussion of the compilation speed is incredibly narrow-minded. Value incremental compilation more than anything else. Programs barely change each rebuild, so that should solve all the performance problems without having to maim ourselves.

reply

---

https://www.reddit.com/r/rust/comments/awx9cy/github_kyrenluster_an_experimental_lua_vm/ https://github.com/kyren/luster

" Inside luster are two libraries called "gc-arena" and "gc-sequence", and they represent a new (I believe novel?) system for safe garbage collection in Rust. There have been several attempts here before such as rust-gc and shifgrethor, and this represents another attempt with... different? limitations more appropriate for implementing language runtimes like Lua. "

---

" Major changes

    The lockfile (and configuration) format will become a strict subset of YAML. In the process, the lockfile information will be slightly changed to account for some long-awaited changes (such as #5892).
    We'll add support for plugins, which will be able to alter various things - from adding new commands to hooking into the resolution / fetching / linking steps to add support for new package sources or install targets.
    Related to the plugin system, Yarn will become an API as much as a CLI. You can expect to be able to require it and start using its components in your script - no need to parse your package.json anymore, no need to run the resolution .. Yarn will abstract all those tedious tasks away.
    Support for both Node 4 and Node 6 will be dropped. We don't expect Berry to have stable releases until after Node 6 gets EOL (expected April 2019), by which point it won't matter anymore.
    The log system will be overhauled - one thing in particular we'll take from Typescript are diagnostic error codes. Each error, warning, and sometimes notice will be given a unique code that will be documented - with explanations to help you understand how to unblock yourself.
    Some features currently in the core (such as autoclean) will be moved into contrib plugins. They'll still be supported, but might have a different release cycle than the standard bundle.
    The codebase will be ported from Flow to TypeScript. To understand the rationale please continue reading, but a quick summary is that we hope this will help our community ramp up on Yarn, and will help you build awesome new features on top of it.
    The cache file format will switch from Tar to Zip, which offers better characteristics in terms of random access."

scrollaway 40 days ago [-]

Very happy to see yarn.lock will finally be a proper format that won't need its own parser. YAML subset is a pretty good choice, though I think canonicalized, indented JSON would be a better choice for this use case. Incidentally that's what npm uses as lockfile, I wonder if there's room to have the two package managers share the format (or even share the file itself).

Very excited to see shell compatibility guarantee in scripts as well. Using environment variables in scripts is a pain right now.

Finally one of the biggest news is the switch from Flow to Typescript. I think it's now clear that Facebook is admitting defeat with Flow; it brought a lot of good in the scene but Typescript is a lot more popular and gets overall much better support. Uniting the JS ecosystem around Typescript will be such a big deal.


wopian 40 days ago [-]

npm's lockfile is a pain to diff in PRs because of the JSON format where what was maybe 20 changed lines in yarn is upwards of 80 from the brackets.

With YAML and whatever format yarn.lock was in, the only changed lines are changes to the version resolutions, hash and dependencies.

donatj 40 days ago [-]

I'd say safely merging YAML diffs however could be trouble.

I don't know how restricted their YAML subset is, but in my experience it's so loose a format the only way to be sure YAML says what you think it says is to run it through a parser.

---

https://v8.dev/blog/spectre

---

when doing register allocation/assignment (allocation means deciding which variables get to be in registers; assignment means actually deciding which particular register each variable goes in), you hear a lot about 'linear scan' as the simple/low-latency way, but this is still a 'global' method; the top answer at [12] described a 'local' method (local to each basic block).

another slide presentation i found once said something about two obvious local methods, 'bottom up' and 'top down'. I think 'top down' means to identify all of the variables in a basic block, rank them (eg by number of uses), and keep the top-ranked ones in registers for the whole block, while 'bottom up' hands out registers on the fly, instruction by instruction, spilling something when none are free.

for Oot's reference implementation (involving OVM->BootX->Boot) we clearly just want a simple local method, because if we cared about performance, we'd go from OVM directly into the native platform, skipping BootX and Boot. In fact for BootX->Boot we should just use 'top down' and say that registers thru r7 map into registers and the rest are in memory somewhere -- this is easy to program, fast to compile, and has predictable performance for BootX programmers.
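
(a sketch of 'top down' for a single basic block, using hypothetical structures rather than actual Boot code: rank the virtual registers by use count and give machine registers to the K most-used; the BootX->Boot scheme above is the even simpler degenerate case where the mapping is fixed in advance:)

    #include <stdlib.h>

    enum { K = 8, NVREG = 64 };   /* K machine registers available */

    typedef struct { int vreg, uses; } Cand;

    static int by_uses_desc(const void *a, const void *b) {
        return ((const Cand *)b)->uses - ((const Cand *)a)->uses;
    }

    /* out: assign[v] = machine register for v, or -1 = keep in memory */
    void alloc_block_topdown(const int uses[NVREG], int assign[NVREG]) {
        Cand c[NVREG];
        for (int v = 0; v < NVREG; v++) {
            c[v].vreg = v;
            c[v].uses = uses[v];
            assign[v] = -1;
        }
        qsort(c, NVREG, sizeof c[0], by_uses_desc);
        for (int i = 0; i < K && c[i].uses > 0; i++)
            assign[c[i].vreg] = i;    /* the K most-used get registers */
    }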

should copy this stuff into plbook after i learn a little more about it.

---

level 1 gasche 10 points · 6 years ago

How does performance compare to OcamlJit2?

The usual problems with C or LLVM backends for OCaml are:

    having to use a potentially costly calling convention (whereas OCaml's is optimized for nice tail call elimination and currying)
    trying to match OCaml extremely fast compilation of exceptions raising and catching
    interacting with the garbage collection, which may pass GC roots in registers rather than on the stack, something that is not often planned for in runtimes parametrized over an optional, generic GC

Even when targeting languages that are optimized to death (LLVM, C), these three issues alone are often enough to make programs slower than when using the apparently rather naive (not globally-optimizing) native OCaml compilation. Good choices of data structure representation, GC, calling and exception compilation can go a very long way, that more and more complex optimization techniques actually have a hard time competing with. Spending a man-year of work on a bright new register allocator might not actually do much better on real-world code if function calls are slower for fundamental architectural reasons.

level 2 julesjacobs 4 points · 6 years ago · edited 6 years ago

You don't have to use the C calling convention with LLVM; it has a calling convention named "fastcc" that also supports tail calls. I'm not sure how fast it is compared to OCaml though.

The last two problems can be greatly reduced at the cost of slightly more expensive function calls. You transform each function so that it can exit in multiple different ways: return normally, throw exception, capture stack for GC. This can be done by returning a return code for each type of return. On an "throw exception" type return, if there is an exception handler in the current stack frame, invoke the handler. Otherwise return to the next stack frame (again with a "throw exception" return code). On an "capture stack for GC" type return, record the live local variables as GC roots, and continue further capture of the stack by also returning a "capture stack for GT" return code from this stack frame. This approach has the advantage that it supports reasonably fast exceptions, GC roots are in registers, and the exception control flow is exposed as normal control flow to LLVM optimization passes. This disadvantages are that throwing an exception across many stack frames is slow, and after each function return you have to check a return code. You also need some ingenuity to rebuild the stack after unrolling it for GC.

You can use the same stack capturing strategy to implement delimited continuations without CPS or unsafe stack manipulation. level 2 Raphael_Amiard 2 points · 6 years ago

Your comment is spot on, and you successfully listed every problem this project has.

Additionally, this project is running OCaml bytecode, which is untyped, so it loses a bit more potential for optimization.

    About 1, you can use fastcc that supports tail calls, and that's what we are using.
    About 2, spot on, we're using setjmp and longjmp. If you're doing a backend for ocamlopt rather than a bytecode compiler, you might be able to use LLVM zero cost exceptions, but i don't know what the speed would be compared to ocamlopt.
    About 3, spot on, LLVM support for garbage collection is bad in this regard, you can't have roots in registers.

Some more information about the trade-offs i made here
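
(a sketch of the multi-way-return transformation julesjacobs describes, with hypothetical scaffolding; every call site switches on how its callee returned:)

    typedef enum { RET_NORMAL, RET_EXC, RET_GC } RetCode;

    void record_gc_roots(long *vals, int n);   /* assumed runtime hook */

    RetCode callee(long *out);

    RetCode caller(long *out) {
        long x = 42, y;                 /* live locals can stay in registers */
        switch (callee(&y)) {
        case RET_NORMAL:
            break;
        case RET_EXC:
            return RET_EXC;             /* no handler here: keep unwinding */
        case RET_GC:
            /* record this frame's live roots, keep capturing the stack */
            record_gc_roots(&x, 1);
            return RET_GC;
        }
        *out = x + y;
        return RET_NORMAL;
    }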

---

 steveklabnik 18 days ago [-]

> It does make me thinking about how this could be useful in the context of Futures and async/await.

At its core, when you pass a Future (really, a chain of futures, but in the end that's still a Future) to an executor, the executor creates a stack for it. An executor with multiple futures will then switch between them, sorta similar to the code seen here. The details depend on the exact executor, of course, but the fundamental idea is the same.

One of the nice things about the futures design is that you can know exactly how big the stack needs to be, so the resizing stuff isn't an issue.

in a discussion on a tutorial on implementing greenthreads: https://news.ycombinator.com/item?id=20164062

---

cosarara 20 hours ago [-]

How does this approach play with error handling? If there is an error in my code and it's only caught at step 10, won't the error message be completely inscrutable? Will the line numbers make sense?

reply

thedufer 18 hours ago [-]

In the only compiler I'm somewhat familiar with (OCaml), a decent amount of the work is done in AST->AST transformations. In the AST, every node is annotated with a location, which is a tuple of filename, line number, char start, char end. As long as these are propagated in the right way, you get good location information in errors.

reply

sterlind 18 hours ago [-]

A simple approach could be to mark up spans of characters in layer N with what spans were used to generate them in layer N-1. Then when parser N hits an error, put red squiggly lines under the source characters responsible. For instance, in C#:

using (<span decl>var x = 3</span>) { <span inner>Console.WriteLine(x);</span>}

Parser 0 transforms the using:

<span stmt1 parent=decl>var x = 3;</span> try { <span stmt2 parent=inner> Console.WriteLine(x);</span> } finally { <span stmt3=generated>x.Dispose();</span> }

Parser 1 (or a later parser, inductively) notices x doesn't have Dispose() defined, and throws MissingMethod(Dispose, span=generated).

Parser 0 catches this error, knows that this is because x isn't disposable, and emits its own error: NotDisposable(span=decl).

If a parser doesn't know to catch the error, it still passes it up the chain, transforming spans where appropriate. When it hits the IDE, the error is printed and the red squiggly lines go where they should.

(Disclaimer: I've never written a compiler)

reply

mojuba 19 hours ago [-]

Line number information is usually passed down the chain because it's required for generating debug info anyway.

However, as you descend into deeper optimization layers of the compiler, errors are less and less likely to occur. Optimizations are either possible or not, though sometimes these layers can warn you about possible inefficiencies at the source code level (unused local, etc.)

reply

---

fluffything 1 day ago [-]

The Rust compiler now has features that few or no C and C++ compilers have: incremental compilation within a single translation unit, pipelined compilation, multi-threaded and lazy query-based compilation, ...

Implementing each of these features has required whole-program refactorings in a large-scale codebase, performed by a few individuals while hundreds of other developers were simultaneously evolving the software and implementing new features.

Having been a C++ programmer for over 10 years, none of these refactorings would have paid off in C++ because of how time consuming it would have been to track all bugs introduced by them.

Yet they do pay off in Rust because of my favourite Rust feature: the ability to refactor large-scale software without breaking anything: if a non-functional change compiles, "it works" for some definition of "works" that's much better than what most other languages give you (no memory unsafety, no data-races, no segfaults...).

Very few programming languages have this feature, and no low-level programming language except for Rust has it.

fluffything 1 day ago [-]

I just use `cargo watch -x test -d 5` to run all my tests after I pause modifying code for 5 seconds.

My editor (emacs) uses `cargo watch -x check -d 0.5` to run `cargo check` (which is blazing fast for incremental edits) to type check all the code and show "red squiggles" with the error messages inline.

So my interactive workflow with Rust is only edit-"type check"-edit-"type check" where "type check" takes often less than a second.

Asynchronously in the background, the whole test suite (or the part that makes sense for what I'm doing) is always being run. So if a test actually fails to run, I discover that a little bit later.

I don't know any language for which running all tests happens instantaneously. With Rust, if anything, I write less tests because there are many things I don't need to check (e.g. what happens on out-of-bounds accesses).

This is the best workflow I've ever had. In C++ there was no way to only run type checking, so one always had to run the full compilation and linking stages, where linking took a long time, and you could get template instantiation errors quite late, and well, linking errors. I don't think I've ever seen a Rust linking error. They probably do happen, but in C++ they happen relatively often (at least once per week).

reply

---

howenterprisey 1 day ago [-]

Does LLVM still take up much of the overall time spent by the Rust compiler? I was thinking of getting involved over there as the most effective way to make speed-ups happen.

reply

nindalf 1 day ago [-]

I remember reading that it does, but because the LLVM IR generated by rustc is verbose. If less IR was generated, LLVM would have less work to do.

reply

masklinn 1 day ago [-]

I think that's one of the eventual purposes of MIR (by both optimising Rust upstream from the LLVM IR generation and having a much simpler language to generate IR for), but I don't know if there's been work towards that yet.

reply

epage 1 day ago [-]

I thought it was two factors (1) unoptimized IR and (2) large translation units (crate rather than file).

reply

---