proj-oot-ootImplementationNotes3

(Simon Marlow on GHC:)

simonmar on Sept 4, 2010 [-]

Oh yes, immutability is crucial. Generational GC already makes you pay for mutation with a write barrier, and in our parallel GC we omit the locking when copying immutable objects, accepting that a few might get copied twice. In the new GC mutable objects become even more expensive. I don't think local GC is viable at all in a language with ubiquitous mutation.
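(for reference, a rough sketch of the kind of write barrier being paid for here -- generic card-marking style, not GHC's actual barrier; names and sizes are made up:)

#include <stdint.h>
#include <stddef.h>

#define HEAP_SIZE  (1 << 20)
#define CARD_SHIFT 9                           /* 512-byte cards */

static uint8_t heap[HEAP_SIZE];
static uint8_t card_table[HEAP_SIZE >> CARD_SHIFT];

typedef struct Obj { struct Obj *field; } Obj;

/* every pointer store goes through this instead of a plain assignment;
   the extra bookkeeping is the "pay for mutation" cost mentioned above,
   and it is what lets a young generation be collected without scanning
   the whole heap */
static void write_field(Obj *owner, Obj *value) {
    owner->field = value;
    size_t card = (size_t)((uint8_t *)owner - heap) >> CARD_SHIFT;
    card_table[card] = 1;                      /* remember: this card was mutated */
}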

---

Roboprog on Sept 3, 2010 [-]

An interesting approach: giving each thread its own "young generation" sub-heap, so transient objects can be disposed of without coordination from other threads / CPUs and their cache pages.

scott_s on Sept 3, 2010 [-]

I worked on a memory allocator (as in malloc, not garbage collection) that takes a similar approach: http://people.cs.vt.edu/~scschnei/streamflow/

A group at Intel independently came up with a similar approach as well: http://portal.acm.org/citation.cfm?id=1133967

Roboprog on Sept 6, 2010 [-]

Cool, the streamflow thing sounds interesting. Is there a top level example or test-driver somewhere in the github project showing what typical use-cases are?

E.g. - I have a test-driver here: http://github.com/roboprog/buzzard/blob/master/test/src/main... (although I have barely started the library I was tinkering on)

Any example client program for your allocator? I'd like to see what use cases you are handling.

scott_s on Sept 6, 2010 [-]

The Larson and Recycle benchmarks are on github. You can read about them in the paper. Email me if you'd like to see an unpublished paper which has some more detail on the allocator's design.

larsberg on Sept 3, 2010 [-]

We've been doing this in Manticore since 2008 or so. We couldn't really get speedups past 12 cores without it (we have a "typical" parallel GC implemented as well to test against). Hopefully we'll get the paper on the GC and associated language trickery - we don't allow pointers between young generations - in somewhere soon :)

simonmar on Sept 3, 2010 [-]

Yes, the GHC design has certainly been influenced by Manticore (that was one of the "other designs" I referred to). Though in GHC we do have some different problems to solve, the worst of which is that we have to support a bunch of programming abstractions that use mutation.

---

interesting (unimplemented?) PEP on Python bytecode verification: http://legacy.python.org/dev/peps/pep-0330/

---

some things we might want to use machine registers for when building an OVM interpreter in assembly:

---

http://www.dyncall.org/

"

dyncall library

The dyncall library encapsulates architecture-, OS- and compiler-specific function call semantics in a virtual "bind argument parameters from left to right and then call" interface allowing programmers to call C functions in a completely dynamic manner. In other words, instead of calling a function directly, the dyncall library provides a mechanism to push the function parameters manually and to issue the call afterwards. This means, that a program can determine at runtime what function to call, and what parameters to pass to it. The library is written in C and assembly and provides a very simple C interface to program against. "
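(a minimal sketch of what using it looks like -- based on the standard dyncall C API (dcNewCallVM / dcArgInt / dcCallInt); untested, so treat it as pseudocode:)

#include <dyncall.h>
#include <stdio.h>

int add(int a, int b) { return a + b; }

int main(void) {
    DCCallVM *vm = dcNewCallVM(4096);
    dcMode(vm, DC_CALL_C_DEFAULT);
    dcReset(vm);
    dcArgInt(vm, 2);                           /* push the parameters manually... */
    dcArgInt(vm, 40);
    int r = dcCallInt(vm, (DCpointer)&add);    /* ...then issue the call */
    dcFree(vm);
    printf("%d\n", r);                         /* 42 */
    return 0;
}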

---

i think i already noted this somewhere but some of these comments are a good read:

https://news.ycombinator.com/item?id=10032295

---

"

There are a number of papers on different kinds of dispatch:

M. Anton Ertl and David Gregg, Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters, in Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI 03), pp. 278-288, San Diego, California, June 2003.

M. Anton Ertl and David Gregg, The behaviour of efficient virtual machine interpreters on modern architectures, in Proceedings of the 7th European Conference on Parallel Computing (Europar 2001), pp. 403-412, LNCS 2150, Manchester, August 2001.

An excellent summary is provided by Yunhe Shi in his PhD thesis.

Also, someone discovered a new technique a few years ago which is valid ANSI C.

"

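(to make the dispatch papers concrete: a toy contrast of plain switch dispatch vs. threaded dispatch via computed goto. note the computed-goto form is a GNU C extension, not ANSI C, and this is not the 'new technique' the answer alludes to:)

#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

/* switch dispatch: one shared indirect branch, which predicts poorly */
static int run_switch(const unsigned char *code) {
    int acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_INC:  acc++; break;
        case OP_DEC:  acc--; break;
        case OP_HALT: return acc;
        }
    }
}

/* threaded dispatch: each handler ends with its own indirect branch,
   which branch predictors handle much better (the Ertl/Gregg point) */
static int run_threaded(const unsigned char *code) {
    static void *labels[] = { &&do_inc, &&do_dec, &&do_halt };
    int acc = 0;
    goto *labels[*code++];
do_inc:  acc++; goto *labels[*code++];
do_dec:  acc--; goto *labels[*code++];
do_halt: return acc;
}

int main(void) {
    const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    printf("%d %d\n", run_switch(prog), run_threaded(prog));   /* 1 1 */
    return 0;
}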
---

Doug's C# parser framework:

https://github.com/drubino/Linq.Parsers

---

"...can use FASM, which is self-hosting, if I can't compile NASM or YASM. " [1]

---

[2] [3]

" Context-free grammars. What this really means is the code should be parsable without having to look things up in a symbol table. C++ is famously not a context-free grammar. A context-free grammar, besides making things a lot simpler, means that IDEs can do syntax highlighting without integrating most of a compiler front end. As a result, third-party tools become much more likely to exist.

Redundancy. Yes, the grammar should be redundant. You've all heard people say that statement terminating ; are not necessary because the compiler can figure it out. That's true — but such non-redundancy makes for incomprehensible error messages. Consider a syntax with no redundancy: Any random sequence of characters would then be a valid program. No error messages are even possible. A good syntax needs redundancy in order to diagnose errors and give good error messages. "

"The first tool that beginning compiler writers often reach for is regex. Regex is just the wrong tool for lexing and parsing. Rob Pike explains why reasonably well."

" The philosophies of error message handling are:

    Print the first message and quit. This is, of course, the simplest approach, and it works surprisingly well. Most compilers' follow-on messages are so bad that the practical programmer ignores all but the first one anyway. The holy grail is to find all the actual errors in one compile pass, leading to:
    Guess what the programmer intended, repair the syntax trees, and continue. This is an ever-popular approach. I've tried it indefatigably for decades, and it's just been a miserable failure. The compiler seems to always guess wrong, and subsequent messages with the "fixed" syntax trees are just ludicrously wrong.
    The poisoning approach. This is much like how floating-point NaNs are handled. Any operation with a NaN operand silently results in a NaN. Applying this to error recovery, and any constructs that have a leaf for which an error occurred, is itself considered erroneous (but no additional error messages are emitted for it). Hence, the compiler is able to detect multiple errors as long as the errors are in sections of code with no dependency between them. This is the approach we've been using in the D compiler, and are very pleased with the results."
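(a minimal sketch of the poisoning idea, assuming a simple AST node type; this is just to capture the shape of it, not how DMD actually does it:)

#include <stdio.h>
#include <stdbool.h>

typedef struct Node {
    bool poisoned;                     /* set once this subtree has produced an error */
    struct Node *left, *right;
    /* ... kind, type, etc. ... */
} Node;

/* report an error only if no child is already poisoned; either way,
   poison this node so errors that depend on it stay silent */
static void semantic_error(Node *n, const char *msg) {
    bool child_poisoned = (n->left && n->left->poisoned) ||
                          (n->right && n->right->poisoned);
    if (!child_poisoned)
        fprintf(stderr, "error: %s\n", msg);
    n->poisoned = true;
}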

" Runtime Library

Rarely mentioned, but critical, is the need to write a runtime library. This is a major project. It will serve as a demonstration of how the language features work, so it had better be good. Some critical things to get right include:

    I/O performance. Most programs spend a lot of time in I/O. Slow I/O will make the whole language look bad. The benchmark is C stdio. If the language has elegant, lovely I/O APIs, but runs at only half the speed of C I/O, then it just isn't going to be attractive.
    Memory allocation. A high percentage of time in most programs is spent doing mundane memory allocation. Get this wrong at your peril.
    Transcendental functions. OK, I lied. Nobody cares about the accuracy of transcendental functions, they only care about their speed. My proof comes from trying to port the D runtime library to different platforms, and discovering that the underlying C transcendental functions often fail the accuracy tests in the D library test suite. C library functions also often do a poor job handling the arcana of the IEEE floating-point bestiary — NaNs, infinities, subnormals, negative 0, etc. In D, we compensated by implementing the transcendental functions ourselves. Transcendental floating-point code is pretty tricky and arcane to write, so I'd recommend finding an existing library you can license and adapting that.
    A common trap people fall into with standard libraries is filling them up with trivia. Trivia is sand clogging the gears and just dead weight that has to be carried around forever. My general rule is if the explanation for what the function does is more lines than the implementation code, then the function is likely trivia and should be booted out."

---

[4]

" String I/O should be unicode-aware & support utf-8. Binary I/O should exist. Console I/O is nice, and you should support it if only for the sake of having a REPL with readline-like features. Basically all of this can be done by making your built-in functions wrappers around the appropriate safe I/O functions from whatever language you’re building on top of (even C, although I wouldn’t recommend it). It’s no longer acceptable to expect strings to be zero-terminated rather than length-prefixed. It’s no longer acceptable to have strings default to ascii encoding instead of unicode.

In addition to supporting unicode strings, you should also probably support byte strings, something like a list or array (preferably with nesting), and dictionaries/associative arrays. It’s okay to make your list type do double-duty as your stack and queue types and to make dictionaries act as classes and objects. Good support for ranges/spans on lists and strings is very useful. If you expect your language to do string processing, built-in regex is important.

If you provide support for parallelism that’s easier to manage than mutexes, your developers will thank you. While implicit parallelism can be hard to implement in imperative languages (much easier in functional or pure-OO languages), even providing support for thread pools, a parallel map/apply function, or piping data between independent threads (like in goroutines or the unix shell) would help lower the bar for parallelism support.

Make sure you have good support for importing third party packages/modules, both in your language and in some other language. Compiled languages should make it easy to write extensions in C (and you’ll probably be writing most of your built-ins this way anyway). If you’re writing your interpreted language in another interpreted language (as I did with Mycroft) then make sure you expose some facility to add built-in functions in that language.

For any interpreted language, a REPL with a good built-in online help system is a must. Users who can’t even try out your language without a lot of effort will be resistant to using it at all, whereas a simple built-in help system can turn exploration of a new language into an adventure. Any documentation you have written for core or built-in features (including documentation on internal behavior) should be accessible from the REPL. This is easy to implement (see Mycroft’s implementation of online help) and is at least as useful for the harried language developer as for the new user. "

---

ufo 7 days ago [-]

Deoptimization is actually really hard to implement if you have an ahead of time compiler like Cannoli. You need to get all the stuff that is living in machine registers or in the C stack and then convert them back to whatever representation your generic interpreter uses.

I think this is actually one of the things that most get in the way if you want to use traditional AOT compiler technology (like gcc or LLVM) to implement a JIT. In state of the art JIT compilers this part is always a nest of highly complex and nonportable assembly language.

reply

---

tathougies 7 days ago [-]

Compiling to machine code is not a panacea for optimization. An optimized JIT compiler is going to blow an AOT compiler out of the water. Being smart about the machine code generated is significantly more important than generating machine code. In particular, PyPy makes several optimizations over python code that a more direct implementation of CPython at the machine level probably wouldn't. For example, PyPy erases dictionary lookups for object member access if the object shape is statically known. Given how prevalent this kind of lookup is in Python code, it's possible that even an interpreter that made this optimization would be faster than a machine code version that used an actual hash table.

I think this compiler also makes this particular optimization, but this is just one of many many optimizations PyPy does. I imagine that with sufficient work, this compiler could be brought up to speed with PyPy, but as it stands right now, PyPy simply benefits from having years of optimization work that a new project doesn't.

reply
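(a sketch of that 'object shape' / hidden-class idea; illustrative only, PyPy's maps are considerably more involved:)

#include <string.h>

#define MAX_ATTRS 8

typedef struct Shape {                 /* shared by every object with this layout */
    const char *names[MAX_ATTRS];      /* attribute name -> slot index */
    int count;
} Shape;

typedef struct Object {
    Shape *shape;
    double slots[MAX_ATTRS];           /* attribute values stored inline, no dict */
} Object;

/* slow path: find the slot for a name by scanning the shape once */
static int slot_of(const Shape *s, const char *name) {
    for (int i = 0; i < s->count; i++)
        if (strcmp(s->names[i], name) == 0)
            return i;
    return -1;
}

/* fast path a JIT (or a specializing interpreter) can emit once the shape
   is known: the name lookup is done ahead of time, leaving an array index */
static double get_x_fast(const Object *o) {
    return o->slots[0];                /* assumes "x" lives in slot 0 for this shape */
}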

ori_b 7 days ago [-]

For most dynamic languages, the available speedups aren't in simple compilation, but in removing the runtime type checks, method lookups, and other slow operations. This needs the ability to guess at what the code is going to do based on past behavior, and generate specialized versions that get thrown away if the guesses are invalidated.

So, for example, you might see that the last 100 calls to a function were done with integers, so you can generate a variant of the function that only works for integers, and check if it's applicable when you enter the function. If that function stops getting used, you can throw it away.

Doing that well ahead of time requires an extremely good idea of how the program will behave at run time, and even with good information, is still very likely to bloat up your binary hugely. (Facebook used to compile their PHP codebase to a multi-gigabyte binary before moving to HHVM, for example).

reply
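(a sketch of that guard-and-specialize pattern; the Value representation and names are made up for illustration:)

typedef enum { TAG_INT, TAG_DOUBLE, TAG_OBJ } Tag;
typedef struct { Tag tag; union { long i; double d; void *obj; } u; } Value;

/* slow generic path: dispatches on tags (only int+int shown here) */
static Value add_generic(Value a, Value b) {
    Value r = { TAG_INT, { .i = 0 } };
    if (a.tag == TAG_INT && b.tag == TAG_INT)
        r.u.i = a.u.i + b.u.i;
    /* ... other tag combinations elided ... */
    return r;
}

/* specialized variant generated after seeing, say, 100 int/int calls */
static Value add_int_specialized(Value a, Value b) {
    if (a.tag != TAG_INT || b.tag != TAG_INT)        /* guard */
        return add_generic(a, b);                    /* guess invalidated: fall back */
    Value r = { TAG_INT, { .i = a.u.i + b.u.i } };   /* fast path, no dispatch */
    return r;
}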

collyw 7 days ago [-]

Actually I think you answered a question that I already asked about Rust being faster than C. If you don't need to carry out as many checks then I see how that will speed things up.

reply

xapata 7 days ago [-]

JITs get to analyze both code and data and optimize for each machine deployed to. A static compiler can only analyze code and the machine used for compilation. If dependencies were pre-compiled, the static compiler won't be able to optimize their relationship with the project. If the machine is changed for deployment.

More information means better optimizations. JITs FTW.

reply

jcranmer 7 days ago [-]

JITs also tend to optimize for compilation time over final performance. Doing ahead-of-time superoptimization or polyhedral loop transformation isn't going to happen in a JIT.

reply

xapata 7 days ago [-]

There's no restriction on the kind of optimization a JIT can do. Perhaps there's a current implementation tendency. In contrast, ahead-of-time (data- and platform-ignorant) compilers are restricted.

reply

---

example of compilation to Rust:

https://github.com/joncatanio/cannoli https://github.com/joncatanio/cannoli/blob/master/resources/papers/cannoli-thesis-paper.pdf

---

" 4.1 Why Compile Python to Rust? We considered three different intermediate representations to compile Python code into. The first was LLVM [17], to leverage the many optimizations implemented by the LLVM compiler. Targeting LLVM, however, would require implementing a garbage collector (or simply ignoring memory management for this prototype). The implementation would be simplified by implementing a standard library to handle various elements of the language, but writing this library in LLVM was considered to not be ideal. Writing the library in C, compiled to LLVM code, would eliminate some of the complexity of writing the library directly in LLVM. Following this idea further we considered targeting C and compiling against a library written in C. Unfortunately, this does not address the issue of memory management. Targeting Rust and compiling the output Rust code against a library written in Rust was considered the best of both worlds. We get a large number of optimizations from the Rust compiler (and the LLVM compiler) as well as memory management provided by Rust’s ownership rules. Therefore, Cannoli is a Python compiler written in Rust that compiles Python code into Rust code. " -- [5]

---

chubot 23 days ago

parent [-]on: Python startup time: milliseconds matter

This is disappointing to me too, but I think there are some problems baked in to the language that make it hard.

The import code in CPython was a mess, which was apparently cleaned up by importlib in Python 3, through tremendous effort. But unfortunately I think importlib made things slower?

I recall a PyCon talk where as of 3.6, essentially everything about Python 3 is now faster than Python 2, EXCEPT startup time!

This is a shame, because I would have switched to Python 3 for startup time ALONE. (As of now, most of my code and that of my former employer is Python 2.) That would have been the perfect time to address startup time, because getting a 2x-10x improvement (which is what's needed) requires breaking changes.

I don't think there's a lack of interest in the broader Python community, but there might be a lack of interest/manpower in the core team, which leads to the situation wonderfully summarized in the recent xkcd:

https://xkcd.com/1987/

FWIW I was the one who sent a patch to let Python run a .zip file back in 2007 or so, for Python 2.6 I think. This was roughly based on what we did at Google for self-contained applications. A core team member did a cleaner version of my patch, although this meant it was undocumented until Python 3.5 or so:

https://docs.python.org/3/library/zipapp.html

The .zip support at runtime was a start, but it's really the tooling that's a problem. And it's really the language that inhibits tooling.

Also, even if you distributed self-contained applications, the startup time is not great. It's improved a bit because you're "statting" a zip file rather than making syscalls, but it's still not great.

In other words, I have wondered about this "failure" for over a decade myself, and even tried to do something about it. I think the problem is that there are multiple parts to the solution, the responsibility for these parts is distributed. I hate to throw everything on the core team, but module systems and packaging are definitely a case where "distributed innovation" doesn't work. There has to be a central team setting standards that everyone else follows.

Also, it's not a trivial problem. Go is a static language and is doing better in this regard, but still people complain about packaging. (vgo is coming out after nearly a decade, etc.)

I should also add that while I think Python packaging is in the category of "barely works", I would say the same is true of Debian. And Debian is arguably the most popular Linux package manager. They're cases of "failure by success".

...

FWIW I think importing is heavily bottlenecked by I/O, in particular stat() of tons of "useless" files. In theory the C to Python change shouldn't have affected it much. But I haven't looked into it more deeply than that.

chubot 23 days ago [-]

EDIT: I should also add that the length of PYTHONPATH as constructed by many package managers is a huge problem. You're doing O(m*n) stat()s -- random disk access -- which is the slowest thing your computer can do.

m is the number of libraries you're importing, and n is the length of the PYTHONPATH.

So it gets really bad, and it's not just one person's "fault". It's a collusion between the Python interpreter's import logic and how package managers use it.
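(the O(m*n) behavior, sketched; simplified -- real CPython probes several suffixes (.py, .pyc, package __init__.py, ...) per directory, which makes it worse:)

#include <stdio.h>
#include <sys/stat.h>

/* for each of the m imports, probe each of the n PYTHONPATH entries:
   O(m*n) stat()s, i.e. random disk accesses */
static int find_module(const char *name, const char *path[], int n) {
    char candidate[4096];
    struct stat st;
    for (int i = 0; i < n; i++) {
        snprintf(candidate, sizeof candidate, "%s/%s.py", path[i], name);
        if (stat(candidate, &st) == 0)
            return i;                  /* found in path entry i */
    }
    return -1;                         /* ImportError */
}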

---

probably recursive descent with a shunting-yard subroutine for operator expression parsing (to handle precedence; pure recursive descent can't handle that [6], section 'The Operator Issue as of 1961')

instead of shunting yard, mb Pratt parsing (also called 'precedence climbing') [7]

alternately, parse operator expressions as lists, then add associativity in postprocessing.

" In addition, while pure recursive descent cannot parse operator expressions, it can recognize them. This means pure recursive descent may not be able to create the parse subtree for an operator expression itself, but it can recognize the expression and hand control over to a specialized operator expression parser. This seems to be what Lucas' 1961 algorithm did, and it is certainly what many other implementations did afterwards. Adding the operator expression subparser makes the implementation only quasi-Chomskyan, but this was a price the profession has been willing to pay.

Alternatively, a recursive descent implementation can parse operator expressions as lists, and add associativity in post-processing. This pushes some of the more important parsing out of the syntactic phase into the semantics but, once again, it seemed that Chomskyan purity had to be thrown overboard if the ship was to stay afloat. " [8]

see also section https://jeffreykegler.github.io/personal/timeline_v3#h1-the_operator_issue_as_of_1968

of course, the author of that suggests his parser, MARPA.
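(a minimal precedence-climbing / Pratt-style operator subparser, just to make the idea concrete; single-digit operands, and it evaluates instead of building a tree:)

#include <stdio.h>
#include <ctype.h>
#include <stdlib.h>

static const char *src;

static int prec(char op) {                     /* binding power of each operator */
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;             /* not an operator: stop climbing */
    }
}

static long parse_primary(void) {
    if (!isdigit((unsigned char)*src)) { fprintf(stderr, "expected digit\n"); exit(1); }
    return *src++ - '0';
}

/* parse a subexpression whose operators all bind at least as tightly as min_prec */
static long parse_expr(int min_prec) {
    long lhs = parse_primary();
    while (prec(*src) >= min_prec && prec(*src) > 0) {
        char op = *src++;
        long rhs = parse_expr(prec(op) + 1);   /* +1 => left-associative */
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
        }
    }
    return lhs;
}

int main(void) {
    src = "1+2*3-4";
    printf("%ld\n", parse_expr(1));            /* prints 3 */
    return 0;
}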

---

" Julia’s optimizer has gotten smarter in more ways than we can list here, but a few highlights are worth mentioning. The optimizer can now propagate constants through function calls, allowing much better dead-code elimination and static evaluation than before. The compiler is also much better at avoiding allocation of short-lived wrappers around long-lived objects, which frees programmers to use convenient high-level abstractions without performance costs. " [9]

---

"On some machines indirection is slower with displacement, so the most-used member of a structure or a record should be first. " -- Mike Morton, Boston, Massachusetts

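(i.e. something like this, where refcount is assumed to be the hot field; whether it actually matters depends on the ISA:)

/* hottest field first, so the common access is at offset 0 and
   needs no displacement on machines where that is cheaper */
struct file_handle {
    unsigned refcount;     /* touched on nearly every operation */
    unsigned flags;
    long     position;
};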
---

C compiler with support for structs written in Assembly (nongnu.org) https://bootstrapping.miraheze.org/wiki/Stage0 http://git.savannah.nongnu.org/cgit/stage0.git/tree/stage2/cc_x86.s http://git.savannah.nongnu.org/cgit/stage0.git/tree/ISA_HEX_Map.org https://news.ycombinator.com/item?id=17851311

mmastrac 1 day ago [-]

I started working on a similar VM-based bootstrap for getting from bare metal to C compiler. If anyone is interested in collaborating let me know.

The idea is that you implement a simple, ASCII-based VM for your platform and that's enough to bootstrap a number of different assemblers, to a basic C compiler, to a full-fledged C compiler and (very minimal) POSIX environment.

The goal is twofold: a base for trusted compilation like this one, and a way to guarantee long-term executability of various archiving programs (ie: guarantee that you can unrar a file in 20 years with minimal work).

EDIT: very rough repo is here - https://github.com/mmastrac/bootstrap

reply

giomasce 1 day ago [-]

I am also doing something like that here:

https://gitlab.com/giomasce/asmc

My philosophy is of having a very simple and custom language (which I called G) for which it is easier to write a compiler in (x86) Assembly, then write a C compiler in G that is good enough to compile tcc. Then tcc is known to be able to compile gcc 4.7.4 (the last version which does not require a C++ compiler).

My C compiler is starting to have a shape, and in theory (if I find the time to work on it) it should not be far from supporting all that tcc needs.

The README in the linked page contains more information.

I'll look at your code too!

reply

---

so the reference implementation will be self-hosted.

but perhaps we can also have a more performant implementation written in Rust?

---

recall that the current idea with Boot is:

make Boot the sorta-intersection of LLVM, WASM, RISCV -- but make it really easy to implement (eg no infinite registers, unlike LLVM). That way it should be dead simple to port Boot to any platform.

---

why not to target Rust, and target LLVM instead:

https://news.ycombinator.com/item?id=12148124

---

the CCL (a lisp) implementation can compile itself in about 30 seconds [10] . So i guess that's a good goal

---

why Rust abandoned segmented stacks: https://mail.mozilla.org/pipermail/rust-dev/2013-November/006314.html

this guy approves: http://dtrace.org/blogs/bmc/2018/09/18/falling-in-love-with-rust/

golang also abandoned segmented stacks (but still has resizable stacks):

" Golang first started with a segmented-stacks model where the stack would actually expand into a separate area of memory, using some clever bookkeeping to keep track. A later implementation improved performance in specific cases by instead using a contiguous stack where instead of splitting the stack, much like resizing a hashtable, a new large stack is allocated and through some very tricky pointer manipulation, all the contents are carefully copied into the new, larger, stack. " [11]

---

Rust sort of abandoned M:N greenthreads:

https://github.com/rust-lang/rfcs/blob/0806be4f282144cfcd55b1d20284b43f87cbe1c6/text/0230-remove-runtime.md

this guy approves:

http://dtrace.org/blogs/bmc/2018/09/18/falling-in-love-with-rust/

note however that this limits concurrency, so in Oot, we probably do want greenthreads: https://rcoh.me/posts/why-you-can-have-a-million-go-routines-but-only-1000-java-threads/

---

some things that many languages (eg oot) probably should have eventually:

https://hacks.mozilla.org/2018/10/webassemblys-post-mvp-future/

---

" Unknown said...

    "offset + x < length" is not a safe form for a bounds check, if x can be a value decoded from the network data. If x can have a value close to the maximum positive value for its type, then "offset + x" may overflow and compare less than length, even though x is not within bounds. Instead, compare x to length - offset, knowing that offset <= length is an invariant.

"

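(in code, with the types assumed to be unsigned size_t:)

#include <stddef.h>

/* BAD: if x comes from the network, offset + x can wrap around and
   compare less than length even though x is out of bounds */
int in_bounds_bad(size_t offset, size_t x, size_t length) {
    return offset + x < length;
}

/* GOOD: compare x against the remaining space; relies on the
   invariant offset <= length, so the subtraction cannot wrap */
int in_bounds_good(size_t offset, size_t x, size_t length) {
    return x < length - offset;
}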
---

Ericson2314 1 day ago [-]

IMO most discussion of the compilation speed is incredibly narrow-minded. Value incremental compilation more than anything else. Programs barely change each rebuild, so that should solve all the performance problems without having to maim ourselves.

reply

---

https://www.reddit.com/r/rust/comments/awx9cy/github_kyrenluster_an_experimental_lua_vm/ https://github.com/kyren/luster

" Inside luster are two libraries called "gc-arena" and "gc-sequence", and they represent a new (I believe novel?) system for safe garbage collection in Rust. There have been several attempts here before such as rust-gc and shifgrethor, and this represents another attempt with... different? limitations more appropriate for implementing language runtimes like Lua. "

---

" Major changes

    The lockfile (and configuration) format will become a strict subset of YAML. In the process, the lockfile information will be slightly changed to account for some long-awaited changes (such as #5892).
    We'll add support for plugins, which will be able to alter various things - from adding new commands to hooking into the resolution / fetching / linking steps to add support for new package sources or install targets.
    Related to the plugin system, Yarn will become an API as much as a CLI. You can expect to be able to require it and start using its components in your script - no need to parse your package.json anymore, no need to run the resolution .. Yarn will abstract all those tedious tasks away.
    Support for both Node 4 and Node 6 will be dropped. We don't expect Berry to have stable releases until after Node 6 gets EOL (expected April 2019), by which point it won't matter anymore.
    The log system will be overhauled - one thing in particular we'll take from Typescript are diagnostic error codes. Each error, warning, and sometimes notice will be given a unique code that will be documented - with explanations to help you understand how to unblock yourself.
    Some features currently in the core (such as autoclean) will be moved into contrib plugins. They'll still be supported, but might have a different release cycle than the standard bundle.
    The codebase will be ported from Flow to TypeScript. To understand the rationale please continue reading, but a quick summary is that we hope this will help our community ramp up on Yarn, and will help you build awesome new features on top of it.
    The cache file format will switch from Tar to Zip, which offer better characteristics in terms of random access."

scrollaway 40 days ago [-]

Very happy to see yarn.lock will finally be a proper format that won't need its own parser. YAML subset is a pretty good choice, though I think canonicalized, indented JSON would be a better choice for this use case. Incidentally that's what npm uses as lockfile, I wonder if there's room to have the two package managers share the format (or even share the file itself).

Very excited to see shell compatibility guarantee in scripts as well. Using environment variables in scripts is a pain right now.

Finally one of the biggest news is the switch from Flow to Typescript. I think it's now clear that Facebook is admitting defeat with Flow; it brought a lot of good in the scene but Typescript is a lot more popular and gets overall much better support. Uniting the JS ecosystem around Typescript will be such a big deal.


wopian 40 days ago [-]

npm's lockfile is a pain to diff in PRs because of the JSON format where what was maybe 20 changed lines in yarn is upwards of 80 from the brackets.

With YAML and whatever format yarn.lock was in, the only changed lines are changes to the version resolutions, hash and dependencies.

donatj 40 days ago [-]

I'd say safely merging YAML diffs however could be trouble.

I don't know how restricted their YAML subset is, but in my experience it's so loose a format the only way to be sure YAML says what you think it says is to run it through a parser.

---

https://v8.dev/blog/spectre

---

when doing register allocation/assignment (allocation means to decide which variables get to be in registers, assignment means actually deciding which particular register each variable goes in), you hear a lot about 'linear scan' as the simple/low-latency way, but this is still a 'global' method; the top answer at [12] described a 'local' method (local to each basic block).

another slide presentation i found once said something about two obvious local methods, 'bottom up' and 'top down'. I think 'top down' just meant to identify all of the variables in a basic block, then assign priorities to each of them.

for Oot's reference implementation (involving OVM->BootX->Boot) we clearly just want a simple local method, because if we cared about performance, we'd go from OVM directly into the native platform, skipping BootX and Boot. In fact for BootX->Boot we should just use 'top down' and say that registers up through r7 map onto machine registers and the rest live in memory somewhere -- this is easy to program, fast to compile, and has predictable performance for BootX programmers.
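(toy version of that 'top down' local method -- rank the block's variables by use count, give the most-used ones the registers, spill the rest; illustrative only:)

#include <stdio.h>
#include <stdlib.h>

#define NUM_REGS 8                             /* e.g. r0..r7 are real registers */

typedef struct { const char *name; int uses; int reg; } Var;   /* reg == -1 => memory */

static int by_uses_desc(const void *a, const void *b) {
    return ((const Var *)b)->uses - ((const Var *)a)->uses;
}

/* assign registers for one basic block: most-used variables first */
static void assign_block(Var *vars, int n) {
    qsort(vars, n, sizeof *vars, by_uses_desc);
    for (int i = 0; i < n; i++)
        vars[i].reg = (i < NUM_REGS) ? i : -1;
}

int main(void) {
    Var v[] = { {"tmp", 1, -1}, {"i", 9, -1}, {"sum", 7, -1} };
    assign_block(v, 3);
    for (int i = 0; i < 3; i++) {
        if (v[i].reg >= 0) printf("%s -> r%d\n", v[i].name, v[i].reg);
        else               printf("%s -> memory\n", v[i].name);
    }
    return 0;
}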

should copy this stuff into plbook after i learn a little more about it.

---

level 1 gasche 10 points · 6 years ago

How does performance compare to OcamlJit2?

The usual problems with C or LLVM backends for OCaml are:

    having to use a potentially costly calling convention (whereas OCaml's is optimized for nice tail call elimination and currying)
    trying to match OCaml extremely fast compilation of exceptions raising and catching
    interacting with the garbage collection, which may pass GC roots in registers rather than on the stack, something that are not often planned for in runtimes parametrized over an optional, generic GC

Even when targeting languages that are optimized to death (LLVM, C), these three issues alone are often enough to make programs slower than when using the apparently rather naive (not globally-optimizing) native OCaml compilation. Good choices of data structure representation, GC, calling and exception compilation can go a very long way, that more and more complex optimization techniques actually have a hard time competing with. Spending a man-year of work on a bright new register allocator might not actually do much better on real-world code if function calls are slower for fundamental architectural reasons.

level 2 julesjacobs 4 points · 6 years ago · edited 6 years ago

You don't have to use the C calling convention with LLVM; it has a calling convention named "fastcc" that also supports tail calls. I'm not sure how fast it is compared to OCaml though.

The last two problems can be greatly reduced at the cost of slightly more expensive function calls. You transform each function so that it can exit in multiple different ways: return normally, throw exception, capture stack for GC. This can be done by returning a return code for each type of return. On a "throw exception" type return, if there is an exception handler in the current stack frame, invoke the handler. Otherwise return to the next stack frame (again with a "throw exception" return code). On a "capture stack for GC" type return, record the live local variables as GC roots, and continue further capture of the stack by also returning a "capture stack for GC" return code from this stack frame. This approach has the advantage that it supports reasonably fast exceptions, GC roots are in registers, and the exception control flow is exposed as normal control flow to LLVM optimization passes. The disadvantages are that throwing an exception across many stack frames is slow, and after each function return you have to check a return code. You also need some ingenuity to rebuild the stack after unrolling it for GC.
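(the shape of that transformation, sketched with made-up types: every call returns a code, and the caller checks it and either continues, keeps unwinding for an exception, or records roots and keeps unwinding for GC:)

typedef enum { RET_NORMAL, RET_EXCEPTION, RET_GC_CAPTURE } RetCode;

typedef struct { long result; /* plus exception / GC-root bookkeeping */ } Frame;

static RetCode callee(Frame *f);

static RetCode caller(Frame *f) {
    RetCode rc = callee(f);
    if (rc == RET_EXCEPTION)
        return RET_EXCEPTION;      /* no handler here: keep unwinding */
    if (rc == RET_GC_CAPTURE) {
        /* record this frame's live locals as GC roots, then keep unwinding */
        return RET_GC_CAPTURE;
    }
    f->result += 1;                /* normal path continues */
    return RET_NORMAL;
}

/* trivial callee so the sketch is self-contained */
static RetCode callee(Frame *f) { f->result = 41; return RET_NORMAL; }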

You can use the same stack capturing strategy to implement delimited continuations without CPS or unsafe stack manipulation.

level 2 Raphael_Amiard 2 points · 6 years ago

Your comment is spot on, and you listed successfully every problem this project has.

Additionally, this project is running OCaml bytecode, that is untyped, so it loses a bit more potential for optimization.

    About 1, you can use fastcc that supports tail calls, and that's what we are using.
    About 2, spot on, we're using setjmp and longjmp. If you're doing a backend for ocamlopt rather than a bytecode compiler, you might be able to use LLVM zero cost exceptions, but i don't know what the speed would be compared to ocamlopt.
    About 3, spot on, LLVM support for garbage collection is bad in this regard, you can't have roots in registers.

Some more information about the trade-offs i made here

---

 steveklabnik 18 days ago [-]

> It does make me thinking about how this could be useful in the context of Futures and async/await.

At its core, when you pass a Future (really, a chain of futures, but in the end that's still a Future) to an executor, the executor creates a stack for it. An executor with multiple futures will then switch between them, sorta similar to the code seen here. The details depend on the exact executor, of course, but the fundamental idea is the same.

One of the nice things about the futures design is that you can know exactly how big the stack needs to be, so the resizing stuff isn't an issue.

in a discussion on a tutorial on implementing greenthreads: https://news.ycombinator.com/item?id=20164062

---

cosarara 20 hours ago [-]

How does this approach play with error handling? If there is an error in my code and it's only caught at step 10, won't the error message be completely inscrutable? Will the line numbers make sense?

reply

thedufer 18 hours ago [-]

In the only compiler I'm somewhat familiar with (OCaml), a decent amount of the work is done in AST->AST transformations. In the AST, every node is annotated with a location, which is a tuple of filename, line number, char start, char end. As long as these are propagated in the right way, you get good location information in errors.

reply

sterlind 18 hours ago [-]

A simple approach could be to mark up spans of characters in layer N with what spans were used to generate them in layer N-1. Then when parser N hits an error, put red squiggly lines under the source characters responsible. For instance, in C#:

using (<span decl>var x = 3</span>) { <span inner>Console.WriteLine(x);</span>}

Parser 0 transforms the using:

<span stmt1 parent=decl>var x = 3;</span> try { <span stmt2 parent=inner> Console.WriteLine(x);</span> } finally { <span stmt3=generated>x.Dispose();</span> }

Parser 1 (or a later parser, inductively) notices x doesn't have Dispose() defined, and throws MissingMethod(Dispose, span=generated).

Parser 0 catches this error, knows that this is because x isn't disposable, and emits its own error: NotDisposable(span=decl).

If a parser doesn't know to catch the error, it still passes it up the chain, transforming spans where appropriate. When it hits the IDE, the error is printed and the red squiggly lines go where they should.

(Disclaimer: I've never written a compiler)

reply

mojuba 19 hours ago [-]

Line number information is usually passed down the chain because it's required for generating debug info anyway.

However, as you descend into deeper optimization layers of the compiler, errors are less and less likely to occur. Optimizations are either possible or not, though sometimes these layers can warn you about possible inefficiencies at the source code level (unused local, etc.)

reply

---

fluffything 1 day ago [-]

The rust compiler now has features that few or no C and C++ compilers have: incremental compilation within a single translation unit, pipelined compilation, multi-threaded and lazy query-based compilation, ...

Implementing each of these features has required whole program refactorings in a large-scale codebase performed by few individuals while hundreds of other developers were simultaneously evolving the software and implementing new features.

Having been a C++ programmer for over 10 years, none of these refactorings would have paid off in C++ because of how time consuming it would have been to track all bugs introduced by them.

Yet they do pay off in Rust because of my favourite Rust feature: the ability to refactor large-scale software without breaking anything: if a non-functional change compiles, "it works" for some definition of "works" that's much better than what most other languages give you (no memory unsafety, no data-races, no segfaults...).

Very few programming languages have this feature, and no low-level programming language except for Rust has it.

fluffything 1 day ago [-]

I just use `cargo watch -x test -d 5` to run all my tests after I pause modifying code for 5 seconds.

My editor (emacs) uses `cargo watch -x check -d 0.5` to run `cargo check` (which is blazing fast for incremental edits) to type check all the code and show "red squiggles" with the error messages inline.

So my interactive workflow with Rust is only edit-"type check"-edit-"type check" where "type check" takes often less than a second.

Asynchronously in the background, the whole test suite (or the part that makes sense for what I'm doing) is always being run. So if a test actually fails to run, I discover that a little bit later.

I don't know any language for which running all tests happens instantaneously. With Rust, if anything, I write less tests because there are many things I don't need to check (e.g. what happens on out-of-bounds accesses).

This is the best workflow I've ever had. In C++ there was no way to only run type checking, so one always had to run the full compilation and linking stages, where linking took a long time, and you could get template instantiation errors quite late, and well, linking errors. I don't think I've ever seen a Rust linking error. They probably do happen, but in C++ they happen relatively often (at least once per week).

reply

---

howenterprisey 1 day ago [-]

Does LLVM still take up much of the overall time spent by the Rust compiler? I was thinking of getting involved over there as the most effective way to make speed-ups happen.

reply

nindalf 1 day ago [-]

I remember reading that it does, but because the LLVM IR generated by rustc is verbose. If less IR was generated, LLVM would have less work to do.

reply

masklinn 1 day ago [-]

I think that's one of the eventual purposes of MIR (by both optimising Rust upstream from the LLVM IR generation and having a much simpler language to generate IR for), but I don't know if there's been work towards that yet.

reply

epage 1 day ago [-]

I thought it was two factors (1) unoptimized IR and (2) large translation units (crate rather than file).

reply

---

i guess one difference between Boot and BootX is that when compiling the compiler itself, it should compile to pure Boot (e.g. not use floats, etc.) (or perhaps there could be build options that would use BootX (e.g. parallelize) but they are optional and off by default)

---

y'know, you could have more than one IL at each step of compilation. These could represent alternative representation types (e.g. SSA vs CPS), or they could represent compile vs interpret. So now the flowchart of compilation is a DAG, not a line. There are 4 important stages though:

Oot
Oot Core
OVM (OVM_high)
Boot/BootX

---

https://tokio.rs/blog/2019-10-scheduler/ https://news.ycombinator.com/item?id=21249708

---

my take: this sounds really great, but as e says, "It's actually fairly trivial to dynamically link a system where all the implementation details are hidden behind uniformity and dynamism...But Swift didn't do this. Swift tries its hardest to generate code comparable to what you would expect from Rust or C++, and how it accomplishes that is what makes its ABI so interesting.".

so we should just take the easy road here rather than the fast road. BUT we should adopt Swift's LIMITATIONS so that we can change our mind and do what Swift does later, if we can tolerate the additional implementation complexity.

https://gankra.github.io/blah/swift-abi/

" How Swift Achieved Dynamic Linking Where Rust Couldn't

For those who don't follow Swift's development, ABI stability has been one of its most ambitious projects and possibly its defining feature, and it finally shipped in Swift 5. The result is something I find endlessly fascinating, because I think Swift has pushed the notion of ABI stability farther than any language without much compromise. ...

Ok so what's dynamic linking? For our purposes it's a system where you can compile an application against some abstract description of an interface without providing an actual implementation of it. ...To run properly, it must tell the system's dynamic linker about all of the interfaces it needs implementations for, which we call dynamic libraries (dylibs)...since it needs to be consistent over a long period of time, that ABI better be stable. So dynamic linking is our goal, and ABI stability is just a means to that end.

For our purposes, an ABI can be regarded as 3 things:

    The layout of types
    The calling convention of functions
    The names of symbols

If you can define these details and never break them, you have a stable ABI, and dynamic linking can be performed.

Now to be clear, ABI stability isn't technically a property of a programming language. It's really a property of a system and its toolchain. To understand this, let's look at history's greatest champion of ABI stability and dynamic linking: C.

All the major OSes make use of C for their dynamically linked system APIs. From this we can conclude that C "has" a stable ABI. But here's the catch: if you compile some C code for dynamic linking on Ubuntu, that compiled artifact won't work on MacOS or Windows... Why? Because ABI is something defined by the platform...

But if that's the case, why don't platform vendors provide stable ABIs for lots of other languages? Well it turns out that the language isn't completely irrelevant here. Although ABI isn't "part" of C itself, it is relatively friendly to the concept. Many other languages aren't.

To understand why C is friendly to ABI stability, let's look at its much less friendly big brother, C++.

Templated C++ functions cannot have their implementations dynamically linked. If I provide you with a system header that provides the following declaration, you simply can't use it:

template <typename T> bool process(T value);

This is because it has no symbol. C++ templates are monomorphically compiled, which is a fancy way of saying that the way to use them is to copy-paste the implementation with all the templates replaced with a particular value...Now perhaps the platform could make a promise that it has precompiled several monomorphic instances, so say symbols for process<int> and process<bool> are available. You could make that work, but then the function wouldn't really be meaningfully templated anymore, as only those two explicitly blessed substitutions would be valid.

There would be little difference from simply providing a header containing:

bool process(int value); bool process(bool value);

...

Idiomatic Rust is similarly hostile to dynamic linking (it also uses monomorphization), and so an ABI-stable Rust would also end up only really supporting C-like interfaces.

...

1.3 Swift's Stable ABI

I have now made some seemingly contradictory claims:

    Swift has similar features to Rust
    Rust's features make it hostile to dynamic linking
    Swift is great at dynamic linking

The secret lies in where the two languages diverge: dynamism. Rust is a very static and explicit language, reflecting the sensibilities of its developers and early adopters. Swift's developers preferred a much more dynamic and implicit design, and so that's what they made.

As it turns out, hiding implementation details and doing more work at runtime is really friendly to dynamic linking.

...

But what's really interesting about Swift is the ways it's not dynamic.

It's actually fairly trivial to dynamically link a system where all the implementation details are hidden behind uniformity and dynamism. In the extreme case, we could make a system where everything is an opaque pointer and there's only one function that just sends things strings containing commands. Such a system would have a very simple ABI!

And indeed, in the 90's there was a big push in this direction with Microsoft embracing COM and Apple embracing Objective-C as ways to build system interfaces with simple and robust ABIs.

But Swift didn't do this. Swift tries its hardest to generate code comparable to what you would expect from Rust or C++, and how it accomplishes that is what makes its ABI so interesting.

...

It's worth noting that the Swift devs disagree with the Rust and C++ codegen orthodoxy in one major way: they care much more about code sizes (as in the amount of executable code produced). More specifically, they care a lot more about making efficient usage of the cpu's instruction cache, because they believe it's better for system-wide power usage. Apple championing this concern makes a lot of sense, given their suite of battery-powered devices.

It's harder for third party developers to care about this, as they will naturally only control some small part of the software running on a device, and typical benchmarking strategies don't really capture "this change made your application run faster but is making some background services less responsive and hurting battery life". Hence C++ and Rust inevitably pushing towards "more code, more fast".

...

1.4 Resilience and Library Evolution

The Swift developers cover this topic fairly well in their documentation https://github.com/apple/swift/blob/master/docs/LibraryEvolution.rst . I'll just be giving a simplified version, focusing on the basic motivation....It means that things default to having ABIs that are resilient to breaking when the implementation changes in an API-preserving way (nothing can save API-breaking changes)...When compiled as a dylib, Swift defaults to implicitly hiding as many details as possible, requiring you to opt into guarantees by adding annotations. Crucially, these annotations don't affect the shape of an API, they're "only" for optimizing the ABI, at the cost of resilience. Additionally, some ABI annotations can be added after a library has been published without breaking the old ABI. Applications compiled against new annotations are able to use that information to run faster, at the cost of compatibility with older versions of the library.

...This is all very abstract, let's look at a simple library evolution example.

Let's say we draft up a simple FileMetadata? interface in C:

// version 1
typedef struct {
    int64_t size;
} FileMetadata;

bool get_file_metadata(char* path, FileMetadata* output);

...

Now let's say we realize that this function should also provide info on when it was last modified:

// version 2
typedef struct {
    int64_t last_modified_time; // 64 bits CLEARLY enough...
    int64_t size;
} FileMetadata;

bool get_file_metadata(char* path, FileMetadata* output);

Oops, we've messed up our ABI! Our hypothetical caller is stack allocating a FileMetadata, so they're assuming it has a particular size and alignment. Additionally, they're directly accessing the size field, which they assume is at a particular offset in the struct.

...

Swift doesn't require you to make this compromise.

The following two designs are totally ABI compatible while remaining perfectly idiomatic to use:

// version 1
public struct FileMetadata {
    public var size: Int64
}

public func getFileMetadata(_ path: String) -> FileMetadata?

// version 2
public struct FileMetadata {
    public var lastModifiedTime: Int64 // just add this field, that's it
    public var size: Int64
}

public func getFileMetadata(_ path: String) -> FileMetadata?

Unfortunately, guaranteeing the layout of FileMetadata using the @frozen attribute in future versions would be an ABI breaking change under the current design. Hopefully it will be clear why by the end of this document.

2 Details

Ok! Now for the details, where I will in fact be ignoring the actual details and instead discussing the high level ideas behind them.

Once again, feel free to check out Swift's documentation of the annotations that are used to manage abi resilience https://github.com/apple/swift/blob/master/docs/LibraryEvolution.rst . That covers a lot of motivation and the fine-grain details of what you can and can't do.

2.1 Resilient Type Layout

By default, a type that is defined by a dylib has a resilient layout. This means that the size, alignment, stride, and extra inhabitants of that type aren't statically known to the application. To get that information, it must ask the dylib for that type's value witness table at runtime.

"Witness tables" are Swift's term for what are ultimately vtables. The details of how these tables are acquired and laid out don't really interest me, so I won't discuss that.

    Ok actually it is Interesting that Swift needs to be able to generate witness tables at runtime to deal with the fact that generic type substitutions can't be statically predicted in the face of dynamic linking of generic code, but that's getting way ahead of ourselves.

The value witness table is just the "vtable of basic stuff you might want to know about any type", much like how Java's Object type is used. So it has all the stuff like size, alignment, stride, extra inhabitants, move/copy constructors (for ARC), and destructors.

At this point those with experience in language design probably suspect this results in resilient types having to be boxed and passed around as a pointer. And those suspicions are indeed correct... but not quite.

See what's really interesting about resilient layout is that it's only something that the application is forced to deal with, and only in a very limited way. Inside the boundaries of the dylib where all of its own implementation details are statically known, the type is handled as if it wasn't resilient.

Inside the dylib a resilient struct is stored inline, stored on the stack, passed around by value, and even scalarized. But once we move outside the dylib something else must be done.

We could potentially accomplish this with expensive type layout conversion at the boundaries, but we don't! Type layouts are always the same on both sides of the resilience boundary!

Type layouts are always the same on both sides of the resilience boundary!?!?

Yes!

The key insight here is that laying out things inline can actually be done dynamically with relative ease. Memory allocators and pointers don't care about static layouts, they just work with completely untyped sizes, alignments, and offsets. So as long as you have all the relevant value witness tables, everything works basically fine, just with more dynamic values than usual.

The real major problem is stack allocations. llvm really doesn't like dynamic stack allocations. Yes, alloca does exist, but it's a bit messy. I believe the Swift devs managed to get it working all the time for resilient layout, but not for some of its cousins we'll see in the next section. In the general case, local variables need to actually be boxed up onto the heap. For convenience, I'll just generically refer to these dynamic stack allocations as "boxed".

Crucially this boxing doesn't change layouts, just where local variables are stored and how they're passed in the calling convention (more on that later). Also, once there is some indirection everything is still stored inline. So types which already come with indirection like Array<MyResilientStruct> or MyResilientClass require no additional allocation, and consequently no ABI changes.

I've left out some key details, but let's address them while looking at polymorphic generics, since it turns out those are quite similar, but also more interesting!

2.2 Polymorphic Generics

Unlike Rust and C++ which must monomorphize (copy+paste) implementations for each generic/template substitution, Swift is able to compile a generic function into a single implementation that can handle every substitution dynamically.

This has several benefits:

    Massively reducing code size
    Massively reducing the amount of code that must be compiled
    Allowing generic code to be dynamically linked

A polymorphic implementation can't be inlined or optimized as well as a monomorphic one (without a JIT), so the Swift compiler still monomorphizes things when it's possible and seems profitable. But we're making a dylib, so it's not possible for our public API.

As it turns out, polymorphically compiled generic code is really quite similar to code that handles resilient types. In both cases the basic value-witnessy properties of the type aren't statically known, and so stack values need boxing. The generic code just needs to be able to find the generic type's protocol implementations too. We can get that from the type's protocol witness tables which can be acquired using the same machinery we use for the value witness tables.

So really this is basically the same problem!

💃 Brief Aside About Existentials 💃

The resilient/polymorphic type machinery solves a big chunk of the Object Safety problem that heavily limits Rust's trait objects. Swift calls these Protocols as Types or just existentials, depending on who you ask. Generic code actually having symbols means there's no problem with it being stuffed in a vtable. Resilient layout eliminates the problems that come with dynamic "by value" manipulation of Self and any of its associated types.

Existentials are the really tricky case for stack allocations, because they can prevent the caller from knowing the size of the return value before making the call, and that really messes up alloca. So once existentials get involved, alloca goes out the window and actual boxing needs to happen.

Also associated types in function signatures still prevent existentials from being created because that creates fundamental type system problems unrelated to ABI. Every instance of MyProtocol? could have a different associated type, and you can't let them get mixed up. No I'm not going to get into how Swift could use path-dependent types to deal with this.

Associated types are fine for normal polymorphic code, since generics enforce that every instance has the same type, which is the only issue with them in existentials.

Now, what does the presence of resilient/polymorphic types do to calling conventions?

Well first off, we have the witness tables. In the resilient case all the types are statically known, and so the implementation theoretically has all the information it needs to look up witness tables for itself. Polymorphic code has no such luxury.

Polymorphic code needs to work with any type, and structs don't contain any identifying runtime information. Worse yet, polymorphic code can be called without providing any values of that type! So the caller needs to pass in type information. Abstractly, we could pass in minimal information and have the polymorphic code look up all the witness tables, but that's really wasteful (consider calling the function in a loop). So instead Swift's actual implementation has the caller pass in pointers to every required witness table as extra arguments (automatically handled by the compiler).

With the witness tables handled, we just have the problem of passing/returning actual values. The main thing is that storing these kinds of values in registers is totally off limits; we really need to pass them around as pointers. That's sometimes a bit slower, but not a terribly huge deal given we're already signed up for dynamic linking and polymorphic compilation.

But to really understand passing these values, we need to talk about reabstraction.

2.3 Reabstraction

Resilient compilation forces us to use a particular calling convention that's different from what we would use statically. For instance, in the x64 SysV ABI, the following would have all of its fields passed in registers:

struct Vec4 { int32_t x; int32_t y; int32_t z; int32_t w; };

void process(Vec4 vector);

If Vec4 were resilient, it would instead have to be passed by-reference. But remember, not all code that works with a type needs to handle it resiliently. For instance, if a dynamic library defines Vec4 resiliently, it should ideally still be able to handle it non-resiliently inside of itself.

Similarly, the polymorphicness (polymorphicity?) of things can be changed by their context. Consider the following Swift code that manipulates a closure:

    // closure is very generic
    func map<T, U>(_ input: T, with mapper: (T) -> U) -> U {
        return mapper(input)
    }

    // closure is kinda generic
    func mapInt<U>(_ input: Int, with mapper: (Int) -> U) -> U {
        return map(input, with: mapper) // just delegate to generic map
    }

    // actual closure isn't generic
    let result = mapInt(3, with: { return $0 >= 5 })

Assuming map and mapInt are compiled polymorphically, the closure has 3 potentially different ABIs:

    (Int) -> Bool
    (Int) -> U
    (T) -> U

The (Int) -> Bool ABI can potentially be ignored and discarded because we know we're passing it to something that expects (Int) -> U, but the (T) -> U ABI is completely hidden from us!

This naively results in an unfortunate conclusion: closures (and function pointers) must have the maximally generic ABI just in case that's needed. Thankfully, this conclusion is incorrect.

Instead Swift uses reabstraction thunks. These thunks simply wrap a function with the wrong ABI in a function with the right one. So what the compiler "actually" generates is more like this:

(note: not real Swift code because you can't explicitly talk about generics/conventions in this way)

    // closure is very generic
    func map<T, U>(_ input: T, with mapper: (T) -> U) -> U {
        return mapper(input)
    }

    // closure is kinda generic
    func mapInt<U>(_ input: Int, with mapper: (Int) -> U) -> U {
        // reabstract it!
        let thunk: <T, U>(T) -> U = { return mapper($0) }
        return map(input, with: thunk) // just delegate to generic map
    }

    // actual closure isn't generic
    let temp_closure: (Int) -> Bool = { return $0 >= 5 }
    // reabstract it!
    let thunk: (Int) -> U = { return temp_closure($0) }
    let result = mapInt(3, with: thunk)

In this way everything can use the best possible calling convention while still allowing for more generic ones to be used in different contexts.

Even without this closure passing problem, reabstraction also allows a single implementation to be used in several different contexts without having to compile different versions of it. So for instance we can reabstract a concrete protocol implementation into a polymorphic one by just wrapping all the functions in reabstraction thunks. A nice code size win!

(I believe the Swift devs don't technically call that one Reabstraction but it's close enough that I'm happy to conflate the concepts. Thunk away ABI complexities!)

Now that we have a basic idea of how resilience and polymorphism affects calling conventions, it should hopefully be clear why it's an ABI-breaking change to mark a type as @frozen, removing its resilient layout: it would change the way the type is passed to functions.

This could have potentially been "fixed" by making resilience part of the name mangle and providing both the resilient and non-resilient versions, but that requires robust versioning info for every attribute and could lead to a huge combinatoric explosion in the number of symbols a dylib provides. Not necessarily a great idea.

2.4 Materialization

Resilient layout could be generalized to provide the offsets to a resilient type's public fields, but Swift actually takes this to another level: public fields don't actually need to exist by default.

Resiliently exposed fields are only exposed as getters and setters!

Getters and setters are actually a first class feature of Swift that can be used explicitly, but for resilience the compiler will implicitly introduce those getters and setters just to hide the fact that the fields physically exist, in case you change your mind.

A computed field in Swift is manipulated in exactly the same way as a real one, and so even without resilience library authors are free to replace physical fields with computed ones without changing their API. Mandating computed access just makes it ABI stable as well.

But here's the catch: you can take a mutable reference (inout) to a field. Even a computed one!

Swift lets you take references to things that don't exist.

The secret to this is materialization.

Inouts can only appear as arguments to a function, and so they're naturally scoped to a function call. As such, we can "take a reference" to a computed field by using a temporary with cleanup. So this code:

    struct MyStruct { public var myField: FieldTy }

    func doTheThing(x: inout FieldTy)

    var myVal = MyStruct(..)
    doTheThing(&myVal.myField)

is (very roughly) compiled to:

    var myVal = MyStruct(..)

    var temp: FieldTy = myVal.get_myField()
    doTheThing(&temp)
    myVal.set_myField(temp)

Or to break that into steps:

    Allocate a temporary local variable (temp)
    Initialize temp with the getter
    Call the inout-using function with the address of temp
    Feed the value of temp into the setter

This creates a very interesting difference from references in Rust and C++: if you take a reference to a field, it may point to a temporary that's only valid for the scope of the call it was passed to!

Another interesting consequence of this is that Swift's mutable references all actually have a finalizer which must be executed for a write to "stick". This means they cannot be returned or stored, as the finalizer would be lost and the referent would be deallocated.

This in turn creates a hilarious footgun many Swift developers run into where they think they're clever and convert an inout into a raw pointer (using withUnsafeMutablePointer) that they store for later and -- oops it's dangling!

However you can overcome this limitation with callbacks, as follows:

    func doSomeWork(_ callback: (inout Float) -> ()) {
        var vec = Vec4()
        // do lots of work...

        // "return" a mutable reference to the caller
        callback(&vec.x)
        // setter potentially called here to commit the writes
        // maybe do some more work...
    }

    doSomeWork { (val: inout Float) -> () in val *= 2 }

Callbacks are of course very annoying and noisy, and so this was solved with the slayer of callbacks, coroutines! The same code can be rewritten as:

    // No idea if this is valid Swift syntax, and I don't care!
    func doSomeWork() -> inout Float {
        var vec = Vec4()
        // do lots of work...

        // "return" a mutable reference to the caller
        yield &vec.x
        // setter potentially called here to commit the writes
        // maybe do some more work...
    }

    doSomeWork() *= 2

Nifty!

2.5 Ownership

Swift makes extensive use of reference counting, and as it turns out it's really expensive to constantly modify the count! Especially when you make all your collections copy-on-write (CoW), so an errant reference count bump can make an O(n) algorithm O(n²)! (I actually consider this a correctness error in the case of data structures and algorithms, but reasonable people may disagree.)

To help deal with this, Swift made ownership of reference-counted values (~classes) part of its calling convention. The most important aspect is "+0" vs "+1", referring to how the caller should change the refcount:

    +1: callee has an "owned" value it's responsible for releasing (if it doesn't escape)
    +0: callee has a "borrowed" value it's responsible for retaining (if it escapes)

Since we're talking about ownership, I'm legally required to compare this system to Rust, and the comparison is pretty straight-forward:

    +1 is a move (T)
    +0 is a shared immutable reference (&T)
    inout is a mutable exclusive reference (&mut T)
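
For concreteness, a minimal Rust rendering of those three conventions (the function names here are made up):

    fn consume(_v: Vec<u8>) {}       // +1 / owned: callee is responsible for dropping it
    fn inspect(_v: &Vec<u8>) {}      // +0 / borrowed: callee must clone if the value escapes
    fn modify(v: &mut Vec<u8>) {     // inout: exclusive access, writes visible to the caller
        v.push(0);
    }

    fn main() {
        let mut v = vec![1u8];
        inspect(&v);
        modify(&mut v);
        consume(v);                  // v is moved; the callee drops it
    }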

But there's a few key differences, due to ARC:

First and foremost, classes break "shared xor mutable" reasoning, and are effectively like using Arc<UnsafeCell<T>> in Rust. This is why Swift's collections provide CoW semantics, which Rust's Arc also safely provides. Yes, you can indeed trivially introduce Data Races into Swift code with classes.

Second, all Swift types can always be implicitly cloned, so a +0 can always be upgraded to an owned value without ceremony. That said, cloning in Swift (they just call it copying) is always just bumping reference counts. Other non-trivial operations, like copying an array's buffer to a new allocation, are only performed by mutations that trigger CoW. Note also that this means that if you trigger CoW on an array of arrays, you will get two independent outer arrays that still point to the same inner arrays (which are now primed to CoW if either side mutates them).
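
The array-of-arrays point can be seen directly with Rust's Arc, which the text compares to; a small sketch:

    use std::sync::Arc;

    // Copy-on-write one level deep: mutating one outer copy clones only the
    // outer vector, while the inner arrays remain shared between both copies.
    fn main() {
        let inner = Arc::new(vec![1, 2, 3]);
        let mut outer_a = Arc::new(vec![Arc::clone(&inner)]);
        let outer_b = Arc::clone(&outer_a);

        // Mutating outer_a forces a copy of the *outer* vector only...
        Arc::make_mut(&mut outer_a).push(Arc::new(vec![4]));

        // ...while both outer vectors still share the same inner allocation.
        assert!(Arc::ptr_eq(&outer_a[0], &outer_b[0]));
        assert_eq!(outer_a.len(), 2);
        assert_eq!(outer_b.len(), 1);
    }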

Third, +0 isn't strictly bound to pass-by-reference, and can just be a trivial bitwise copy of the value. Unfortunately, weak references require non-trivial moves because their locations are tracked for auto-nulling, so those are passed by reference when using +0 to keep it cheap. Yes, Swift has both copy and move constructors, although they're currently entirely ARC and not user-defined. Swift also has unowned references which are the same as Rust's Weak references, which have trivial moves.

In Swift's current design, +0/+1 is mostly just something the compiler does internally to optimize different calling conventions, but I think more explicit annotations are theoretically on the road map.

There's also a special path for field materialization, "modify", which returns an inout. This handles the fact that getters are naively +1, which is especially nasty for nested array operations like array[0][2] *= 2, as they would always trigger a huge temporary copy of the inner array!

And indeed, the subscript operator of Array contains a modify implementation:

    _modify {
      _makeMutableAndUnique()
      _checkSubscript_native(index)
      let address = _buffer.subscriptBaseAddress + index
      yield &address.pointee
    }

(Interestingly, this implementation is marked as inlineable, and so it's actually guaranteed to always work. Array's ABI details were pretty aggressively guaranteed since it's a relatively simple fundamental type whose performance matters a lot.)

There's also a read-only version of modify, "read", which provides a +0 getter. That one isn't as strongly motivated, but hey it's a nice little optimization to avoid a needless retain+release.

...

Swift used to have copy-array-of-self and move-array-of-self entries in the value witness table to deal with the "calling a bunch of copy/move constructors which might do nothing interesting" problem that C++ has, but this was replaced by just adding a flag to the witness table for whether the type is trivial for copies/moves. Less bloaty.

Swift decouples size from stride to distinguish trailing padding from actual content. Trailing padding shows up in the stride so that arrays properly align their contents. Trailing padding doesn't show up in size so that things like enum tags and neighboring fields can use that space. It's really neat!
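
A tiny illustration of the arithmetic (Rust itself doesn't separate the two -- size_of already includes trailing padding -- so the "size" below is just a hand-computed number for a hypothetical type with 5 meaningful bytes and 4-byte alignment):

    // stride = size rounded up to the next multiple of the alignment
    fn stride(size: usize, align: usize) -> usize {
        (size + align - 1) / align * align
    }

    fn main() {
        let (size, align) = (5, 4);          // e.g. a 4-byte field plus a 1-byte field
        assert_eq!(stride(size, align), 8);  // arrays step by 8 so elements stay aligned
        // per the text, an enum wrapping this type could pack its tag into bytes 5..8,
        // since those bytes are trailing padding, not part of the value's "size"
    }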

...

The main piece of complexity in the system for getting witness tables is that it provides a runtime reflection system for spidering through the type metadata of generic types and their associated types because you need to be able to get their witnesses too.

"

---

https://www.reddit.com/r/rust/comments/dtqm36/how_swift_achieved_dynamic_linking_where_rust/f6yh5jy/

" > You can actually already do dynamic linking in Rust, but only with C-based interfaces.

The C-like interfaces that gankra talks about are, I believe, more similar to current Rust than to C, so I think they shouldn't be called C-like. They could support structs, enums, trait objects, lifetimes, slices, etc. "Stable ABI Rust" would be a fairer name. Addition of generator-based async and generics of course won't work, but not all interfaces actually need those features. "

I think there is definitely potential for a "better dynamic linking" RFC that adds a stable ABI to the language. Of course, the default compilation mode should keep using unstable ABIs, generics, async, etc. But allowing crate writers to opt into a stable ABI would be beneficial for precompiled crates as well as compile times which are still the biggest problem with the Rust compiler for many people. "

---

oautholaf 1 day ago [-]

Here's one thing I don't understand: In addition to enabling dynamic linking, this mechanism allows Swift to compile less code (generic functions only need to be compiled once) and therefore reduce instruction cache misses.

But certainly the tradeoff for this type-info-vtable "witness table" and much more heavy boxing must impact the data cache miss rate. What's the tradeoff end up being in practical terms? Is it worth it?

Also, although it seems there's features that let you "freeze" a type, is there a practical way that a non-language expert could ever profile and discover that they may want to use such a feature to improve performance?

Especially given that Swift programs on iOS probably only dynamically link to the system libraries, this seems like a powerful language feature to enable something that could have been achieved by writing the Swift system libraries as typed facades around a typeless core, which you so often see in C++ STL implementations.

reply

---

this guy argues that if you have continuations at all, you should have multi-prompt delimited continuations, but that they're probably not worth it: https://rain-1.github.io/scheme-continuations.html

so it's probably worth reading his blog "This page is for documenting things learned from implementing tarot, a basic self hosted scheme compiler that I did.": https://rain-1.github.io/scheme

---

some notes about Golang binary filesizes:

https://www.cockroachlabs.com/blog/go-file-size/

A lot of the binary file size is from a lookup table mapping program counters to line numbers (for stacktrace purposes). In earlier versions of Golang, this table was compressed and then decompressed upon program initialization, but now, to speed up initialization, it is stored uncompressed, and so the binary size is larger.

One commenter says that storing it uncompressed also has the benefit of using less memory, because you don't have to decompress [13]. However, another (pcwalton) says, "No, there's no need for a tradeoff at all in a properly implemented compiler. You can certainly compress a table in memory and still have that table be indexable at runtime. Entropy coding is not the only type of compression that exists. DWARF solved this a decade ago." [14].
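
A hedged sketch of pcwalton's point (this is not Go's or DWARF's actual format; the names and encoding are made up): entries can stay delta-compressed in memory while lookup remains a binary search over chunk headers plus a short linear decode, so no startup-time decompression pass is needed.

    // PC -> line table, delta-encoded within chunks but indexable in place.
    struct Chunk {
        start_pc: u32,          // first PC covered by this chunk
        start_line: u32,        // line number at start_pc
        deltas: Vec<(u8, i8)>,  // (pc advance, line delta), both small in practice
    }

    fn lookup(chunks: &[Chunk], pc: u32) -> u32 {
        // binary search for the chunk covering pc
        let idx = match chunks.binary_search_by_key(&pc, |c| c.start_pc) {
            Ok(i) => i,
            Err(i) => i - 1,
        };
        let c = &chunks[idx];
        let (mut cur_pc, mut line) = (c.start_pc, c.start_line);
        // linear decode within the chunk, stopping before we pass pc
        for &(dpc, dline) in &c.deltas {
            if cur_pc + dpc as u32 > pc {
                break;
            }
            cur_pc += dpc as u32;
            line = (line as i64 + dline as i64) as u32;
        }
        line
    }

    fn main() {
        let chunks = vec![Chunk {
            start_pc: 0x1000,
            start_line: 10,
            deltas: vec![(4, 1), (8, 1), (4, -1)],
        }];
        assert_eq!(lookup(&chunks, 0x1000), 10);
        assert_eq!(lookup(&chunks, 0x1005), 11); // after the first +4 advance
    }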

"Go uses memory instead of registers to pass arguments and return values across function calls. On x86/x86-64, memory-accessing instructions use longer machine encodings than register-only instructions.In my experience, this is specifically the inflection point for the ratio between source code size and compiled code size in a monomorphic C-like language: when targeting fixed-length instruction encodings and/or a memory-heavy calling convention, the size of the compiled code grows larger than the source code (excluding comments and whitespace). We can see this with both C on ARMv6 (no Thumb) or Go on x86(-64).When targeting variable-length instruction encodings and the calling convention suitably utilizes the more compact instructions, the size of the compiled code becomes smaller than the source code. We can see this with C on x86(-64) with a decent compiler, but, as examined here, not with Go."

---

"There are efforts to define the behavior in cases where implementations have converged or died out (e.g., twos complement, shifting into the sign bit)."

---

cataphract 1 day ago [-]

Signed overflow being undefined behavior allows optimizations that wouldn't otherwise be possible

Quoting http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

> This behavior enables certain classes of optimizations that are important for some code. For example, knowing that INT_MAX+1 is undefined allows optimizing "X+1 > X" to "true". Knowing the multiplication "cannot" overflow (because doing so would be undefined) allows optimizing "X*2/2" to "X". While these may seem trivial, these sorts of things are commonly exposed by inlining and macro expansion. A more important optimization that this allows is for "<=" loops like this:

> for (i = 0; i <= N; ++i) { ... }

> In this loop, the compiler can assume that the loop will iterate exactly N+1 times if "i" is undefined on overflow, which allows a broad range of loop optimizations to kick in. On the other hand, if the variable is defined to wrap around on overflow, then the compiler must assume that the loop is possibly infinite (which happens if N is INT_MAX) - which then disables these important loop optimizations. This particularly affects 64-bit platforms since so much code uses "int" as induction variables.

reply

---

    “One of the things is that LLVM has several layers of IR while Cranelift has only one. Another is that Cranelift does use a graph coloring register allocator, but simply a dumber one, thus being faster.”

on rustc_codegen_cranelift (cg_clif for short) : " “I have the freedom to change what I want whenever I want. Sometimes there are problems I can’t solve myself though as I am not familiar enough with the respective area. For example object files, linkers and DWARF debuginfo. Luckily I know people who do know a lot about those things.”."

"There are bits and pieces missing, such as partial SIMD support, ABI Compatibility, unsized values and many more. There’s also lack of feature parity with Cranelift itself, such as cg_clif being blocked because Cranelift doesn’t support a feature LLVM does. However, these problems are shrinking and most crates do build today."

 Patrick Meredith • 2 hours ago • edited

What about stopping compilation at the LLVM bitcode file level and have a "debug exe" (shell script/bat script) that calls lli on the bitcode? That's the only part here that's wrong, LLVM is still a VM. lli still exists.

https://www.llvm.org/docs/CommandGuide/lli.html

Here's even some instructions for running rust with lli:

https://www.graalvm.org/doc...

dwheeler 3 hours ago [-]

One cool advantage of having multiple compilers for a language is that you can use one as a check on the other.

For example, if you're worried that one of the compilers might be malicious, you can use the other compiler to check on it: https://dwheeler.com/trusting-trust

Even if you're not worried about malicious compilers, you can generate code, compile it with multiple compilers, feed in the same inputs, and see where the outputs differ. This has been used as a fuzzing technique to detect subtle errors in compilers.

reply

---

https://www.infoq.com/news/2020/05/swift-5-3-windows-linux/

As a first outcome of the commitment to bring Swift to Linux, the Swift team has announced the availability of new Swift Linux distributions, including Ubuntu 20.04, CentOS 8, and Amazon Linux 2. Porting Swift to CentOS and Amazon Linux required a number of subtle changes, such as switching to a different libcurl version for FoundationNetworking, adapting the Swift package manager to the Fedora packaging system, and dropping the dependency on libatomic. For each supported platform, the Swift team provides a downloadable toolchain and Docker images.

---

Tsarbomb 4 hours ago [–]

Except you need to make sure you built against the correct version of libc because you cannot statically link that.

I remember when docker first came out and people were losing their minds about how they will be able to deploy their applications with all their dependencies in a single shippable bundle, kind of like an uber jar.

reply

nitsky 4 hours ago [–]

Rust programs can be statically linked to musl libc, producing binaries that will run on any linux distribution. You can run ldd on the binaries and confirm they have zero dependencies on shared libraries.

reply

touisteur 4 hours ago [–]

Isn't that what go is trying to alleviate, with their use of direct syscalls and no use of libc?

reply

---

https://scholarworks.iu.edu/dspace/handle/2022/24749

A data parallel compiler hosted on the GPU
Hsu, Aaron W. Advisor: Andrew Lumsdaine
Keywords: compilers; tree transformations; GPU; APL; array programming
URI: http://hdl.handle.net/2022/24749
Date: 2019-11
Publisher: [Bloomington, Ind.] : Indiana University
Type: Doctoral Dissertation
Abstract: This work describes a general, scalable method for building data-parallel by construction tree transformations that exhibit simplicity, directness of expression, and high performance on both CPU and GPU architectures when executed on either interpreted or compiled platforms across a wide range of data sizes, as exemplified and expounded by the exposition of a complete compiler for a lexically scoped, functionally oriented, commercial programming language. The entire source code to the compiler written in this method requires only 17 lines of simple code compared to roughly 1000 lines of equivalent code in the domain-specific compiler construction framework, Nanopass, and requires no domain specific techniques, libraries, or infrastructure support. It requires no sophisticated abstraction barriers to retain its concision and simplicity of form. The execution performance of the compiler scales along multiple dimensions: it consistently outperforms the equivalent traditional compiler by orders of magnitude in memory usage and run time at all data sizes and achieves this performance on both interpreted and compiled platforms across CPU and GPU hardware using a single source code for both architectures and no hardware-specific annotations or code. It does not use any novel domain-specific inventions of technique or process, nor does it use any sophisticated language or platform support. Indeed, the source does not utilize branching, conditionals, if statements, pattern matching, ADTs, recursions, explicit looping, or other non-trivial control or dispatch, nor any specialized data models.
Description: Dissertation (Ph.D.) - Indiana University, School of Informatics, Computing, and Engineering, 2019


---

https://speice.io/2018/09/primitives-in-rust-are-weird.html

if in Rust you call 8.to_string(), Rust uses to_string as a static function and passes in the integer 8 (rather than constructing some sort of OOP boxed 8 representation and then calling a to_string method on it)
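
A quick way to see that, per the linked post: the method-call sugar and the fully spelled-out static trait call are the same thing, with &8 passed directly.

    fn main() {
        let a = 8.to_string();                       // method-call sugar on a primitive
        let b = <i32 as ToString>::to_string(&8);    // the same static call, spelled out
        assert_eq!(a, b);
    }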

---

crazypython 17 hours ago [–]

GCC 10 supports Dlang directly. (The support is still immature and it's being updated to the newer version of Dlang.)

Dlang compiles quickly not because the language is simple, but because the compiler authors care about compilation speed:

Lexer skips four spaces at once https://github.com/dlang/dmd/pull/11095

Optimize core logic to 7x compilation speed of language feature https://github.com/dlang/dmd/pull/11303

Cache rarely used types to improve memory locality https://github.com/dlang/dmd/pull/11363

reply

---

estebank 13 hours ago [–]

It's not that Rust doesn't care about compilation speed, it's that final binary speed, zero cost abstractions and ergonomics lie higher in the priority list. If we didn't care about ergonomics you wouldn't be able to have diamond dependencies inside of a crate (like defining a struct in a file and its impl in a different one, or splitting impl items between two files; this is very handy but requires that the compilation unit be a crate). If we didn't care about zero cost abstractions and final binary speed we wouldn't always monomorphize generics, and would instead have a threshold where we automatically switch to dyn traits (vtables), like Swift does, but we want people to have full control over the behavior of their code, so you have to change your code in order for that behavior to change. This means that your code's performance characteristics won't change because of unrelated changes crossing some threshold, but it does mean that unless you opt into using a vtable your code will spend a lot of time expanding code for every type used in generics. This doesn't mean that we don't care about performance, or that there aren't things we can do to improve the current situation, or that we don't actively pursue those actions (take a look at https://perf.rust-lang.org/ over larger periods of time and you'll see that for the most part things are getting significantly better).

reply
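
The two strategies being contrasted, as a minimal Rust example: the generic function is monomorphized (one copy of machine code per concrete type it's used with), while the dyn version is compiled once and dispatches through a vtable.

    use std::fmt::Display;

    fn show_mono<T: Display>(x: T) {        // expanded per concrete type used
        println!("{}", x);
    }

    fn show_dyn(x: &dyn Display) {          // single copy, vtable call
        println!("{}", x);
    }

    fn main() {
        show_mono(1);       // instantiates show_mono::<i32>
        show_mono("hi");    // instantiates show_mono::<&str>
        show_dyn(&1);
        show_dyn(&"hi");
    }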

---

entha_saava 2 hours ago [–]

Type checking doesn't seem to be a major bottleneck at all. The major issues seem to be

1) Technical debt in rustc producing large amounts of LLVM IR and expecting LLVM to optimize it away

2) generic monomorphization producing vast amount of IR

IIRC.

reply

---

roca 17 hours ago [–]

Depends on what you mean by "research". If you mean "problems with a high degree of novelty" then quite a lot of research, I think. Rust is unusual: it requires a lot of tricky type inference and static analysis from the compiler, and doesn't make developers structure their code with header files etc to make separate compilation easier; BUT unlike most languages with those properties, people are writing large real-world projects in it.

reply

---

jchw 11 hours ago [–]

Something that surprised me: Compilers happily optimize pointer to member function calls out if it knows definitely what the pointer points to, but not if the pointer is to a virtual function. Recently figured that out thanks to a slightly related debate on whether it would optimize the first case at all - which I thought it would - but I’m surprised by the second case, since it’s not like there’s a virtual dispatch to worry about once you’ve already taken a reference - you’re already specifying exactly which one implementation you want. Just one of many nuggets learned from Godbolt: apparently marking a function virtual can kill optimizations even in cases where it seems like it shouldn’t.

reply

KMag 10 hours ago [–]

Where's your point of confusion? There's still virtual dispatch going on with a pointer to virtual member function... they're typically implemented as fat pointers containing the vtable offset, and calling via the pointer still does virtual dispatch.

An alternative would be for the compiler to generate a stub function (can be shared across all classes for cases where the same vtable offset is used, unless your architecture has some very exotic calling convention) that jumps to a fixed offset in the vtable, and use a skinny pointer to that stub as a virtual member function pointer, but that's not the usual implementation.

In any case, calling through a pointer to virtual member function has to perform virtual dispatch unless the compiler can statically constrain the runtime type of *this being used at the call site. Remember that C++ needs to allow separate compilation, so without a full-program optimizer that breaks dlopen(), C++ can't perform devirtualization on non-final classes.

Making the class final will allow the compiler to statically constrain the runtime type and devirtualize the pointer to member function call.

Edit: added paragraph break.

reply

jchw 8 hours ago [–]

Ahhh, nevermind. I was under the impression that when you grabbed a pointer to member function, it was grabbing a specific implementation. I actually did not know it supported virtual dispatch.

Now I think I finally understand why MSVC pointer to member functions are 16 bytes wide; I knew about the multi-inheritance this pointer adjustment, but I actually had no idea about virtual dispatch through member pointers. Frankly, I never used them in a way that would’ve mattered for this.

reply

---

" Variables don't always need to be shadowed; sometimes they're safe to clobber. Rather than taking on responsibilities of analyzing lifetimes, Mu provides a single simple rule: the first variable in a register shadows previous values, and subsequent variables to the same register in the same block clobber:

    {
      var x/ecx : int ← copy y
      var z/ecx : int ← …A…
      …B…
    }

which compiles to:

    {
      push ecx
      ecx ← copy y
      ecx ← …A…
      …B…
      pop to ecx
    }

To shadow, create a new inner block. " -- http://akkartik.name/post/mu-2019-2

---

https://jamesmcm.github.io/blog/2020/07/25/intro-dod/

the section of most interest to me is 'dynamic dispatch vs. monomorphisation'

---

"One neat trick I pulled from otcc is the symbol patching mechanism. Namely, when we don't know the value of a given symbol yet while emitting code we use the bits dedicated to this value in the machine code to store a linked list of all locations to be patched with the given symbol value. See the code of the patch function for details about that. "

---

miloignis 33 days ago

parent context prev next [–]on: Functional programming is finally going mainstream

Ah, but you don't always have to copy memory around to facilitate immutable datastructures. Koka lang and Roc lang are at the forefront of a functional-but-in-place (FBIP) style of memory management that uses hyper-fast ref-counting; if a function owns the sole reference to a data structure, it can mutate it in place instead of deallocating and reallocating. The most recent Perceus reference counting paper, implemented in Koka, had Koka's functional and persistent Red-Black tree 10% faster than the handwritten C++ version (std::map) [0], and the Roc language has a purely functional quicksort version competitive with standard imperative languages.

So it seems we can have our cake and eat it too!

[0] - https://www.microsoft.com/en-us/research/uploads/prod/2021/1...

nyanpasu64 33 days ago

parent next [–]

Does Perceus differ substantially from using https://lib.rs/crates/im and writing code which looks identical to imperative code, but cloning is faster and mutations are slower (and copy internal structures as necessary)?

miloignis 33 days ago

root parent next [–]

It does the same mutate-if-unique optimization as im, but it does so automatically and more broadly as a part of its optimized reference counting. Perceus can go beyond what im can do by re-using allocations between static types - one of their examples involves doing a tree traversal with O(1) extra space by using a zipper which gets optimized by Perceus into a Morris traversal - and the programmer doesn't have to do anything special to get the benefit! (of course, writing code in such a way that you know it will be optimized has performance benefits if you do know how it works) This sets up this wonderful situation where you can write easier-to-maintain and easier-to-verify-correct algorithms that get optimized into very efficient versions that would be hard to write by hand even in a mutable/imperative language.

Their papers and online docs are very good, I highly recommend them for more information!
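
The core mutate-if-unique idea can be sketched in Rust with Rc::make_mut (Perceus does this automatically and more aggressively as part of compiled-in reference counting; this just shows the principle, and the function name is made up):

    use std::rc::Rc;

    // Mutate in place when this is the only reference, otherwise copy first.
    fn push_fbip(list: &mut Rc<Vec<i32>>, x: i32) {
        // make_mut clones the Vec only when other Rc handles still point at it
        Rc::make_mut(list).push(x);
    }

    fn main() {
        let mut a = Rc::new(vec![1, 2, 3]);
        push_fbip(&mut a, 4);            // unique: mutated in place, no copy

        let b = Rc::clone(&a);           // now shared
        let mut c = Rc::clone(&a);
        push_fbip(&mut c, 5);            // shared: the Vec is copied first
        assert_eq!(*b, vec![1, 2, 3, 4]);
        assert_eq!(*c, vec![1, 2, 3, 4, 5]);
    }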

---

hayley-patton 71 days ago

unvote parent context next [–]on: Rust without the async (hard) part

> imho the issue isn't function coloring, threads, whatever. It's a compiler that defaults to async code in the calling convention and then optimization passes to de-async-ify (remove unnecessary yield points) the code at compile time. The result would be code that looks synchronous but is async where it matters (i/o).

Safepoints for garbage collection are somewhat similar, but for preemption one wants to interrupt threads on a timer, rather than before the collector takes over. Despite occurring very frequently (at around 100 _million_ checks per second), the time overhead is only about 2.5% or so, according to a study by Blackburn et al [0]. It appears, I think, that as long as the fast not-interrupting path is fast enough, eliminating safepoints isn't too important.

[0] Stop and Go: Understanding Yieldpoint Behaviour <https://users.cecs.anu.edu.au/~steveb/pubs/papers/yieldpoint...>
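
As a rough illustration of why the fast path can be that cheap, here's a Rust sketch of a polled yieldpoint (names invented; real VMs often use compiler-inserted checks on loop back-edges or page-protection tricks rather than a literal global flag):

    use std::sync::atomic::{AtomicBool, Ordering};

    // The hot path is a relaxed load plus a well-predicted branch, which is why
    // the check can run at very high frequency with only a few percent overhead.
    static YIELD_REQUESTED: AtomicBool = AtomicBool::new(false);

    #[inline(always)]
    fn yieldpoint() {
        if YIELD_REQUESTED.load(Ordering::Relaxed) {
            yieldpoint_slow(); // out-of-line: hand control to the GC / scheduler
        }
    }

    #[cold]
    fn yieldpoint_slow() {
        // here a runtime would park the thread, scan its stack, etc.
    }

    fn sum(xs: &[u64]) -> u64 {
        let mut total = 0;
        for &x in xs {
            yieldpoint(); // the kind of check a compiler inserts on loop back-edges
            total += x;
        }
        total
    }

    fn main() {
        println!("{}", sum(&[1, 2, 3]));
    }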

---

[15]

[16]

---

5 Corbin 24 hours ago

    if you want to support a single language in many tools, go for hand-written everything

Everything you’re saying is resonating with me, and this point feels like the biggest takeaway for today. I’m currently implementing a language as a collection of standalone executables, each in its own language, and each built for a single task. Composition of these standalone actions is sufficient for my current goals.

I wonder if this is how GCC evolved, particularly the -pipe idiom for passing code between compiler stages.

What’s strange for me is that I did not anticipate the list of languages:

    I used a Scheme (CHICKEN) for my typechecker, so that I can write my type relation directly in miniKanren
    I need Rust for my optimizer, so that I can use egg with custom rules
    I chose S-expressions for ASTs, including high-level user input, so that I don’t have to write parsers; CHICKEN and egg natively support S-expressions
    I chose PureScript for my JavaScript compiler
    I chose RPython (Python 2.7 subset) for my standalone runtime, so that I can have a JIT
    And of course there is some JavaScript, Python, and Nix gluing everything together

There was a time when I had tried OCaml for everything. Before that, it was RPython for everything, and before that, CHICKEN for everything. But none of those approaches were worthwhile; I ended up with too much duplicated code.

---

Zig's new bootstrapping process:

[17] [18] [19]

---

https://blog.nelhage.com/post/why-sorbet-is-fast/

---