(Simon Marlow on GHC:)
simonmar on Sept 4, 2010 [-]
Oh yes, immutability is crucial. Generational GC already makes you pay for mutation with a write barrier, and in our parallel GC we omit the locking when copying immutable objects, accepting that a few might get copied twice. In the new GC mutable objects become even more expensive. I don't think local GC is viable at all in a language with ubiquitous mutation.
---
Roboprog on Sept 3, 2010 [-]
An interesting approach: giving each thread its own "young generation" sub-heap, so transient objects can be disposed of without coordination from other threads / CPUs and their cache pages.
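(aside: a toy sketch of that per-thread nursery idea -- thread-local bump allocation in C; all details invented, and no actual minor collection shown:)

#include <stdio.h>
#include <stdlib.h>

#define NURSERY_SIZE (256 * 1024)

typedef struct { char *base, *top, *end; } Nursery;

static _Thread_local Nursery nursery;   /* one young generation per thread */

static void nursery_init(void) {
    nursery.base = malloc(NURSERY_SIZE);
    nursery.top = nursery.base;
    nursery.end = nursery.base + NURSERY_SIZE;
}

/* bump allocation: no locks, no coordination with other threads */
static void *nursery_alloc(size_t n) {
    n = (n + 7) & ~(size_t)7;           /* keep allocations 8-byte aligned */
    if (nursery.top + n > nursery.end)
        return NULL;                    /* a real system would run a minor GC here */
    void *p = nursery.top;
    nursery.top += n;
    return p;
}

int main(void) {
    nursery_init();
    int *x = nursery_alloc(sizeof *x);
    *x = 42;
    printf("%d\n", *x);
    return 0;
}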
scott_s on Sept 3, 2010 [-]
I worked on a memory allocator (as in malloc, not garbage collection) that takes a similar approach: http://people.cs.vt.edu/~scschnei/streamflow/
A group at Intel independently came up with a similar approach as well: http://portal.acm.org/citation.cfm?id=1133967
Roboprog on Sept 6, 2010 [-]
Cool, the streamflow thing sounds interesting. Is there a top level example or test-driver somewhere in the github project showing what typical use-cases are?
E.g. - I have a test-driver here: http://github.com/roboprog/buzzard/blob/master/test/src/main... (although I have barely started the library I was tinkering on)
Any example client program for your allocator? I'd like to see what use cases you are handling.
scott_s on Sept 6, 2010 [-]
The Larson and Recycle benchmarks are on github. You can read about them in the paper. Email me if you'd like to see an unpublished paper which has some more detail on the allocator's design.
larsberg on Sept 3, 2010 [-]
We've been doing this in Manticore since 2008 or so. We couldn't really get speedups past 12 cores without it (we have a "typical" parallel GC implemented as well to test against). Hopefully we'll get the paper on the GC and associated language trickery - we don't allow pointers between young generations - in somewhere soon :)
simonmar on Sept 3, 2010 [-]
Yes, the GHC design has certainly been influenced by Manticore (that was one of the "other designs" I referred to). Though in GHC we do have some different problems to solve, the worst of which is that we have to support a bunch of programming abstractions that use mutation.
---
interesting (unimplemented?) PEP on Python bytecode verification: http://legacy.python.org/dev/peps/pep-0330/
---
some things we might want to use machine registers for when building an OVM interpreter in assembly:
---
"
dyncall library The dyncall library encapsulates architecture-, OS- and compiler-specific function call semantics in a virtual 'bind argument parameters from left to right and then call' interface, allowing programmers to call C functions in a completely dynamic manner. In other words, instead of calling a function directly, the dyncall library provides a mechanism to push the function parameters manually and to issue the call afterwards. This means that a program can determine at runtime what function to call, and what parameters to pass to it. The library is written in C and assembly and provides a very simple C interface to program against. "
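a hedged sketch of that usage model in C, based on dyncall's documented dcNewCallVM / dcArg* / dcCall* interface (treat exact names and signatures as approximate):

/* Push arguments one by one at runtime, then issue the call. */
#include <stdio.h>
#include <dyncall.h>

static double scaled_sum(int n, double scale) { return n * scale; }

int main(void) {
    DCCallVM *vm = dcNewCallVM(4096);      /* argument-staging "call VM" */
    dcMode(vm, DC_CALL_C_DEFAULT);         /* plain C calling convention */
    dcReset(vm);
    dcArgInt(vm, 7);                       /* push parameters left to right */
    dcArgDouble(vm, 1.5);
    double r = dcCallDouble(vm, (DCpointer)&scaled_sum);  /* issue the call */
    printf("%f\n", r);
    dcFree(vm);
    return 0;
}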
---
i think i already noted this somewhere but some of these comments are a good read:
https://news.ycombinator.com/item?id=10032295
---
" up vote 13 down vote accepted
There are a number of papers on different kinds of dispatch:
M. Anton Ertl and David Gregg, Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters, in Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI 03), pp. 278-288, San Diego, California, June 2003.
M. Anton Ertl and David Gregg, The behaviour of efficient virtual machine interpreters on modern architectures, in Proceedings of the 7th European Conference on Parallel Computing (Europar 2001), pp. 403-412, LNCS 2150, Manchester, August 2001.
An excellent summary is provided by Yunhe Shi in his PhD thesis.
Also, someone discovered a new technique a few years ago which is valid ANSI C.
"
---
Doug's C# parser framework:
https://github.com/drubino/Linq.Parsers
---
"...can use FASM, which is self-hosting, if I can't compile NASM or YASM. " [1]
---
" Context-free grammars. What this really means is the code should be parsable without having to look things up in a symbol table. C++ is famously not a context-free grammar. A context-free grammar, besides making things a lot simpler, means that IDEs can do syntax highlighting without integrating most of a compiler front end. As a result, third-party tools become much more likely to exist. Redundancy. Yes, the grammar should be redundant. You've all heard people say that statement terminating ; are not necessary because the compiler can figure it out. That's true — but such non-redundancy makes for incomprehensible error messages. Consider a syntax with no redundancy: Any random sequence of characters would then be a valid program. No error messages are even possible. A good syntax needs redundancy in order to diagnose errors and give good error messages. "
"The first tool that beginning compiler writers often reach for is regex. Regex is just the wrong tool for lexing and parsing. Rob Pike explains why reasonably well."
" The philosophies of error message handling are:
Print the first message and quit. This is, of course, the simplest approach, and it works surprisingly well. Most compilers' follow-on messages are so bad that the practical programmer ignores all but the first one anyway. The holy grail is to find all the actual errors in one compile pass, leading to:
Guess what the programmer intended, repair the syntax trees, and continue. This is an ever-popular approach. I've tried it indefatigably for decades, and it's just been a miserable failure. The compiler seems to always guess wrong, and subsequent messages with the "fixed" syntax trees are just ludicrously wrong.
The poisoning approach. This is much like how floating-point NaNs are handled. Any operation with a NaN operand silently results in a NaN. Applying this to error recovery, any construct that has a leaf for which an error occurred is itself considered erroneous (but no additional error messages are emitted for it). Hence, the compiler is able to detect multiple errors as long as the errors are in sections of code with no dependency between them. This is the approach we've been using in the D compiler, and are very pleased with the results."
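a rough sketch of that poisoning idea in C (invented AST types; nothing to do with the actual D compiler internals):

#include <stdio.h>

typedef struct Node {
    const char *what;
    int is_error;                         /* the "NaN bit" */
    struct Node *kids[2];
} Node;

/* report only where an error originated; ancestors are poisoned silently */
static int check(Node *n) {
    if (!n) return 0;
    int poisoned = n->is_error;
    if (n->is_error)
        printf("error: %s\n", n->what);
    for (int i = 0; i < 2; i++)
        poisoned |= check(n->kids[i]);    /* propagate without new messages */
    n->is_error = poisoned;
    return poisoned;
}

int main(void) {
    Node bad_leaf = { "undefined identifier", 1, { 0, 0 } };
    Node ok_leaf  = { "literal", 0, { 0, 0 } };
    Node add      = { "add", 0, { &bad_leaf, &ok_leaf } };  /* gets poisoned */
    Node other    = { "call", 0, { &ok_leaf, 0 } };         /* independent, still checked */
    Node root     = { "block", 0, { &add, &other } };
    check(&root);
    return 0;
}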
" Runtime Library
Rarely mentioned, but critical, is the need to write a runtime library. This is a major project. It will serve as a demonstration of how the language features work, so it had better be good. Some critical things to get right include:
I/O performance. Most programs spend a lot of time in I/O. Slow I/O will make the whole language look bad. The benchmark is C stdio. If the language has elegant, lovely I/O APIs, but runs at only half the speed of C I/O, then it just isn't going to be attractive.
Memory allocation. A high percentage of time in most programs is spent doing mundane memory allocation. Get this wrong at your peril.
Transcendental functions. OK, I lied. Nobody cares about the accuracy of transcendental functions, they only care about their speed. My proof comes from trying to port the D runtime library to different platforms, and discovering that the underlying C transcendental functions often fail the accuracy tests in the D library test suite. C library functions also often do a poor job handling the arcana of the IEEE floating-point bestiary — NaNs, infinities, subnormals, negative 0, etc. In D, we compensated by implementing the transcendental functions ourselves. Transcendental floating-point code is pretty tricky and arcane to write, so I'd recommend finding an existing library you can license and adapting that.
A common trap people fall into with standard libraries is filling them up with trivia. Trivia is sand clogging the gears and just dead weight that has to be carried around forever. My general rule is if the explanation for what the function does is more lines than the implementation code, then the function is likely trivia and should be booted out."
---
" String I/O should be unicode-aware & support utf-8. Binary I/O should exist. Console I/O is nice, and you should support it if only for the sake of having a REPL with readline-like features. Basically all of this can be done by making your built-in functions wrappers around the appropriate safe I/O functions from whatever language you’re building on top of (even C, although I wouldn’t recommend it). It’s no longer acceptable to expect strings to be zero-terminated rather than length-prefixed. It’s no longer acceptable to have strings default to ascii encoding instead of unicode. In addition to supporting unicode strings, you should also probably support byte strings, something like a list or array (preferably with nesting), and dictionaries/associative arrays. It’s okay to make your list type do double-duty as your stack and queue types and to make dictionaries act as classes and objects. Good support for ranges/spans on lists and strings is very useful. If you expect your language to do string processing, built-in regex is important. If you provide support for parallelism that’s easier to manage than mutexes, your developers will thank you. While implicit parallelism can be hard to implement in imperative languages (much easier in functional or pure-OO languages), even providing support for thread pools, a parallel map/apply function, or piping data between independent threads (like in goroutines or the unix shell) would help lower the bar for parallelism support. Make sure you have good support for importing third party packages/modules, both in your language and in some other language. Compiled languages should make it easy to write extensions in C (and you’ll probably be writing most of your built-ins this way anyway). If you’re writing your interpreted language in another interpreted language (as I did with Mycroft) then make sure you expose some facility to add built-in functions in that language. For any interpreted language, a REPL with a good built-in online help system is a must. Users who can’t even try out your language without a lot of effort will be resistant to using it at all, whereas a simple built-in help system can turn exploration of a new language into an adventure. Any documentation you have written for core or built-in features (including documentation on internal behavior) should be assessible from the REPL. This is easy to implement (see Mycroft’s implementation of online help) and is at least as useful for the harried language developer as for the new user. "
---
ufo 7 days ago [-]
Deoptimization is actually really hard to implement if you have an ahead-of-time compiler like Cannoli. You need to take all the state that is living in machine registers or on the C stack and convert it back to whatever representation your generic interpreter uses.
I think this is actually one of the things that most gets in the way if you want to use traditional AOT compiler technology (like gcc or LLVM) to implement a JIT. In state-of-the-art JIT compilers this part is always a nest of highly complex and nonportable assembly language.
reply
---
tathougies 7 days ago [-]
Compiling to machine code is not a panacea for optimization. An optimized JIT compiler is going to blow an AOT compiler out of the water. Being smart about the machine code generated is significantly more important than generating machine code. In particular, PyPy makes several optimizations over python code that a more direct implementation of CPython at the machine level probably wouldn't. For example, PyPy erases dictionary lookups for object member access if the object shape is statically known. Given how prevalent this kind of lookup is in Python code, it's possible that even an interpreter that made this optimization would be faster than a machine code version that used an actual hash table.
I think this compiler also makes this particular optimization, but this is just one of many many optimizations PyPy does. I imagine that with sufficient work, this compiler could be brought up to speed with PyPy, but as it stands right now, PyPy simply benefits from having years of optimization work that a new project doesn't.
reply
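(aside: a rough sketch in C, with invented names, of that "shape" / hidden-class idea -- once the object's shape is known, member access becomes a fixed slot index instead of a dictionary lookup:)

#include <string.h>
#include <stdio.h>

typedef struct Shape {
    const char *names[4];   /* attribute names, in insertion order */
    int count;
} Shape;

typedef struct Obj {
    Shape *shape;           /* shared by all objects built the same way */
    double slots[4];        /* attribute values live at fixed indices */
} Obj;

/* slow path: resolve the name in the shape (stands in for a dict lookup) */
static int slot_of(Shape *s, const char *name) {
    for (int i = 0; i < s->count; i++)
        if (strcmp(s->names[i], name) == 0) return i;
    return -1;
}

int main(void) {
    Shape point_shape = { { "x", "y" }, 2 };
    Obj p = { &point_shape, { 1.0, 2.0 } };

    /* generic access: resolve the name every time */
    printf("%f\n", p.slots[slot_of(p.shape, "y")]);

    /* specialized access: a JIT (or an interpreter with inline caches) that
       has already seen this shape can emit the fixed-offset load directly */
    printf("%f\n", p.slots[1]);
    return 0;
}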
ori_b 7 days ago [-]
For most dynamic languages, the available speedups aren't in simple compilation, but in removing the runtime type checks, method lookups, and other slow operations. This needs the ability to guess at what the code is going to do based on past behavior, and generate specialized versions that get thrown away if the guesses are invalidated.
So, for example, you might see that the last 100 calls to a function were done with integers, so you can generate a variant of the function that only works for integers, and check if it's applicable when you enter the function. If that function stops getting used, you can throw it away.
Doing that well ahead of time requires an extremely good idea of how the program will behave at run time, and even with good information, is still very likely to bloat up your binary hugely. (Facebook used to compile their PHP codebase to a multi-gigabyte binary before moving to HHVM, for example).
reply
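(aside: a rough sketch in C, with invented names, of that guard-plus-specialization idea -- an int-only fast path guarded by a type check, falling back to the generic path when the guess turns out wrong:)

#include <stdio.h>

typedef enum { TAG_INT, TAG_DOUBLE } Tag;
typedef struct { Tag tag; union { long i; double d; } v; } Value;

/* generic, always-correct path: checks types on every call */
static Value add_generic(Value a, Value b) {
    Value r;
    if (a.tag == TAG_INT && b.tag == TAG_INT) {
        r.tag = TAG_INT; r.v.i = a.v.i + b.v.i;
    } else {
        r.tag = TAG_DOUBLE;
        r.v.d = (a.tag == TAG_INT ? (double)a.v.i : a.v.d)
              + (b.tag == TAG_INT ? (double)b.v.i : b.v.d);
    }
    return r;
}

/* the kind of specialized variant that might be generated after observing
   many int/int calls: one cheap guard, then straight-line integer code;
   bail out to the generic path ("deoptimize") otherwise */
static Value add_specialized(Value a, Value b) {
    if (a.tag != TAG_INT || b.tag != TAG_INT)
        return add_generic(a, b);          /* guard failed */
    Value r = { TAG_INT, { .i = a.v.i + b.v.i } };
    return r;
}

int main(void) {
    Value x = { TAG_INT, { .i = 2 } }, y = { TAG_INT, { .i = 3 } };
    printf("%ld\n", add_specialized(x, y).v.i);
    return 0;
}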
collyw 7 days ago [-]
Actually I think you answered a question that I already asked about Rust being faster than C. If you don't need to carry out as many checks then I see how that will speed things up.
reply
xapata 7 days ago [-]
JITs get to analyze both code and data and optimize for each machine deployed to. A static compiler can only analyze code and the machine used for compilation. If dependencies were pre-compiled, the static compiler won't be able to optimize their relationship with the project. If the machine is changed for deployment, the compile-time machine-specific optimizations no longer match the hardware.
More information means better optimizations. JITs FTW.
reply
jcranmer 7 days ago [-]
JITs also tend to optimize for compilation time over final performance. Doing ahead-of-time superoptimization or polyhedral loop transformation isn't going to happen in a JIT.
reply
xapata 7 days ago [-]
There's no restriction on the kind of optimization a JIT can do. Perhaps there's a current implementation tendency. In contrast, ahead-of-time (data- and platform-ignorant) compilers are restricted.
reply
---
example of compilation to Rust:
https://github.com/joncatanio/cannoli https://github.com/joncatanio/cannoli/blob/master/resources/papers/cannoli-thesis-paper.pdf
---
" 4.1 Why Compile Python to Rust? We considered three different intermediate representations to compile Python code into. The first was LLVM [17], to leverage the many optimizations implemented by the LLVM compiler. Targeting LLVM, however, would require implementing a garbage collector (or simply ignoring memory management for this prototype). The implementation would be simplified by implementing a standard library to handle various elements of the language, but writing this library in LLVM was considered to not be ideal. Writing the library in C, compiled to LLVM code, would eliminate some of the complexity of writing the library directly in LLVM. Following this idea further we considered targeting C and compiling against a library written in C. Unfortunately, this does not address the issue of memory management. Targeting Rust and compiling the output Rust code against a library written in Rust was considered the best of both worlds. We get a large number of optimizations from the Rust compiler (and the LLVM compiler) as well as memory management provided by Rust’s ownership rules. Therefore, Cannoli is a Python compiler written in Rust that compiles Python code into Rust code. " -- [5]
---
chubot 23 days ago
| parent [-] | on: Python startup time: milliseconds matter |
This is disappointing to me too, but I think there are some problems baked in to the language that make it hard.
The import code in CPython was a mess, which was apparently cleaned up by importlib in Python 3, through tremendous effort. But unfortunately I think importlib made things slower?
I recall a PyCon talk where as of 3.6, essentially everything about Python 3 is now faster than Python 2, EXCEPT startup time!
This is a shame, because I would have switched to Python 3 for startup time ALONE. (As of now, most of my code and that of my former employer is Python 2.) That would have been the perfect time to address startup time, because getting a 2x-10x improvement (which is what's needed) requires breaking changes.
I don't think there's a lack of interest in the broader Python community, but there might be a lack of interest/manpower in the core team, which leads to the situation wonderfully summarized in the recent xkcd:
FWIW I was the one who sent a patch to let Python run a .zip file back in 2007 or so, for Python 2.6 I think. This was roughly based on what we did at Google for self-contained applications. A core team member did a cleaner version of my patch, although this meant it was undocumented until Python 3.5 or so:
https://docs.python.org/3/library/zipapp.html
The .zip support at runtime was a start, but it's really the tooling that's a problem. And it's really the language that inhibits tooling.
Also, even if you distributed self-contained applications, the startup time is not great. It's improved a bit because you're "statting" a zip file rather than making syscalls, but it's still not great.
In other words, I have wondered about this "failure" for over a decade myself, and even tried to do something about it. I think the problem is that there are multiple parts to the solution, the responsibility for these parts is distributed. I hate to throw everything on the core team, but module systems and packaging are definitely a case where "distributed innovation" doesn't work. There has to be a central team setting standards that everyone else follows.
Also, it's not a trivial problem. Go is a static language and is doing better in this regard, but still people complain about packaging. (vgo is coming out after nearly a decade, etc.)
I should also add that while I think Python packaging is in the category of "barely works", I would say the same is true of Debian. And Debian is arguably the most popular Linux package manager. They're cases of "failure by success".
...
FWIW I think importing is heavily bottlenecked by I/O, in particular stat() of tons of "useless" files. In theory the C to Python change shouldn't have affected it much. But I haven't looked into it more deeply than that.
chubot 23 days ago [-]
EDIT: I should also add that the length of PYTHONPATH as constructed by many package managers is a huge problem. You're doing O(m*n) stat()s -- random disk access -- which is the slowest thing your computer can do.
m is the number of libraries you're importing, and n is the length of the PYTHONPATH.
So it gets really bad, and it's not just one person's "fault". It's a collusion between the Python interpreter's import logic and how package managers use it.
---
probably recursive descent with a shunting-yard subroutine for operator expressions (to handle precedence; pure recursive descent can't handle that [6], section 'The Operator Issue as of 1961')
instead of shunting yard, mb Pratt parsing (also called 'precedence climbing') [7]
alternately, parse operator expressions as lists, then add associativity in postprocessing.
" In addition, while pure recursive descent cannot parse operator expressions, it can recognize them. This means pure recursive descent may not be able to create the parse subtree for an operator expression itself, but it can recognize the expression and hand control over to a specialized operator expression parser. This seems to be what Lucas' 1961 algorithm did, and it is certainly what many other implementations did afterwards. Adding the operator expression subparser makes the implementation only quasi-Chomskyan, but this was a price the profession has been willing to pay.
Alternatively, a recursive descent implementation can parse operator expressions as lists, and add associativity in post-processing. This pushes some of the more important parsing out of the syntactic phase into the semantics but, once again, it seemed that Chomskyan purity had to be thrown overboard if the ship was to stay afloat. " [8]
see also section https://jeffreykegler.github.io/personal/timeline_v3#h1-the_operator_issue_as_of_1968
of course, the author of that suggests his parser, MARPA.
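a minimal precedence-climbing sketch in C of the kind of operator-expression subroutine a recursive-descent parser could hand control to (toy: single-digit operands, evaluates directly instead of building a parse subtree; all names invented):

#include <stdio.h>
#include <ctype.h>

static const char *src;

static int prec(char op) {
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default: return 0;                 /* not a binary operator */
    }
}

static int parse_primary(void) {
    while (isspace((unsigned char)*src)) src++;
    return *src ? *src++ - '0' : 0;    /* single-digit literals only */
}

/* parse operators whose precedence is at least min_prec */
static int parse_expr(int min_prec) {
    int lhs = parse_primary();
    for (;;) {
        while (isspace((unsigned char)*src)) src++;
        char op = *src;
        int p = prec(op);
        if (p == 0 || p < min_prec) return lhs;
        src++;
        int rhs = parse_expr(p + 1);   /* +1 gives left associativity */
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
        }
    }
}

int main(void) {
    src = "1 + 2 * 3 - 4";
    printf("%d\n", parse_expr(1));     /* prints 3 */
    return 0;
}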
---
" Julia’s optimizer has gotten smarter in more ways than we can list here, but a few highlights are worth mentioning. The optimizer can now propagate constants through function calls, allowing much better dead-code elimination and static evaluation than before. The compiler is also much better at avoiding allocation of short-lived wrappers around long-lived objects, which frees programmers to use convenient high-level abstractions without performance costs. " [9]
---
"On some machines indirection is slower with displace- ment, so the most-used member of a structure or a record should be first. " -- Mike Morton Boston, Massachusetts
---
C compiler with support for structs written in Assembly (nongnu.org) https://bootstrapping.miraheze.org/wiki/Stage0 http://git.savannah.nongnu.org/cgit/stage0.git/tree/stage2/cc_x86.s http://git.savannah.nongnu.org/cgit/stage0.git/tree/ISA_HEX_Map.org https://news.ycombinator.com/item?id=17851311
mmastrac 1 day ago [-]
I started working on a similar VM-based bootstrap for getting from bare metal to C compiler. If anyone is interested in collaborating let me know.
The idea is that you implement a simple, ASCII-based VM for your platform and that's enough to bootstrap a number of different assemblers, to a basic C compiler, to a full-fledged C compiler and (very minimal) POSIX environment.
The goal is twofold: a base for trusted compilation like this one, and a way to guarantee long-term executability of various archiving programs (ie: guarantee that you can unrar a file in 20 years with minimal work).
EDIT: very rough repo is here - https://github.com/mmastrac/bootstrap
reply
giomasce 1 day ago [-]
I am also doing something like that here:
https://gitlab.com/giomasce/asmc
My philosophy is to have a very simple, custom language (which I called G) for which it is easy to write a compiler in (x86) assembly, then write a C compiler in G that is good enough to compile tcc. Then tcc is known to be able to compile gcc 4.7.4 (the last version which does not require a C++ compiler).
My C compiler is starting to have a shape, and in theory (if I find the time to work on it) it should not be far from supporting all that tcc needs.
The README in the linked page contains more information.
I'll look at your code too!
reply
---
so the reference implementation will be self-hosted.
but perhaps we can also have a more performant implementation written in Rust?
---
recall that the current idea with Boot is:
make Boot the sorta-intersection of LLVM, WASM, RISCV -- but make it really easy to implement (eg no infinite registers, unlike LLVM). That way it should be dead simple to port Boot to any platform.
---
why not to target Rust, and target LLVM instead:
https://news.ycombinator.com/item?id=12148124
---
the CCL (a lisp) implementation can compile itself in about 30 seconds [10]. So i guess that's a good goal
---
why Rust abandoned segmented stacks: https://mail.mozilla.org/pipermail/rust-dev/2013-November/006314.html
this guy approves: http://dtrace.org/blogs/bmc/2018/09/18/falling-in-love-with-rust/
golang also abandoned segmented stacks (but still has resizable stacks):
" Golang first started with a segmented-stacks model where the stack would actually expand into a separate area of memory, using some clever bookkeeping to keep track. A later implementation improved performance in specific cases by instead using a contiguous stack where instead of splitting the stack, much like resizing a hashtable, a new large stack is allocated and through some very tricky pointer manipulation, all the contents are carefully copied into the new, larger, stack. [return] " [11]
---
Rust sort of abandoned M:N greenthreads:
this guy approves:
http://dtrace.org/blogs/bmc/2018/09/18/falling-in-love-with-rust/
note however that this limits concurrency, so in Oot, we probably do want greenthreads: https://rcoh.me/posts/why-you-can-have-a-million-go-routines-but-only-1000-java-threads/
---
some things that many languages (eg oot) probably should have eventually:
https://hacks.mozilla.org/2018/10/webassemblys-post-mvp-future/
---
" Unknown said...
"offset + x < length" is not a safe form for a bounds check, if x can be a value decoded from the network data. If x can have a value close to the maximum positive value for its type, then "offset + x" may overflow and compare less than length, even though x is not within bounds. Instead, compare x to length - offset, knowing that offset <= length is an invariant.
"
---
Ericson2314 1 day ago [-]
IMO most discussion of the compilation speed is incredibly narrow-minded. Value incremental compilation more than anything else. Programs barely change each rebuild, so that should solve all the performance problems without having to maim ourselves.
reply
---
https://www.reddit.com/r/rust/comments/awx9cy/github_kyrenluster_an_experimental_lua_vm/ https://github.com/kyren/luster
" Inside luster are two libraries called "gc-arena" and "gc-sequence", and they represent a new (I believe novel?) system for safe garbage collection in Rust. There have been several attempts here before such as rust-gc and shifgrethor, and this represents another attempt with... different? limitations more appropriate for implementing language runtimes like Lua. "
---
" Major changes
The lockfile (and configuration) format will become a strict subset of YAML. In the process, the lockfile information will be slightly changed to account for some long-awaited changes (such as #5892).
We'll add support for plugins, which will be able to alter various things - from adding new commands to hooking into the resolution / fetching / linking steps to add support for new package sources or install targets.
Related to the plugin system, Yarn will become an API as much as a CLI. You can expect to be able to require it and start using its components in your script - no need to parse your package.json anymore, no need to run the resolution .. Yarn will abstract all those tedious tasks away.
Support for both Node 4 and Node 6 will be dropped. We don't expect Berry to have stable releases until after Node 6 gets EOL (expected April 2019), by which point it won't matter anymore.
The log system will be overhauled - one thing in particular we'll take from Typescript are diagnostic error codes. Each error, warning, and sometimes notice will be given a unique code that will be documented - with explanations to help you understand how to unblock yourself.
Some features currently in the core (such as autoclean) will be moved into contrib plugins. They'll still be supported, but might have a different release cycle than the standard bundle.
The codebase will be ported from Flow to TypeScript. To understand the rationale please continue reading, but a quick summary is that we hope this will help our community ramp up on Yarn, and will help you build awesome new features on top of it.
The cache file format will switch from Tar to Zip, which offers better characteristics in terms of random access."
scrollaway 40 days ago [-]
Very happy to see yarn.lock will finally be a proper format that won't need its own parser. YAML subset is a pretty good choice, though I think canonicalized, indented JSON would be a better choice for this use case. Incidentally that's what npm uses as lockfile, I wonder if there's room to have the two package managers share the format (or even share the file itself).
Very excited to see shell compatibility guarantee in scripts as well. Using environment variables in scripts is a pain right now.
Finally one of the biggest news is the switch from Flow to Typescript. I think it's now clear that Facebook is admitting defeat with Flow; it brought a lot of good in the scene but Typescript is a lot more popular and gets overall much better support. Uniting the JS ecosystem around Typescript will be such a big deal.
wopian 40 days ago [-]
npm's lockfile is a pain to diff in PRs because of the JSON format where what was maybe 20 changed lines in yarn is upwards of 80 from the brackets.
With YAML and whatever format yarn.lock was in, the only changed lines are changes to the version resolutions, hash and dependencies.
donatj 40 days ago [-]
I'd say safely merging YAML diffs however could be trouble.
I don't know how restricted their YAML subset is, but in my experience it's so loose a format the only way to be sure YAML says what you think it says is to run it through a parser.
---
when doing register allocation/assignment (allocation means to decide which variables get to be in registers, assignment means actually deciding which particular register each variable goes in), you hear a lot about 'linear scan' as the simple/low-latency way, but this is still a 'global' method; the top answer at [12] described a 'local' method (local to each basic block).
another slide presentation i found once said something about two obvious local methods, 'bottom up' and 'top down'. I think 'top down' just meant to identify all of the variables in a basic block, then assign priorities to each of them.
for Oot's reference implementation (involving OVM->BootX->Boot) we clearly just want a simple local method, because if we cared about performance, we'd go from OVM directly into the native platform, skipping BootX and Boot. In fact for BootX->Boot we should just use 'top down' and say that registers up through r7 map to real registers and the rest live in memory somewhere -- this is easy to program, fast to compile, and has predictable performance for BootX programmers.
should copy this stuff into plbook after i learn a little more about it.
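a rough sketch in C of that 'top down' local method (invented data; just the priority-by-use-count assignment, no liveness analysis or spill code):

#include <stdio.h>

#define NVIRT 8    /* virtual registers appearing in the basic block */
#define NPHYS 4    /* hardware registers available                   */

int main(void) {
    /* use counts per virtual register, gathered by scanning the block */
    int uses[NVIRT] = { 5, 1, 9, 0, 3, 7, 2, 4 };
    int assignment[NVIRT];             /* phys reg index, or -1 = in memory */

    for (int v = 0; v < NVIRT; v++) assignment[v] = -1;

    /* repeatedly give a hardware register to the most-used unassigned one */
    for (int r = 0; r < NPHYS; r++) {
        int best = -1;
        for (int v = 0; v < NVIRT; v++)
            if (assignment[v] == -1 && (best == -1 || uses[v] > uses[best]))
                best = v;
        if (best == -1 || uses[best] == 0) break;
        assignment[best] = r;
    }

    for (int v = 0; v < NVIRT; v++)
        printf("v%d -> %s%d\n", v,
               assignment[v] >= 0 ? "r" : "mem slot ",
               assignment[v] >= 0 ? assignment[v] : v);
    return 0;
}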
---
level 1 gasche 10 points · 6 years ago
How does performance compare to OcamlJit?