proj-oot-old-150618-ootImplementationNotes2

how are we going to reconcile our requirement that the main interpreter/compiler be canonical with our requirement that it be small and simple?

if oot succeeds but its canonical implementation remains small and simple, it won't have many optimizations, and then people will switch to whatever other interpreters/compilers are made by others

maybe allow tons of optimizations, but only in the form of rewriting rules in separate pipeline stages between the main compiler stages? so you could just take out all of the optimization code if you wanted and the language would still work? so then the subset of the compiler with the optimizations removed is small and simple, but the whole thing is somewhat larger?
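a minimal sketch of that structure in Python (the pass name fold_constants and the toy AST are made up for illustration): the optimizations are just an optional list of AST-to-AST rewrites sandwiched between the mandatory stages, so deleting the list leaves a working (slower) compiler.

# Sketch: the compiler is a fixed pipeline of mandatory stages, plus an
# optional list of AST -> AST rewrite passes in between.

def parse(source):            # mandatory stage (stub)
    return ("lit", int(source))

def codegen(ast):             # mandatory stage (stub)
    return [("PUSH", ast[1])] if ast[0] == "lit" else []

def fold_constants(ast):      # optional rewrite pass: ('+', lit a, lit b) -> lit (a+b)
    if ast[0] == "+" and ast[1][0] == "lit" and ast[2][0] == "lit":
        return ("lit", ast[1][1] + ast[2][1])
    return ast

OPTIMIZATIONS = [fold_constants]   # delete this list and everything still works

def compile_source(source):
    ast = parse(source)
    for rewrite in OPTIMIZATIONS:
        ast = rewrite(ast)
    return codegen(ast)

print(compile_source("42"))   # [('PUSH', 42)]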

--

" The binary ended up a bit bigger than I’d like. Adding the GTK and OBus libraries in particular added a lot to the size (though they are optional). The main problem with GTK is that it has to be compiled as a plugin, because we don’t know if the target system will have libgtk. If we used a single binary and the library wasn’t present, it would refuse to start. By having it in a plugin we can try to load it, but fall back to console mode if that fails. However, compiling for plugins prevents OCaml from optimising the binary by removing unused library functions, since it doesn’t know which ones might be needed by the plugins.

The binary is compiled with debug symbols, but compiling without has almost no effect (binary becomes 1.5% smaller). "

" The ocamlbuild command automatically discovers dependencies and rebuilds only what is needed, so incremental builds are usually fast and are generally reliable (the exception is that it doesn’t notice if you remove or rename a file, but you always get an error message in that case rather than an incorrect build).

Most errors are picked up by the type checker immediately at the start of the build, rather than by the unit-tests at the end. That saves a lot of time. "

" Two things did speed it up slightly: building the tests and the main binary with a single invocation (saves having to run the dependency checker twice) and turning on parallel builds. Parallel builds didn’t help as much as I’d hoped however.

Update: edwintorok profiled the build and noticed that 25.5% of the time is spent running a bytecode version of the camlp4 pre-processor (which we use for the Lwt syntax extension and for conditional compilation) and 10.5% is spent on a bytecode version of ocamlfind (looks like an ocamlbuild bug). Why ocamlbuild’s parallelization is often disappointing today looks interesting too. "

"

There are some changes (module aliases) coming in OCaml 4.02 which should help. Currently, if I change one of the files in the Support module (e.g. Support.Sat) then it first rebuilds Sat, then rebuilds Support with the new Sat module, then rebuilds everything that uses Support (which is everything). In reality, it only needs to rebuild Zeroinstall.Solver when Sat changes. "

" If you do need to modify one of the early modules and run the unit tests quickly, a good trick is to compile to byte-code rather than to native. The byte-code compiler doesn’t do cross-module inlining optimisations, which means that as long as a module’s interface doesn’t change, it doesn’t need to recompile the things that depend on it. "

--

remy says: November 16, 2008 at 7:19 pm

Hi,

I had the same idea, you should have a look at Phil Bagwell's "Fast Functional Lists":

"Since J. McCarthy first introduced Functional Programming, the Linked List has almost universally been used as the underpinning data structure. This paper introduces a new data structure, the VList, that is compact, thread safe and significantly faster to use than Linked Lists for nearly all list operations. Space usage can be reduced by 50% to 90% and in typical list operations speed improved by factors ranging from 4 to 20 or more. Some important operations such as indexing and length are typically changed from O(N) to O(1) and O(lgN) respectively. In the current form the VList structure can provide an alternative heap architecture for functional languages using eager evaluation. To prove the viability of the new structure a language interpreter Visp, a dialect of Common Lisp, has been implemented using VList and a simple benchmark comparison with OCAML reported."

--

don't make these mistakes with the implementation of generics:

http://nomothetis.svbtle.com/smashing-swift

three issues:

--

the nimrod project thinks that bootstrapping with an LLVM backend is much harder than with a direct-to-C backend

http://forum.nimrod-lang.org/t/480

--

bkeroack 1 day ago

Depends on how you want to program: imperative vs functional.

Personally I think list comprehensions are the most beautiful part of Python, though sometimes I use map() when I'm trying to be explicitly functional (I realize it's allegedly slower, etc).

Generally I think list comprehensions are cleaner and allow you to write purer functions with fewer mutable variables. I disagree that deeply nested for loops are necessarily more readable.

SEJeff 1 day ago

Map is not allegedly slower, it is demonstrably slower.

$ python -mtimeit -s'nums=range(10)' 'map(lambda i: i + 3, nums)'
1000000 loops, best of 3: 1.61 usec per loop

$ python -mtimeit -s'nums=range(10)' '[i + 3 for i in nums]'
1000000 loops, best of 3: 0.722 usec per loop

Function calls have overhead in Python; list comprehensions are implemented knowing this fact and avoiding it, so the heavy lifting ultimately happens in C code.

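the same comparison can be reproduced from inside Python with the stdlib timeit module (a minimal sketch; exact numbers vary by machine and Python version):

import timeit

setup = "nums = range(10)"

# map() pays a Python-level function call per element (in Python 3, wrap
# in list() to force evaluation of the iterator)
t_map = timeit.timeit("list(map(lambda i: i + 3, nums))", setup=setup)

# the list comprehension inlines the body, avoiding the per-call overhead
t_comp = timeit.timeit("[i + 3 for i in nums]", setup=setup)

print(t_map, t_comp)   # expect t_comp < t_map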

---

if the Oot version is stored as an integer in a fixed-length field in some binary format (as it is in Lua, which uses one byte), how long should that field be? One byte seems way too few to me...

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Cloning into 'linux'...
remote: Counting objects: 3662322, done.
remote: Compressing objects: 100% (548398/548398), done.
remote: Total 3662322 (delta 3085574), reused 3660773 (delta 3084236)
Receiving objects: 100% (3662322/3662322), 770.73 MiB | 178 KiB/s, done.
Resolving deltas: 100% (3085574/3085574), done.

$ cd linux/

$ git rev-list --all | wc -l
456474    # all commits

$ git log --pretty=oneline | wc -l
456474    # all commits on current branch

so even 16 bits would not be enough to contain each commit. But how many releases are there?

$ git tag -l | wc -l
380

So 16 bits would be more than enough to contain every release. Which makes sense; even if there were one release a day (!), 65536 releases would take 179.55 years.

So 16 bits.
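a minimal sketch of reading/writing such a field with Python's struct (the magic bytes and layout here are made up for illustration, not an actual oot format):

import struct

MAGIC = b"Oot\x00"           # hypothetical magic bytes
VERSION = 37                 # hypothetical release number

def write_header(version):
    # 4-byte magic followed by a little-endian uint16 version field
    return struct.pack("<4sH", MAGIC, version)

def read_version(header):
    magic, version = struct.unpack_from("<4sH", header)
    if magic != MAGIC:
        raise ValueError("not an oot chunk")
    return version

assert read_version(write_header(VERSION)) == 37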

---

Lua has an 'endianness flag' in its binary headers

it also has:

- Size of int (in bytes) (default 4)
- Size of size_t (in bytes) (default 4)
- Size of Instruction (in bytes) (default 4)
- Size of lua_Number (in bytes) (default 8)
- Integral flag (default 0): 0=floating-point, 1=integral number type

" A Lua 5.1 binary chunk header is always 12 bytes in size. Since the characteristics of a Lua virtual machine is hard-coded, the Lua undump code checks all 12 of the header bytes to determine whether the binary chunk is fit for consumption or not. All 12 header bytes of the binary chunk must exactly match the header bytes of the platform, otherwise Lua 5.1 will refuse to load the chunk. The header is also not affecte d by endianness; the same code can be used to load the main header of little-endian or big-endia n binary chunks. The data type of lua_Number is determined by the size of lua_Number byte and the integral flag together. In theory, a Lua binary chunk is portable; in real life, th ere is no need for the undump code to support such a feature. ...

Anyway, most of the time there is little need to ser iously scrutinize the header, because since Lua source code is usually available, a chunk can be readily c ompiled into the native binary chunk format "
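a minimal sketch of that all-12-bytes check in Python (field order as described in the quote; the byte values assume a typical 32-bit little-endian build with an 8-byte floating-point lua_Number):

import struct

# Lua 5.1 header: ESC "Lua", version, format, endianness,
# sizeof(int), sizeof(size_t), sizeof(Instruction), sizeof(lua_Number), integral flag
EXPECTED = struct.pack("12B",
    0x1B, ord('L'), ord('u'), ord('a'),   # signature "\x1bLua"
    0x51,    # version 5.1
    0x00,    # official format
    0x01,    # little-endian
    0x04,    # sizeof(int)
    0x04,    # sizeof(size_t)
    0x04,    # sizeof(Instruction)
    0x08,    # sizeof(lua_Number)
    0x00)    # 0 = floating-point lua_Number

def check_chunk(data):
    # like Lua's undump, just compare all 12 bytes against the host's own header
    if data[:12] != EXPECTED:
        raise ValueError("binary chunk does not match this platform")
    return data[12:]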

--

in Lua, the binary function header chunks are of variable size only due to the 'source name' field, which, for the 'top-level' function of each file, contains the file name, i guess for debugging. maybe in oot it would be better to have these be of fixed size, and to standardize the length of 'source name', and to remove it for stripped files. otoh if you do that then it becomes difficult to have that scripting language magic function which tells a bit of code which filepath it is in.

in binary chunks, Lua strings have a length preceding them (of size size_t, which on x86 is 32 bits), AND a null byte at the end. i like that.

in binary chunks, Lua lists have a length preceding them, of size size_t, which on x86 is 32 bits. The instructions are themselves in such a list. So can't have more than 4gi (4 gigs) of instructions in one function.
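a minimal sketch of that string encoding (32-bit length prefix that counts a trailing null byte, as in Lua's dump format; dump_string/undump_string are made-up helper names):

import struct

def dump_string(s):
    data = s.encode("utf-8") + b"\x00"          # include the trailing NUL
    return struct.pack("<I", len(data)) + data  # 32-bit length counts the NUL too

def undump_string(buf, offset=0):
    (n,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    data = buf[start:start + n]
    assert data.endswith(b"\x00")
    return data[:-1].decode("utf-8"), start + n

s, _ = undump_string(dump_string("print"))
assert s == "print"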

--

https://github.com/akeep/nanopass-framework

--

http://www.erlang-factory.com/upload/presentations/708/HitchhikersTouroftheBEAM.pdf

---

hlvm and vmkit and llvm:

http://www.archivum.info/comp.lang.functional/2009-01/00041/Re-HLVM.html

---

http://blog.omega-prime.co.uk/?p=135

---

http://akkartik.name/post/wart-layers has the idea of a program built out of (an ordered sequence of) layers which each include their own tests; (the sum of a prefix of) layers are like versions, and the result of stacking any prefix of layers is a program which passes its own tests.

(i guess these layers are not quite like chronological versions because you can make changes to an underlying layer after you have added a new layer on top of it)
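a minimal sketch of the layering idea in Python (the layer/test plumbing here is invented for illustration and is not wart's actual mechanism): building with any prefix of the layer list yields a program that passes the tests of every layer included so far.

# Each layer extends the program and carries its own tests.

def layer1(prog):
    prog["add"] = lambda a, b: a + b
    prog["_tests"].append(lambda p: p["add"](2, 2) == 4)
    return prog

def layer2(prog):
    prog["sum"] = lambda xs: 0 if not xs else prog["add"](xs[0], prog["sum"](xs[1:]))
    prog["_tests"].append(lambda p: p["sum"]([1, 2, 3]) == 6)
    return prog

LAYERS = [layer1, layer2]

def build(n_layers):
    prog = {"_tests": []}
    for layer in LAYERS[:n_layers]:
        prog = layer(prog)
    assert all(test(prog) for test in prog["_tests"])
    return prog

build(1)   # just layer1: a smaller but still working, self-testing program
build(2)   # layer1 + layer2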

---

llvm tail calls literally just reuse the stack frame, and so require that caller and callee have identical signatures (prototypes)

llvm cannot do tail calls if 'invoke' is used (which is 'call' but with exception handling), because it can't bubble exceptions up through discarded stack frames

this suggests that llvm 'call' without delimited continuations etc. is a useful special case that could be identified in annotations of the oot AST, because it would allow lower levels of the compiler to optimize it to platform primitives (such as LLVM call, and many assembly languages' CALL/RET, which GHC Haskell apparently doesn't use due to the need to return to a point after an 'info table' instead of immediately after the CALL instruction)


there is or used to be some sort of problem in Haskell with exceptions and the type system that can cause a compiler crash:

http://existentialtype.wordpress.com/2012/08/14/haskell-is-exceptionally-unsafe/

---

consider using SEAM:

https://synrc.com/publications/cat/Functional%20Languages/AliceML/christian_mueller_diplom.pdf chapter 2

---

"

For implementing Python, you have at least five options:

If you're willing to restrict the language, it's much easier. RPython was written only to help build PyPy, but the concept could be extended to allow most of Python. Both Shed Skin and RPython insist that type inference succeed at disambiguating types. If you're willing to accept using an "any" type when type inference fails, you can handle more of the language.

The big boat-anchor feature of Python is "setattr", combined with the ability to examine and change most of the program and its objects. This isn't just reflection; it's insinuation; you can get at things which should be private to other objects or threads and mess with them. By string name, no less. This invalidates almost all compiler optimizations. It's not a particularly useful feature. It just happens to be easy to implement given CPython's internal dictionary-based model. If "setattr" were limited to a class of objects derived from a "dynamic object" class, most of the real use cases (like HTML/XML parsers where the tags appear as Python attributes) would still work, while the rest of the code could be compiled with hard offsets for object fields.

The other big problem with Python is that its threading model is no better than C's. Everything is implicitly shared. To make this work, the infamous Global Interpreter Lock is needed. Some implementations split this into many locks, but because the language has no idea of which thread owns what, there's lots of unnecessary locking.

Python is a very pleasant language in which to program. If it ever gets rid of the less useful parts of the "extremely dynamic semantics", sometimes called the Guido von Rossum Memorial Boat Anchor, it could be much more widely useful.

halflings 20 hours ago

> The big boat-anchor feature of Python is "setattr", combined with the ability to examine and change most of the program and its objects. [...] If "setattr" were limited to a class of objects derived from a "dynamic object" class, most of the real use cases [...] would still work, while the rest of the code could be compiled with hard offsets for object fields.

Isn't __slots__ made for that specific use case? (when you want to optimize your code by specifying specific attribute names)

I think the "dynamic" behavior is a sane default (since most people don't need that optimization).

comex 19 hours ago

setattr and similar dynamic features do make Python harder to optimize, but they're not that different from what you see in JavaScript. JavaScript has had an incredible amount of work spent optimizing it, but the result is a bunch of pretty damn fast language JITs that implement the full language, usually only slowing down in cases where actual use of those features requires it. Is it really that hard to come up with something like that for Python?

(Threading is a separate issue, though.)

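a minimal sketch of the __slots__ point from the thread above, in plain Python:

class Dynamic:
    pass                      # attributes live in a per-instance __dict__

class Fixed:
    __slots__ = ("x", "y")    # fixed layout, no per-instance __dict__

d = Dynamic()
setattr(d, "anything", 1)     # fine: the dict accepts any name at runtime

f = Fixed()
f.x = 1
try:
    setattr(f, "anything", 1) # AttributeError: layout is fixed at class creation
except AttributeError as e:
    print(e)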

---

http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup

(i wrote that question)

as of this writing, the only answer is:

"Python has a large standard library but only 40 modules (CPython 3.4 REPL on Windows) of it are loaded upon startup (the minimum to set up everything and start running the user program), and a number of those are built-in, i.e. C code compiled into the main binary, which saves on file I/O. In addition, startup time has long been a concern of several core developers, and consequently has been optimized quite a bit."

that makes sense. Another hypothesis i have is:

The parts of Java's stdlib that it needs upon startup are in the rt.jar file. There might be special optimizations applied to this, e.g. class data sharing, which might speed up the Java 'hello world' benchmarks (on Windows and OSX, where it is apparently default: https://github.com/jruby/jruby/wiki/Improving-startup-time ). However, when loading e.g. Clojure's, JRuby's, Kawa's, Jython's or Scala's stdlib, these optimizations are not available. So, perhaps class file loading and initialization is slow, but this slowness is not seen for the Java rt.jar stdlib. Why is class file loading slow? Well, as http://stackoverflow.com/questions/844143/why-is-the-jvm-slow-to-start#comment652767_844174 and https://en.wikipedia.org/wiki/Java_performance#cite_note-jrecache-27 say, first, having to load a lot of different classfiles, from archives no less, is probably suboptimal. Second, as http://stackoverflow.com/questions/15914094/can-any-clojure-implementation-start-fast#comment27116634_15915993 says, "One limitation of the JVM is that objects must be copied on initialization, "You can't embed any composite constants in byte code. Not even arrays.""; initialization is slow if you have to copy objects upon initialization. Even without special treatment, dynamic languages like Clojure presumably have more initialization to do than Java anyway.

So, i guess the design choices here are:

(1) neglecting to provide a way to embed composite constants in bytecode (a choice which afaict has no benefit except implementation simplicity)

(2) having an entire dynamic language's initialization and core library code written purely in bytecode, as opposed to having crucial parts of it closer to the metal, e.g. the C parts of Python

my suggested remedies are:

1) design the bytecode so that composite constants can be embedded -- note that if the bytecode is a standardized interface, this implies making a serialization or 'pickle' format part of the standard

2) provide a way for the constant parts of external libraries to be loaded into memory by loading them once 'the hard way', saving an image of that part of memory, and then mmapping that image. I'm not sure if this should be done at installation time (on the target machine) or at link time (so that these images would be distributed as binaries); I'm guessing installation time. Note that an external library might also want to do some sort of varying initialization procedure each time it starts up (check the machine's IP address, or the current working directory, or the time, or something like that), so it should still get a crack at that after loading and mmapping the stored memory image. I'm not sure how easy it is to image out and mmap in just part of a process's memory on various target architectures; would the language implementation have to malloc a separate block of memory for each library? If so, is this a problem?

3) make sure it is easy for external libraries to use native code, so that something like Clojure written on top of your language can, if they choose, provide a fast native implementation
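a minimal sketch of the flavor of remedy 2, using Python's marshal as a stand-in for a real saved-and-mmapped memory image (the cache file name and the constant data are made up):

import marshal, os

CACHE = "stdlib_constants.bin"   # hypothetical pre-built constant image

def build_constants_the_hard_way():
    # stands in for expensive init code that only produces constant data
    return {"keywords": ["if", "then", "else"], "opcodes": {"PUSH": 1, "CALL": 2}}

def load_constants():
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            return marshal.load(f)            # fast path: just deserialize the image
    consts = build_constants_the_hard_way()   # slow path, run once (e.g. at install time)
    with open(CACHE, "wb") as f:
        marshal.dump(consts, f)
    return consts

consts = load_constants()
# per-startup, machine-varying init (IP address, cwd, clock, ...) still runs afterwards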

(i just added paraphrases of these to whyOot)

---

here's my current guess as to the answer to my question:

http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup/28379292

---

" User interface

Swing has been perceived as slower than native widget toolkits, because it delegates the rendering of widgets to the pure Java 2D API. However, benchmarks comparing the performance of Swing versus the Standard Widget Toolkit, which delegates the rendering to the native GUI libraries of the operating system, show no clear winner, and the results greatly depend on the context and the environments.[70]

In most cases Java suffers greatly from its need to copy image data from one place in memory to another before rendering it to the screen. C++ can usually avoid this large overhead by accessing memory directly. The developers of Java have attempted to overcome this limitation with certain so-called "unsafe" direct memory access classes. However, those attempts fall far short of what C++ natively offers. For example, two major Java OpenGL implementations suffer tremendously from this data duplication problem which is difficult, if not impossible, to avoid with Java.[citation needed] " -- https://en.wikipedia.org/wiki/Java_performance

---

some stuff i took out of http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup before posting it:

If this were the case, why don't JVM implementors simply create a serialization/snapshot of the in-memory representation of the read-in standard library files, and distribute this as part of the JVM executable, rather than loading separate classfiles? ([17][17], [18][18], [19][19]). In fact, maybe they did; this sounds like the JVM's Class Data Sharing ([20][20], [21][21]). But apparently this mechanism only applies to the classes in the Java standard library, and cannot be used to speed up, for example, the Clojure standard library ([27][27]).

[17]: http://stackoverflow.com/questions/844143/why-is-the-jvm-slow-to-start#comment3973994_844174 [18]: http://nicholaskariniemi.github.io/2014/02/25/clojure-bootstrapping.html [19]: http://nicholaskariniemi.github.io/2014/03/19/solving-clojure-boot-time.html#comment-1715285215 [20]: https://en.wikipedia.org/wiki/Java_performance#Class_data_sharing [27]: http://nicholaskariniemi.github.io/2014/03/19/solving-clojure-boot-time.html#comment-1715285215

more on that soln: https://news.ycombinator.com/item?id=4222977 http://stackoverflow.com/questions/4692076/speed-up-application-start-by-adding-own-application-classes-to-classes-jsa

more startup time benchmarks: https://news.ycombinator.com/item?id=4222900 https://news.ycombinator.com/item?id=2698579

also, benchmarks from my system:

$ time python -c 'print "hello world"' hello world

real 0m0.030s user 0m0.020s sys 0m0.011s

$ time bash -c 'echo hello world' hello world

real 0m0.007s user 0m0.001s sys 0m0.005s

$ time perl -e 'print "hello world"' hello world real 0m0.007s user 0m0.000s sys 0m0.009s

---

http://www.joelonsoftware.com/articles/fog0000000319.html suggests always mallocing in round-up-to-the-nearest-power-of-2 amounts, and always reallocing to 2x the previous size.
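a minimal sketch of that growth policy in Python (a bytearray standing in for manual malloc/realloc; GrowBuffer is a made-up name):

class GrowBuffer:
    def __init__(self):
        self.capacity = 16          # start at a power of two
        self.length = 0
        self.data = bytearray(self.capacity)

    def append(self, b):
        if self.length == self.capacity:
            # "realloc" to 2x the previous size, so n appends cost O(n) total copying
            self.capacity *= 2
            new = bytearray(self.capacity)
            new[:self.length] = self.data[:self.length]
            self.data = new
        self.data[self.length] = b
        self.length += 1

buf = GrowBuffer()
for i in range(1000):
    buf.append(i % 256)
assert buf.length == 1000 and buf.capacity == 1024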

---

static linking and Go: https://news.ycombinator.com/item?id=4195176

---

sometimes if an associative array is small, it may be more efficient to implement it as a list:

https://news.ycombinator.com/item?id=9162458
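a minimal sketch of such a list-backed associative array in Python (SmallMap is a made-up name; the point is that a linear scan over a handful of pairs can beat a hash table's hashing and allocation overhead):

class SmallMap:
    """Associative array stored as a flat list of (key, value) pairs."""
    def __init__(self):
        self.items = []

    def get(self, key, default=None):
        for k, v in self.items:      # O(n) scan, but n is tiny by assumption
            if k == key:
                return v
        return default

    def set(self, key, value):
        for i, (k, _) in enumerate(self.items):
            if k == key:
                self.items[i] = (key, value)
                return
        self.items.append((key, value))

m = SmallMap()
m.set("x", 1); m.set("y", 2); m.set("x", 3)
assert m.get("x") == 3 and m.get("z") is None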

---

kentonv 3 days ago

Ruby (or maybe specifically Gems) startup is still quadratic in another -- IMO worse -- way: Every require() (that misses cache) searches for the requested file in _every_ dependency package until found. It seems that instead of having the file name itself include the precise package name, Ruby instead treats the packages as a search path that unions a bunch of separate trees into a single logical file tree.

The result is that Ruby does O(n^2) stat() calls at startup, where n is the number of dependency packages. Most of these operations are for files that don't exist, which means they'll miss the filesystem cache and perform absolutely atrociously on top of FUSE, NFS, etc. A typical Ruby app sitting on a loopback FUSE device (i.e. one just mirroring the local filesystem, no network involved) will take minutes to start up, even with aggressive caching enabled. :(

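a minimal sketch of why that is O(n^2) in the number of packages (Python stand-in with made-up paths; the point is one failed stat() per search-path entry per require):

import os

# each dependency package contributes one directory to the search path
search_path = ["/gems/pkg%d/lib" % i for i in range(100)]
stat_calls = 0

def require(name):
    # scan every package directory until the file is found;
    # most probes miss, and each miss costs one stat()
    global stat_calls
    for d in search_path:
        stat_calls += 1
        if os.path.exists(os.path.join(d, name + ".rb")):
            return d
    return None

# n packages each require()d once => roughly n * n stat() calls at startup
for i in range(100):
    require("pkg%d" % i)
print(stat_calls)   # 10000 here, since nothing is found and every probe misses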

---

didip 8 hours ago

Question for Rust peeps out there:

Is there documentation on how to build a single binary executable for Rust?

steveklabnik 8 hours ago

I'm assuming you're referring to static linking.

Right now, we depend on glibc, and it cannot be statically linked. By default, Rust applications statically link everything but glibc already.

At least one core team member is strongly advocating for musl support for truly standalone binaries in the 1.1 timeframe, but nothing is likely to change for 1.0, and it's too early to decide on 1.1 features yet :)

xyzzy_plugh 8 hours ago

  > Right now, we depend on glibc, and it cannot be statically linked.

Could you expand on this? Why can it not be statically linked?

spystath 8 hours ago

Name resolution libraries are loaded dynamically [1,2], so glibc can't be linked statically if you want NSS to work. I gather there might be a way to explicitly link to a specific resolution library but this is practically unsupported/poorly documented.

[1]: https://sourceware.org/glibc/wiki/FAQ#Even_statically_linked...

[2]: http://stackoverflow.com/questions/3430400/linux-static-link...

xyzzy_plugh 8 hours ago

I understand this limitation. I can currently get around this by using --enable-static-nss.

Does rust currently expose any way to do this? Could I perform linking myself as a workaround?

steveklabnik 8 hours ago

> Could I perform linking myself as a workaround?

You can but it requires lots of black magic, and certainly a nightly rather than beta Rust ;)

http://mainisusuallyafunction.blogspot.com/2015/01/151-byte-... is an old blog post that can get you started.


---

good read:

Python without an operating system http://lwn.net/Articles/641244/

---

here are various options Nim has that let you override things to make your binary size very small (possibly useful for embedded systems):

http://hookrace.net/blog/nim-binary-size/

(possibly related: http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html )

oot should probably support many of these options

---

this comment suggests that the Erlang VM is better at latency than even Go: https://news.ycombinator.com/item?id=9670133

---