Bayle Shanks's website: proj-oot-ootImplementationNotes2

how are we going to reconcile our requirement that the main interpreter/compiler be canonical with our requirement that it be small and simple?

if oot succeeds but it remains small and simple, it won't have many optimizations, and then ppl will switch to whatever other interpreters/compilers are made by others

maybe allow tons of optimizations, but only in the form of rewriting rules in separate pipeline stages between the main compiler stages? so you could just take out all of the optimization code if you wanted and the language would still work? so then the subset of the compiler with the optimizations removed is small and simple, but the whole thing is somewhat larger?

" The binary ended up a bit bigger than I’d like. Adding the GTK and OBus libraries in particular added a lot to the size (though they are optional). The main problem with GTK is that it has to be compiled as a plugin, because we don’t know if the target system will have libgtk. If we used a single binary and the library wasn’t present, it would refuse to start. By having it in a plugin we can try to load it, but fall back to console mode if that fails. However, compiling for plugins prevents OCaml from optimising the binary by removing unused library functions, since it doesn’t know which ones might be needed by the plugins.

The binary is compiled with debug symbols, but compiling without has almost no effect (binary becomes 1.5% smaller). "

" The ocamlbuild command automatically discovers dependencies and rebuilds only what is needed, so incremental builds are usually fast and are generally reliable (the exception is that it doesn’t notice if you remove or rename a file, but you always get an error message in that case rather than an incorrect build).

Most errors are picked up by the type checker immediately at the start of the build, rather than by the unit-tests at the end. That saves a lot of time. "

" Two things did speed it up slightly: building the tests and the main binary with a single invocation (saves having to run the dependency checker twice) and turning on parallel builds. Parallel builds didn’t help as much as I’d hoped however.

Update: edwintorok profiled the build and noticed that 25.5% of the time is spent running a bytecode version of the camlp4 pre-processor (which we use for the Lwt syntax extension and for conditional compilation) and 10.5% is spent on a bytecode version of ocamlfind (looks like an ocamlbuild bug). Why ocamlbuild’s parallelization is often disappointing today looks interesting too. "

There are some changes (module aliases) coming in OCaml 4.02 which should help. Currently, if I change one of the files in the Support module (e.g. Support.Sat) then it first rebuilds Sat, then rebuilds Support with the new Sat module, then rebuilds everything that uses Support (which is everything). In reality, it only needs to rebuild Zeroinstall.Solver when Sat changes. "

" If you do need to modify one of the early modules and run the unit tests quickly, a good trick is to compile to byte-code rather than to native. The byte-code compiler doesn’t do cross-module inlining optimisations, which means that as long as a module’s interface doesn’t change, it doesn’t need to recompile the things that depend on it. "

remy says: November 16, 2008 at 7:19 pm

Hi,

I had the same idea, you should have a look to Phil Bagwell “Fast Functional Lists” Since J. McCarthy? first introduced Functional Programming, the Linked List has almost universally been used as the underpinning data structure. This paper introduces a new data structure, the VList, that is compact, thread safe and significantly faster to use than Linked Lists for nearly all list operations. Space usage can be reduced by 50% to 90% and in typical list operations speed improved by factors ranging from 4 to 20 or more. Some important operations such as indexing and length are typically changed from O(N) to O(1) and O(lgN) respectively. In the current form the VList structure can provide an alternative heap architecture for functional languages using eager evaluation. To prove the viability of the new structure a language interpreter Visp, a dialect of Common Lisp, has been implemented using VList and a simple benchmark comparison with OCAML reported. Reply

don't make these mistakes with the implementation of generics:

http://nomothetis.svbtle.com/smashing-swift

three issues:

no generic functions in protocols (but see: https://news.ycombinator.com/item?id=7896161 ) (see also http://schani.wordpress.com/2014/06/11/associated-types-considered-weird/ and https://groups.google.com/forum/#!topic/swift-language/3PtydRXR0ao)
generic enums theoretically exist but 'not yet implmented' in compiler
assignment of an empty list constant as a default value to a generic parameter caused a segfault in the compiler

the nimrod project thinks that bootstrapping with an LLVM backend is much harder than with a direct-to-C backend

http://forum.nimrod-lang.org/t/480

bkeroack 1 day ago

link

Depends on how you want to program: imperative vs functional.

Personally I think list comprehensions are the most beautiful part of Python, though sometimes I use map() when I'm trying to be explicitly functional (I realize it's allegedly slower, etc).

Generally I think list comprehensions are cleaner and allow you to write purer functions with fewer mutable variables. I disagree that deeply nested for loops are necessarily more readable.

SEJeff 1 day ago

link

Map is not allegedly slower, it is demonstrably slower.

$ python -mtimeit -s'nums=range(10)' 'map(lambda i: i + 3, nums)' 1000000 loops, best of 3: 1.61 usec per loop

$ python -mtimeit -s'nums=range(10)' '[i + 3 for i in nums]' 1000000 loops, best of 3: 0.722 usec per loop

Function calls have overhead in python, list comprehensions are implemented knowing this fact and avoiding it so the heavy lifting ultimately happens in C code.

---

if the Oot version is an integer in a fixed-length field in some binary version (as it is in Lua, which uses one byte), how long should it be? One byte seems way too few to me..

$ git clone git:git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

$ git clone git:git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git Cloning into 'linux'... remote: Counting objects: 3662322, done. remote: Compressing objects: 100% (548398/548398), done. Receiving objects: 31% (1167486/3662322), 398.65 MiB?

remote: Total 3662322 (delta 3085574), reused 3660773 (delta 3084236) Receiving objects: 100% (3662322/3662322), 770.73 MiB?Resolving deltas: 100% (3085574/3085574), done.

131 KiB?/s

178 KiB?/s, done.

$ cd linux/ $ git rev-list --all

456474 # all commits $ git log --pretty=oneline 456474 # all commits on current branch

wc -l

so even 16 bits would not be enough to contain each commit. But how many releases are there?

$ git tag -l

380

wc -l

So 16 bits would be more than enough to contain every release. Which makes sense; even if there were one release a day (!), 65536 releases would take 179.55 years.

So 16 bits.

---

Lua has an 'endianness flag' in its binary headers

it also has:

Size of int (in bytes) (default 4) Size of size_t (in bytes) (default 4) Size of Instruction (in bytes) (default 4) Size of lua_Number (in bytes) (default 8) Integral flag (default 0) • 0=floating-point, 1=integral number type

" A Lua 5.1 binary chunk header is always 12 bytes in size. Since the characteristics of a Lua virtual machine is hard-coded, the Lua undump code checks all 12 of the header bytes to determine whether the binary chunk is fit for consumption or not. All 12 header bytes of the binary chunk must exactly match the header bytes of the platform, otherwise Lua 5.1 will refuse to load the chunk. The header is also not affecte d by endianness; the same code can be used to load the main header of little-endian or big-endia n binary chunks. The data type of lua_Number is determined by the size of lua_Number byte and the integral flag together. In theory, a Lua binary chunk is portable; in real life, th ere is no need for the undump code to support such a feature. ...

Anyway, most of the time there is little need to ser iously scrutinize the header, because since Lua source code is usually available, a chunk can be readily c ompiled into the native binary chunk format "

in Lua, the binary function header chunks are of variable size only due to the 'source name' field, which, for the 'top-level' function of each file, contains the file name, i guess for debugging. maybe in oot it would be better to have these be of fixed size, and to standardize the length of 'source name', and to remove it for stripped files. otoh if you do that then it becomes difficult to have that scripting language magic function which tells a bit of code which filepath it is in.

in binary chunks, Lua strings have a length preceding them (of size size_t, which on x86 is 32 bits), AND a null byte at the end. i like that.

in binary chunks, Lua lists have a length preceding them, of size size_t, which on x86 is 32 bits. The instructions are themselves in such a list. So can't have more than 4gi (4 gigs) of instructions in one function.

https://github.com/akeep/nanopass-framework

http://www.erlang-factory.com/upload/presentations/708/HitchhikersTouroftheBEAM.pdf

---

hlvm and vmkit and llvm:

http://www.archivum.info/comp.lang.functional/2009-01/00041/Re-HLVM.html

---

http://blog.omega-prime.co.uk/?p=135

---

http://akkartik.name/post/wart-layers has the idea of a program built out of (an ordered sequence of) layers which each include their own tests; (the sum of a prefix of) layers are like versions, and the result of stacking any prefix of layers is a program which passes its own tests.

(i guess these layers are not quite like chronological versions because you can make changes to an underlying layer after you have added a new layer on top of it)

---

llvm tail calls literally just reuse stack frame and so require caller callee identical signatures (prototypes)

llvm cannot do tail calls if 'invoke' is used (which is 'call' but with exception handling) b/c it cant bubble up exceptions thru discarded stack frames

this suggests that llvm 'call' w/o delimited continuations etc is a useful special case that could be identified in annotaions of oot AST because it would allow lower levels of the compiler to optimize it to platform primitives (such as LLVM CALL, and many assembly languages CALL/RET, which GHC Haskell apparently doesn't use due to the need to return to after an 'info table' instead of immediately after CALL instruction)

there is or used to be some sort of problem in Haskell with exceptions and the type system that can cause a compiler crash:

http://existentialtype.wordpress.com/2012/08/14/haskell-is-exceptionally-unsafe/

---

consider using SEAM:

https://synrc.com/publications/cat/Functional%20Languages/AliceML/christian_mueller_diplom.pdf chapter 2

---

For implementing Python, you have at least five options:

Naive interpreter (CPython). Everything is a dict. Slow. - Transliterate to some hard-compiled language, but all data is still one kind of dynamically typed object (Nuitka). A little faster, and compatible, but has limited optimization potential. - Infer types and try to create appropriate code in a faster language (Shed Skin). Hard to do, but promising. (Shed Skin has one implementor.) - Restrict the language (RPython) Potentially much faster, but incompatible. - Build a JIT compiler/interpreter combo and handle all the hard cases that require recompiling during execution (PyPy?). Hard to do, and results in a huge system, but almost compatible. After 10 years of work, it's finally happening.

If you're willing to restrict the language, it's much easier. RPython was written only to help build PyPy?, but the concept could be extended to allow most of Python. Both Shed Skin and RPython insist that type inference succeed at disambiguating types. If you're willing to accept using an "any" type when type inference fails, you can handle more of the language.

The big boat-anchor feature of Python is "setattr", combined with the ability to examine and change most of the program and its objects. This isn't just reflection; it's insinuation; you can get at things which should be private to other objects or threads and mess with them. By string name, no less. This invalidates almost all compiler optimizations. It's not a particularly useful feature. It just happens to be easy to implement given CPython's internal dictionary-based model. If "setattr" were limited to a class of objects derived from a "dynamic object" class, most of the real use cases (like HTML/XML parsers where the tags appear as Python attributes) would still work, while the rest of the code could be compiled with hard offsets for object fields.

The other big problem with Python is that its threading model is no better than C's. Everything is implicitly shared. To make this work, the infamous Global Interpreter Lock is needed. Some implementations split this into many locks, but because the language has no idea of which thread owns what, there's lots of unnecessary locking.

Python is a very pleasant language in which to program. If it ever gets rid of the less useful parts of the "extremely dynamic semantics", sometimes called the Guido von Rossum Memorial Boat Anchor, it could be much more widely useful.

halflings 20 hours ago

link

> The big boat-anchor feature of Python is "setattr", combined with the ability to examine and change most of the program and its objects. [...] If "setattr" were limited to a class of objects derived from a "dynamic object" class, most of the real use cases [...] would still work, while the rest of the code could be compiled with hard offsets for object fields.

Isn't __slots__ made for that specific usecase? (when you want to optimize your code by specifiying specific attribute names)

I think the "dynamic" behavior is a sane default (since most people don't need that optimization).

comex 19 hours ago

link

setattr and similar dynamic features do make Python harder to optimize, but they're not that different from what you see in JavaScript?. JavaScript? has had an incredible amount of work spent optimizing it, but the result is a bunch of pretty damn fast language JITs that implement the full language, usually only slowing down in cases where actual use of those features requires it. Is it really that hard to come up with something like that for Python?

(Threading is a separate issue, though.)

---

http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup

(i wrote that question)

as of this writing, the only answer is:

"Python has a large standard library but only 40 modules (CPython 3.4 REPL on Windows) of it are loaded upon startup (the minimum to set up everything and start running the user program), and a number of those are built-in, i.e. C code compiled into the main binary, which saves on file I/O. In addition, startup time has long been a concern of several core developers, and consequently has been optimized quite a bit."

that makes sense. Another hypothesis i have is:

The parts of Java's stdlib that it needs upon startup are in the rt.jar file. There might be special optimizations applied to this, e.g. class data sharing, which might speed up the Java 'hello world' benchmarks (on Windows and OSX, where it is apparently default: https://github.com/jruby/jruby/wiki/Improving-startup-time ). However, when loading eg Clojure, JRuby, Kawa, Jyython or Scala's stdlib, these optimizations are not available. So, perhaps class file loading and initialization is slow, but this slowness is not seen in the Java rt.jar stdlib. Why is class file loading slow? Well, as http://stackoverflow.com/questions/844143/why-is-the-jvm-slow-to-start#comment652767_844174 and https://en.wikipedia.org/wiki/Java_performance#cite_note-jrecache-27 say, first, having to load a lot of different classfiles, from archives no less, is probably suboptimal. Second, as http://stackoverflow.com/questions/15914094/can-any-clojure-implementation-start-fast#comment27116634_15915993 says " "One limitation of the JVM is that objects must be copied on initialization, "You can't embed any composite constants in byte code. Not even arrays.""; initialization is slow if you have to copy objects upon initialization. Even without special treatment, presumably dynamic languages like Clojure have even more initialization to do than Java, anyways.

So, i guess the design choices here are:

(1) neglecting to provide a way to embed composite constants in bytecode (a choice which afaict has no benefit except implementations simplicity) (2) having an entire dynamic language initialization and core library code written purely in bytecode, as opposed to having crucial parts of it closer-to-the-metal, e.g. the C parts of Python

my suggested remedies are: 1) design bytecode so that composite constants can be embedded -- note that if the bytecode is a standardized interface, this implies making a serialization or 'pickle' format part of the standard 2) provide a way for the constant parts of external libraries to be loaded into memory via loading them once 'the hard way', saving an image of that part of memory, and then mmapping that image. I'm not sure if this should be done at installation time (on the target machine), or at link time (so that these images would be distributed as binaries). I'm guessing installation time. Note that an external library might also want to do some sort of varying initialization procedure each time it starts up (check the machine's IP address, or the current working directory, or the time, or something like that), so it should still get a crack of that after loading and mmapping the stored memory image. I'm not sure how easy it is to image out and mmap in just part of a processes's memory on various target architectures; would the language implementation have to malloc a separate block of memory for each library? If so, is this a problem? 3) make sure it is easy for external libraries to use native code, so that something like Clojure written on top of your language can, if they choose, provide a fast native implementation

(i just added paraphrases of these to whyOot)

---

here's my current guess as to the answer to my question:

http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup/28379292

---

" User interface

Swing has been perceived as slower than native widget toolkits, because it delegates the rendering of widgets to the pure Java 2D API. However, benchmarks comparing the performance of Swing versus the Standard Widget Toolkit, which delegates the rendering to the native GUI libraries of the operating system, show no clear winner, and the results greatly depend on the context and the environments.[70]

In most cases Java suffers greatly from its need to copy image data from one place in memory to another before rendering it to the screen. C++ can usually avoid this large overhead by accessing memory directly. The developers of Java have attempted to overcome this limitation with certain so-called "unsafe" direct memory access classes. However, those attempts fall far short of what C++ natively offers. For example, two major Java OpenGL? implementations suffer tremendously from this data duplication problem which is difficult, if not impossible, to avoid with Java.[citation needed] " -- https://en.wikipedia.org/wiki/Java_performance

---

some stuff i took out of http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup before posting it:

If this were the case, why don't JVM implementors simply create a serialization/snapshot of the in-memory representation of the read-in standard library files, and distribute this as part of the JVM executable, rather than loading separate classfiles? ([17][17], [18][18], [19][19]). In fact, maybe they did; this sounds like the JVM's Class Data Sharing ([20][20], [21][21]). But apparently this mechanism only applies to the classes in the Java standard library, and cannot be used to speed up, for example, the Clojure standard libary ([27][27]).

[17]: http://stackoverflow.com/questions/844143/why-is-the-jvm-slow-to-start#comment3973994_844174 [18]: http://nicholaskariniemi.github.io/2014/02/25/clojure-bootstrapping.html [19]: http://nicholaskariniemi.github.io/2014/03/19/solving-clojure-boot-time.html#comment-1715285215 [20]: https://en.wikipedia.org/wiki/Java_performance#Class_data_sharing [27]: http://nicholaskariniemi.github.io/2014/03/19/solving-clojure-boot-time.html#comment-1715285215

more on that soln: https://news.ycombinator.com/item?id=4222977 http://stackoverflow.com/questions/4692076/speed-up-application-start-by-adding-own-application-classes-to-classes-jsa

more startup time benchmarks: https://news.ycombinator.com/item?id=4222900 https://news.ycombinator.com/item?id=2698579

also, benchmarks from my system:

$ time python -c 'print "hello world"' hello world

real 0m0.030s user 0m0.020s sys 0m0.011s

$ time bash -c 'echo hello world' hello world

real 0m0.007s user 0m0.001s sys 0m0.005s

$ time perl -e 'print "hello world"' hello world real 0m0.007s user 0m0.000s sys 0m0.009s

---

http://www.joelonsoftware.com/articles/fog0000000319.html suggests always mallocing in roundup-to-the-neareest-pwr-of-2 amounts, and always reallocing to 2x what the previous size was.

---

static linking and Go: https://news.ycombinator.com/item?id=4195176

---

sometimes if an associative array is small, it may be more efficient to implement it as a list:

https://news.ycombinator.com/item?id=9162458

---

kentonv 3 days ago

Ruby (or maybe specifically Gems) startup is still quadratic in another -- IMO worse -- way: Every require() (that misses cache) searches for the requested file in _every_ dependency package until found. It seems that instead of having the file name itself include the precise package name, Ruby instead treats the packages as a search path that unions a bunch of separate trees into a single logical file tree.

The result is that Ruby does O(n^2) stat() calls at startup, where n is the number of dependency packages. Most of these operations are for files that don't exist, which means they'll miss the filesystem cache and perform absolutely atrociously on top of FUSE, NFS, etc. A typical Ruby app sitting on a loopback FUSE device (i.e. one just mirroring the local filesystem, no network involved) will take minutes to start up, even with aggressive caching enabled. :(

---

didip 8 hours ago

Question for Rust peeps out there:

Is there a documentation on how to build single binary executable for Rust?

steveklabnik 8 hours ago

I'm assuming you're referring to static linking.

Right now, we depend on glibc, and it cannot be statically linked. By default, Rust applications statically link everything but glibc already.

At least one core team member is strongly advocating for musl support for truly standalone binaries in the 1.1 timeframe, but nothing is likely to change for 1.0, and it's too early to decide on 1.1 features yet :)

xyzzy_plugh 8 hours ago

  > Right now, we depend on glibc, and it cannot be statically linked.

Could you expand on this? Why can it not be statically linked?

spystath 8 hours ago

Name resolution libraries are loaded dynamically [1,2], so glibc can't be linked statically if you want NSS to work. I gather there might be a way to explicitly link to a specific resolution library but this is practically unsupported/poorly documented.

[1]: https://sourceware.org/glibc/wiki/FAQ#Even_statically_linked...

[2]: http://stackoverflow.com/questions/3430400/linux-static-link...

xyzzy_plugh 8 hours ago

I understand this limitation. I can currently get around this by using --enable-static-nss.

Does rust currently expose any way to do this? Could I perform linking myself as a workaround?

steveklabnik 8 hours ago

> Could I perform linking myself as a workaround?

You can but it requires lots of black magic, and certainly a nightly rather than beta Rust ;)

http://mainisusuallyafunction.blogspot.com/2015/01/151-byte-... is an old blog post that can get you started.

---

good read:

Python without an operating system http://lwn.net/Articles/641244/

---

here's various options that Nim has that let you override various things to make your binary size super small (possibly useful for embedded systems):

http://hookrace.net/blog/nim-binary-size/

oot should probably support many of these options

---

this comment suggests that the Erlang VM is better at latency than even Go: https://news.ycombinator.com/item?id=9670133

---

smegel 1 day ago

I think this would have been much better titled "Boosting NGINX Performance 9x with Asynchronous wrappers around blocking system calls".

Most people when hearing about "thread pools" in the context of a web-server think about using multiple threads for handling separate requests, which is NOT what this is about. It is using threads to enable some blocking syscalls (read and sendfile) to run asynchronously from the main event loop.

There's already a library for that! http://software.schmorp.de/pkg/libeio.html

misframer 1 day ago

Similarly, there's libuv [0], which is used for Node.js and others.

[0] https://github.com/libuv/libuv

rgbrenner 1 day ago

libuv is for C projects that miss the joy of javascript callback hell.

anacrolix 1 day ago

Or don't want to have 10000 threads. I have a Go server that regularly bumps 140k goroutines. Try that shit with native threads.

rgbrenner 1 day ago

libuv isn't unique.. It's equivalent to libev + libeio.. in fact, that's what nodejs used before writing libuv. Whether or not it's faster than those is really case-by-case.. but what you'll definitely get with libuv is callbacks everywhere.

jimjag 20 hours ago

As well as an Apache httpd MPM: Event

---

the Oot equivalent of this should work in the REPL:

raise Exception(u'Error when printing ü')

(works in Python 3, not Python 2)

---

https://news.ycombinator.com/item?id=9983282

bitL 7 hours ago

parent

flag

Imagine immutable dict. How many programs in Python you know would still work?

tikhonj 1 hour ago

parent

flag

Modern radix trees[1] can be as fast as the hash maps you'd find in a language's standard library while providing additional capabilities (ie fast ordered traversal). Making them immutable and persistent[2] is easy and only adds a bit of performance overhead.

Obviously programs that rely on communicating by mutating parts of a dictionary would have to be rewritten to use an immutable, persistent map—but they could still maintain similar performance.

[1]: Namely "adaptive radix trees" as described in “The adaptive radix tree: Artful indexing for main-memory databases” by V. Leis, A. Kemper, and T. Neumann <http://www3.informatik.tu-muenchen.de/~leis/papers/ART.pdf>

[2]: A persistent adaptive radix tree in Java: https://github.com/ankurdave/part

bitL 1 hour ago

Thanks for the radix tree reference!

On the other hand, stating "they could still maintain similar performance" for immutable dictionaries is misleading at best (you are right for dictionary sizes approaching 0). Anyone can look up benchmarks on the Internet showing real-world performance of immutable data structures and make their mind up if they actually can deliver in their particular use cases.

Mutability could be always used to boost performance on Von Neumann architecture if you know what you are doing. It's just becoming impractical due to increased complexity, pervasive security paranoia and the fact that there are very few people that can get things done right. If you want to contribute to advance your case, please create a competing computer architecture as fast as Von Neumann's, physically possible, but based on immutability.

semi-extrinsic 2 minutes ago

> Mutability could be always used to boost performance (...)

For the extreme case, see the EQUIVALENCE statement in old Fortran.

---

this Redis blog post describes how they make the memory-freeing command async (spawn a background thread to actually free the stuff asynchronously; he calls this 'lazy'):

http://antirez.com/news/93

summary relevant to here:

initially multithreaded access to data structures was a problem because data structures had sharing (sounds like something like COW)
so he ripped out the sharing; all data structures now copy
this had additional speed benefits in terms of removing one level of indirection
there is still sharing used during certain commands, but not in the basic implementation of data structures
he's thinking of moving from a mostly-singlethreaded model to a model where there can be multiple threads handling requests, and only access to the db requires a global lock (so eg the threads can concurrently parse incoming requests and create outgoing replies). Furthermore he seems to be thinking of replacing the global lock to the db with per-key locks for certain operations.
not noted in that blog post, but apparently Redis is sequentially consistent (?not positive about this, i can't find confirmation in the official documentation?). I assume that any business with multithreaded access to the db would preserve this.

my takehome:

just copy for non-threadlocal data, not COW
and consider copying even for threadlocal data if you want to have the GC in another thread; apparently a read barrier is quite expensive (otoh for the sake of not closing off options/extensibility i think we should have a read barrier regardless, at least at first) (also: i'm guessing we would rather be frugal with memory than fast, right? hmm actually that's not clear)
consider allowing annotations for COW-like-sharing during delimited codepaths only
until we see the results, i'm skeptical that the idea of introducing more locking to redis will ultimately improve performance without significantly hurting latency
the sequence of: (singlethreaded; vs global lock on db; vs per-key lock) might have something to teach us about not-fully-concurrent alternatives to eg Python's GIL. however recalling the option to temporarily escape the GIL provided you don't access any Python data structures in Cython, i bet Python is already in the 'global lock on db' stage. 'per-key locks for certain operations' would probably have to include READ barriers to preserve sequential consistency with a simple implementation.

---

would Oot rather be frugal with memory than fast?

(summary answer: i dunno)

arguments for memory at the expense of speed:

the opposite, speed at the expense of memory, is what Java does, and it takes too much memory. Its memory consumption appears to affect startup latency on consumer machines, and certainly affects its use on VPSs
- rebuttal: (a) i'm not sure that the memory usage is actually causing significant startup latency, (b) efficiency for server-side apps is a non-goal of Oot
brain-like computers will have more CPUs than present day, so (local) memory-per-CPU will be less
keeping things in cache probably helps actual speed more than optimizations that slightly reduce computation; history shows that L1 cache size appears to be increasing very slowly even as main memory size increases quickly

arguments for memory at the expense of speed:

since Moore's law has ended for serial processing speed, but memory keeps getting cheaper, serial processing speed is more precious here. By contrast, memory bloat will continue to become less and less noticable.
- rebuttal: the chart at the end of this article says exactly the opposite: http://motherboard.vice.com/read/memory-is-holding-up-the-moores-law-progression-of-processing-power . Otoh this person claims the opposite of that: http://programmers.stackexchange.com/a/38255/167249

---

may or may not be useful:

http://stackoverflow.com/questions/18459633/how-erlang-processes-access-mailbox-concurrently

---

another regex library:

https://github.com/google/re2

an interesting data structure library:

http://blog.burntsushi.net/transducers/

this data structure is good for efficiently storing and querying (including regexps and finding Levenshtein distance and doing set operations) an ordered set or ordered map when many of the keys share prefixes or suffixes (such as is the case for natural-language words; Lucene uses this for indexing)

---

firefox memory profiling:

" The Memory tool works by taking snapshots of everything in memory, and presenting them as a tree/table with various grouping settings. By default, the contents are grouped by “coarse type,” where each thing in memory falls into one of four classifications:

    Objects: JavaScript objects. Further grouped by each object’s internal [[Class]] name.
    Scripts: The JavaScript source text loaded by the web application and its resulting executable machine code produced by SpiderMonkey’s JIT compiler, IonMonkey.
    Strings: JavaScript strings used by the web application.
    Other: Miscellaneous structures that do not fit in the above categories.

You can also group the snapshot by “object class,” which groups by their JavaScript? [[Object?]] class, or by “internal type,” which groups things by their C++ type names. This latter view is mostly useful for Firefox platform developers.

Perhaps most interesting is the fourth and final grouping option: “allocation stack.” You have to turn this option on manually by ticking the “record allocation stacks” checkbox at the top of the Memory panel, since tracking allocations can degrade the application’s performance while the box is checked. The payoff, however, is worth it: this view groups the things in the heap by the source location in your JavaScript? code. "

---

BenoitP? 2 days ago

The Go runtime is really starting to look sexy. 20 ms GC pauses on 200+ GB heaps!

I remember a thread discussing a pauseless GC on the dev mailing-list; where Gil Tene, of Azul C4's fame, commented on having a clear policy on safepoints from the beginning was paramount to having a good GC. It looked like the community is very biased towards doing the fundamental things well, and attracting the right people.

And on top of that we're get BLAS and LAPACK bindings from https://github.com/gonum

---

 mwcampbell 3 hours ago

A large number of dependencies is only a problem in environments that aren't amenable to per-function static linking or tree-shaking. These include dynamically typed languages like Python, Ruby, and JavaScript? (except when using the Google Closure Compiler in advanced mode), but also platforms like the JVM and .NET when reflection is allowed. Where static linking or tree-shaking is feasible, the run-time impact of bringing in a large library but only using a small part of it is no more than the impact of rewriting the small part you actually use.

Edit: Dart is an interesting case. It has dynamic typing, but it's static enough that tree-shaking is feasible. Seth Ladd's blog post about why tree-shaking is necessary [1] makes the same point that I'm making here.

[1]: http://blog.sethladd.com/2013/01/minification-is-not-enough-... http://blog.sethladd.com/2013/01/minification-is-not-enough-you-need.html reply

nostrademons 2 hours ago

Not always. Dependencies were a huge problem at Google, even in C++ (perhaps especially in C++), because they mean that the linker has to do a lot of extra work. And unlike compiling, linking can't be parallelized, since it has to produce one binary. At one point the webserver for Google Search grew big enough that the linker started running out of RAM on build machines, and then we had a big problem.

There's still no substitute for good code hygiene and knowing exactly what you're using and what additional bloat you're bringing in when you add a library.

---

post on Perl6 implementation noting that Parrot has a problem b/c:

Parrot classes ('PMCs') each have to support a bunch of base operations (cloning etc)
therefore, adding a new PMC class, even a subtype, requires adding a largish list of function pointers to a table to represent these operations, which adds to memory costs and startup time
therefore, adding new PMC classes is done sparingly
therefore, instead of having more subtypes, PMC class methods just have lots of conditionals
which is ugly

http://www.modernperlbooks.com/mt/2011/05/when-you-lack-cheap-and-easy-polymorphism.html

---

wow look at everywhere Lua JIT works:

http://luajit.org/luajit.html

" Compatibility Windows Linux BSD OSX POSIX Embedded Android iOS PS3 PS4 PS Vita Xbox 360 GCC CLANG LLVM MSVC x86 x64 ARM PPC e500 MIPS Lua 5.1 API+ABI + JIT + BitOp? + FFI Drop-in DLL/.so Overview 3x

100x 115 KB VM 90 KB JIT 63 KLOC C 24 KLOC ASM 11 KLOC Lua "

---

    Fast iteration: “Save my file, hit reload in the browser, see what went wrong, fix it.”

With Hack, this short cycle time had to be preserved.

Implications are huge:

    Type checking algorithms must be fast and mostly incremental

    A persistent daemon maintains type system state

    Filesystem change notifications lead to “instant” type checking

    Typical typechecking time is 200ms (outliers up to 1000ms) for millions of lines of code

" -- http://www.scs.stanford.edu/16wi-cs240h/slides/fb-slides.html#%2843%29

---

" firebones 1 hour ago

Someone posted a great question here that they later deleted. My full response also wasn't submitted (and lost), so I'm trying to recreate part of it here. It was actually a good question about whether Rust's efficiency has been tested.

I've done some testing; Java is faster in terms of the sip test for a lot of stuff, especially for cases where you micro-benchmark single processes without regard to overall CPU or memory usage. However, when you scale up (to the point where JVM tuning comes into play), the raw performance in terms of response time is comparable, while the overall system utilization (e.g., "time" measurements of user/sys) is much better for Rust--Java uses a lot more system time due to JVM housekeeping threads. So more efficient is not a win for Java, even though wall-clock benchmark time might look better for a single process (as long as you don't look at overall system utilization.)

Java's performance comes at the price of more memory usage and higher variance in response time. You can attempt to constrain one or the other, but you can't constrain both at the same time. Microbenchmarks typically put a high enough ceiling on memory that they're relevant for very constrained loads. To be robust beyond the microbenchmark, incremental GC is required, which levels the playing field between Java GC and Rust's jemalloc. Still, even when you attain parity, the Java solution is using far more memory due to the JVM's OOP overhead.

Another example of a hidden "cost" of Rust comes from things like Unicode support. Recently I tested a regex-heavy algorithm that used burntsushi's regex library for Rust. In initial testing, Java blew it away. What I later realized was that the default Java implementation I used did not support Unicode characters. When I enabled that support, and enabled incremental GC (to support the scale of testing I was performing), the performance was similar.

This is another example of Rust being mischaracterized up front. Older languages took short cuts, and fare well for the low end of testing (the microbenchmark). Rust tends to look forward and to the bigger picture. "

---

 [https://github.com/mame/quine-relay One Hundred Language Quine Relay]

SmartOS? used this to test their system (an "important early milestone") [1]

" Assuming that runs, my next questions are all common apps/programs that we knew to be especially thorny for one reason or another: Go, strace, tcpdump, systemd, etc.

...I've never gotten a straight answer as to why Go makes system calls directly instead of doing what every other program on the planet does and calling into the system libraries. Certainly, it made the ports to non-Linux systems absolutely brutal: systems are required to make themselves look like Linux for Go to function correctly.

---

tonyarkles 3 days ago

Some experiments I did during my coursework for my M.Sc. indicated that it was actually worse-than-useless at the time (2009?) Here's what would happen:

When you've got one CPU core, the threads basically just act like a multiplexer. One thread runs for a while, releases the GIL, and the next thread runs for a while. Not a big deal.
When you've got multiple CPU cores, you've got a thundering herd. When the lock is released, the threads waiting on all of the other cores all try to acquire the lock at the same time. Then after one thread has run on, say, core 3, it's gone and invalidated the cache on the other cores (mark & sweep hurts caches pretty badly). The thundering herd stampedes again and the process continues.
To make matters even worse, each core runs at low utilization (e.g. a quad core machine, each core runs at ~25%). If you've got CPU throttling turned on (which my laptop, where I started the experiments, did), then the system detects that the CPU load is low and scales down the clock speed. Normally, this would result in increased CPU utilization, which would speed the CPUs back up again. Unfortunately, the per-core utilization stays pegged at 25% and things never speed back up again. The system looks at it and says "huh! only 25%! I guess we've got the CPU speed set properly!"

Maybe it's gotten better since then? I haven't checked recently.

Edit: I wish I had the results handy. The basic conclusion was that you got something like a 1.5x slowdown per additional CPU core. That's not how it's supposed to work! Using taskset to limit a multi-threaded Python process to a single core resulted in significant speedups in the use cases I tried.

nitely 3 days ago

This is a known issue in py2. On py2 when running in a multi-core machine it'll run ~1.8x slower (depending on what you are doing) than it'll run in a single-core machine. Python 3.2 ships a new GIL[0] fixing the problem.

[0] http://www.dabeaz.com/python/NewGIL.pdf

---

saboot 3 days ago

I never quite grasped the actual machinations of Python until I watched Philip Guo's lectures on Python internals.

https://www.youtube.com/watch?v=LhadeL7_EIU

It's a bit long, and definitely over several sittings, but I feel like I really understand Python better and relevant to the post, the complexities and tracking (frames, exceptions, objects, types, stacks, references) that occur behind the curtain which drive Python's slow native performance.

---

... I think a lot of people would be shocked if they sat down and really worked out just how little Python/Perl/etc. code they can run before they webserver noticeably starts slowing down at even small scales.

stickfigure 3 days ago

Agreed.

I spent a little time working with Guido on cache design for the App Engine data access api (ndb). In Java-land, it's a big win to cache datastore entities in memcache because fetch latency is about an order of magnitude faster. In Python-land, caching is a mixed bag - real-world profiling found marginal benefit even with 100% hit rate, and a significant performance disadvantage if the hit rate drops off. This is primarily attributed to pickling overhead.

---

nostrademons 11 hours ago [-]

 There's actually a fairly long history of cross-language VMs, with various degrees of success. What usually happens is that they work fine for languages that look, semantically, basically like the native language on the VM. So LLVM works well as long as your language is mostly like C (C, C++, Objective-C, Rust, Swift). Parrot works if you language is mostly like Perl 6 (Perl, Python, PHP, Ruby). .NET works if your language is mostly like C# (C#, F#, Boo, IronPython). The JVM works if your language is mostly like Java (Java, Clojure, Scala, Kotlin). Even PyPy has frontends for Ruby, PHP, Smalltalk, and Prolog.

"Mostly" in this case means semantically, not syntactically. It's things like concurrency model; floating point behavior; memory layout; FFI; semantics of basic libraries; performance characteristics; and level of dynamism. You can write a Python frontend for both the JVM and CLR, and it will look like Python but act like Java and .NET, respectively, with no guarantee that native CPython libraries will work on it.

---

"Regarding the implementation ((of Goroutines)), I think they're quite similar to the (unfortunately not too well-known) "State Threads" library, just quite lower-level (as Go doesn't rely on libc or things like this and talks directly to the OS kernel) — you could read the introductory paper for the ST library where the concept is quite well explained."

http://state-threads.sourceforge.net/

"Libmill is a library that introduces Go-style concurrency to C." http://libmill.org/

---

Olin Shivers:

"ANF has been the fashionable representation for about ten years now; CPS is out of favor. ...However, just to tell you where I am on this issue, I think the whole movement from CPS to ANF is a bad idea (though Sabry & Felleisen's technical observations and math are as rock solid as one would expect from people of their caliber)."

---

David "Kelsey subsequently spent time as a prof at Northeastern, then left for NEC's prestige lab in Princeton, where he worked on the Kali distributed system. He also got set down precisely on paper something all the CPS people knew at some intuitive level: that the whole "SSA" revolution in the classical compiler community was essentially a rediscovery of CPS. (Harrumph.) " (Olin Shivers, [2])

---

some stack alternatives that i don't understand:

" In what I'll call 'traditional TCO', activation records are allocated on a stack and under certain conditions (tail-calls) one will reuse the activation record. Thus the stack doesn't actually grow (except perhaps if the new activation record is larger than the prior one). As an alternative to the stack the activation records are created on the heap. For performance, this is usually in a clean nursery-region of the heap where one can increment a pointer to perform an allocation. One typically compiles into continuation-passing style, and each continuation therefore serves as a thread. A copying garbage collector ensures there is always clean space for a new nursery. For lack of a better name, I'll call this the 'generic no-stack' design. From the generic no-stack design, one might express a 'generic TCO' which simply reuses the same activation-record - no matter where it is allocated (i.e. on the stack or the heap). However, with CPS this isn't very useful (excepting to avoid a simple allocation) since the continuation is no longer implicit to the stack. In Chicken Scheme, the 'no-stack' approach is adapted to work with C as a target language. The data uses automatic allocation on the C stack, which will grow with each function call... but at a certain size (default 64 kB, IIRC, checked via a pointer comparison) a small copying-GC moves the important items from the C stack to the heap, then a longjump resets the stack with the new continuation. The C stack serves as a nursery, and the C language also implements the various compiled functions. Another variation on the no-stack design is Stallman's 'Phantom Stacks'. Unlike the generic (CPS) variation, one will occasionally return to an older activation record before continuing the computation, thus old activation records represent both history and futures - as per the traditional 'stack'. If a dangerous reference is created (a pointer from heap to stack, or a pointer from earlier in stack to later in stack as per upfunarg) a copying collection is performed immediately and a new stack is created. ... IIRC, Baker also produced something similar, using particular arrangements of pointers - critically, with the stack growing away from the heap - to reduce to a simple comparison the detection of dangerous references. " -- [3]

---

Making TCO explicit

If you are a tenant of the "TCO is just an optimisation" camp, then there is no incentive to make it explicit.

On the other hand, if you believe that TCO is necessary for the semantics of the language (as argued in the original post), then it would make a lot of sense to make it explicit. If nothing else, it would allow the compiler/runtime to tell you "that call is marked as tail-call, but there is no way it could be so go and rewrite this code properly".

The recursive call syntax is so convenient though. I would not trade it for goto. By Denis Bredelet -jido at Fri, 2009-12-04 13:07

---

so, in Oot, we want to be passing around tables (graphs) where other languages would be passing class instances. But, we want a structural type system to guarantee that the shape of the table matches this or that pattern. This means that we can represent or almost represent structs by a table that is typed by a pattern that exactly defines the shape of the table.

But now what if we have a table with a type that says it has fields x, y, and z, but doesn't prohibit it from having other fields also? Eg most of the time this is some simple thing that only has the given fields, but sometimes some other code extends it with some other data. We'd like the compiler to still be able to represent this in an efficient way for the common case of accessing and storing only the given fields, rather than just going back to a hashtable.

One way would be to represent it as an ordinary struct, with an extra field (call it 'extra' for now). 'Extra' can then point to another struct with another 'extra', or just to a hashtable. Now the objects has a statically known fixed size in memory, like a struct, and the common case of accessing the initially defined variables can be accomplished without indirection, like a struct.

This is sort of the implementation equivalent of 'row polymorphism' in type systems:

https://brianmckenna.org/blog/row_polymorphism_isnt_subtyping

---

kxyvr 16 days ago

parent [-]

on: Higher-order abstract syntax

If anyone want another concrete example of a tagless interpreter in OCaml, I wrote one on SO awhile back:

https://stackoverflow.com/questions/22676975/simple-lambda-c...

Candidly, I find them a pain to write, but it is faster and integrates with the host language's type checker.

---

"as compiler optimizations exploit increasingly recondite properties of the programming language definition, we find ourselves having to program as if the compiler were our ex-wife’s or ex-husband’s divorce lawyer, lest it introduce security bugs into our kernels, as happened with FreeBSD? a couple of years back with a function erroneously annotated as noreturn, and as is happening now with bounds checks depending on signed overflow behavior."

---

hey, a Clojure implementation with low startup latency!

https://anmonteiro.com/2016/11/the-fastest-clojure-repl-in-the-world/

---

lobster_johnson 2 days ago [-]

While I agree wholeheartedly that Erlang's design is better:

To be fair, you can't get Erlang's semantics without also inventing processes and supervisor trees. And those things would be largely meaningless without immutability.

Erlang's magic comes from how all the puzzle pieces fit together into a whole: For example, with immutability comes the ability to always be able to restart from a known state, and probably greatly simplifies the internal implementation of the per-process heap.

Those things fundamentally change the language; Go couldn't adopt Erlang's semantics without eschewing almost everything that makes it "Go".

Error handling is really the least of it. Attempting to replicate some of Erlang's high-level design principles at a quickly lead to roadblocks. For example, goroutines cannot be killed; this means it's impossible to fully implement supervisor trees in Go. It's possible that the runtime could support this at some point, but it's a big can of worms.

solidsnack9000 2 days ago [-]

Immutability probably does simplify a lot of things; but it isn't actually required to get Erlang's model, since Erlang isolates the state of processes from each other. Each process could have its own isolated mutable state as opposed to an immutable one and it wouldn't change the "let it crash" philosophy.

dragonwriter 2 days ago [-]

>Each process could have its own isolated mutable state

Erlang processes do have their own isolated mutable state.

But it's walled off and accessed through calls rather than variable references; in the non-SMP implementationthis enabled copy-free local (same node) message sends (and it still does, but only for large binaries -- the more general use turned out to be less efficient than copying in SMP because of locking issues related to garbage collection and some other implementation details.)

lobster_johnson 2 days ago [-]

You can't guarantee that isolation without immutabilty, or at least forced copying.

In Go, it's extremely easy to accidentally send a pointer (or a data structure that contains a deeply nested pointer somewhere; this includes embedded maps, slices and channels, which are all reference types) on a channel, which is bound to break mutability guarantees at some point.

solidsnack9000 1 day ago [dupe] [-]

Mutable locals do not necessarily imply pointers...

lobster_johnson 1 day ago [-]

No, but how do you pass "mutable locals" to other processes? That's the whole question.

If you pass a "local" on a channel to a different goroutine, Go can't guarantee to the receiver (which now has it as a "local" variable) that the sender (which still has it as a "local") won't change it:

    d := map[string]int{}
    ch <- d
    d["foo"] = 42  // Oops!

The only way to do this safely is to to write a channel abstraction which uses gob or similar to marshal and unmarshal all data, thus copying it and preventing concurrent mutation, and then always use the channel abstraction. But always copying tends to be terrible for performance. Erlang is able to do zero-copy sends to process mailboxes because the data is guaranteed to be immutable.

Sure, you could instead mandate that senders never hold on to data sent on channels, but can you ensure this 100% across a codebase contributed to by an entire team?

klibertp 1 day ago [-]

> Sure, you could instead mandate that senders never hold on to data sent on channels, but can you ensure this 100% across a codebase contributed to by an entire team?

Isn't that precisely what Rust is all about?

lobster_johnson 1 day ago [-]

Pretty much, yes.

---

https://m.phys.org/news/2017-01-middle-popular-yields-more-efficient-parallel.html

" Cilk adds just two commands to C: "spawn," which initiates a fork, and "sync," which initiates a join. That makes things easy for programmers writing in Cilk but a lot harder for Cilk's developers. ... All previous compilers for fork-join languages are adaptations of serial compilers and add the runtime invocations in the front end, before translating a program into an intermediate representation, and thus before optimization. In their paper, the researchers give an example of what that entails. Seven concise lines of Cilk code, which compute a specified term in the Fibonacci series, require the compiler to add another 17 lines of runtime invocations. The middle end, designed for serial code, has no idea what to make of those extra 17 lines and throws up its hands.

The only alternative to adding the runtime invocations in the front end, however, seemed to be rewriting all the middle-end optimization algorithms to accommodate the fork-join model. And to many—including Leiserson, when his group was designing its first Cilk compilers—that seemed too daunting.

Schardl and Moses's chief insight was that injecting just a little bit of serialism into the fork-join model would make it much more intelligible to existing compilers' optimization algorithms. Where Cilk adds two basic commands to C, the MIT researchers' intermediate representation adds three to a compiler's middle end: detach, reattach, and sync.

The detach command is essentially the equivalent of Cilk's spawn command. But reattach commands specify the order in which the results of parallel tasks must be recombined. That simple adjustment makes fork-join code look enough like serial code that many of a serial compiler's optimization algorithms will work on it without modification, while the rest need only minor alterations. "

https://cpc2016.infor.uva.es/wp-content/uploads/2016/06/CPC2016_paper_12.pdf

---

jimrandomh 14 hours ago [-]

Non-Rust-user here. I tried Rust about a year ago and gave up on it as not ready. Unfortunately, I don't think this roadmap addresses the problems that made me reject it the first time around.

The main problem I have with Rust is that it's not up to the task of dealing with C APIs, particularly POSIX. The issue I got stuck on was https://www.reddit.com/r/rust/comments/47a0s3/dealing_with_v... . There are two possible things Rust could do that would make me give it a second chance: either commit to maintaining bindings for all of POSIX and libc as part of Rust core, or fold rust-bindgen into core and get it into shape.

Creating higher-level libraries like Tokio is nice for some use cases, but right now Rust doesn't have a working safety valve for things C can do that Rust can't. This greatly magnifies problems like lack of bindings for select() (see http://esr.ibiblio.org/?p=7294&cpage=1 ; I too ran into that problem, and lost a few days to it.)

steveklabnik 14 hours ago [-]

Did you try the nix crate? It looks like maybe not. It should have all of that stuff already wrapped for you. For example: https://docs.rs/nix/0.7.0/nix/sys/termios/fn.tcgetattr.html (select is in there too https://docs.rs/nix/0.7.0/nix/sys/select/fn.select.html )

rust-bindgen has been improving a ton, and it is in fact on that roadmap, under

> Integration with other languages, running the gamut from C to JavaScript?.

jimrandomh 14 hours ago [-]

Happy to hear bindgen is getting attention; I didn't actually notice that when I looked through the roadmap.

I didn't try the nix crate at the time, but looking at it just now - it doesn't solve the portability issue. It defines struct Termios in https://github.com/nix-rust/nix/blob/master/src/sys/termios.... , with something #ifdef-ish branching on operating system, but not on CPU architecture. On quick inspection, I think it's probably incorrect on x86-32, and this crate is definitely a major liability for portability.

---