proj-oot-ootImplementationNotes2

how are we going to reconcile our requirement that the main interpreter/compiler be canonical with our requirement that it be small and simple?

if oot succeeds but it remains small and simple, it won't have many optimizations, and then ppl will switch to whatever other interpreters/compilers are made by others

maybe allow tons of optimizations, but only in the form of rewriting rules in separate pipeline stages between the main compiler stages? so you could just take out all of the optimization code if you wanted and the language would still work? so then the subset of the compiler with the optimizations removed is small and simple, but the whole thing is somewhat larger?
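
a rough sketch of what that could look like (all names invented, purely illustrative): the core pipeline is a handful of stages, and every optimization is a separable IR -> IR rewrite pass, so deleting the pass list still leaves a working (just slower) compiler:

    # toy compiler whose optimizations are optional, removable rewrite passes
    def parse(src):
        # toy "AST": list of (op, arg...) tuples, e.g. "push 1; pop; push 2"
        return [tuple(line.split()) for line in src.splitlines() if line]

    def lower(ast):
        return list(ast)          # identity in this toy; real lowering goes here

    def codegen(ir):
        return "\n".join(" ".join(instr) for instr in ir)

    def drop_push_pop_pairs(ir):
        # example rewrite rule: a push immediately followed by a pop is dead
        out, i = [], 0
        while i < len(ir):
            if i + 1 < len(ir) and ir[i][0] == "push" and ir[i + 1][0] == "pop":
                i += 2            # delete the pair
            else:
                out.append(ir[i])
                i += 1
        return out

    OPT_PASSES = [drop_push_pop_pairs]   # the empty list is also a valid compiler

    def compile_src(src, optimize=True):
        ir = lower(parse(src))
        if optimize:
            for rewrite in OPT_PASSES:
                ir = rewrite(ir)          # each pass is IR -> IR, so it can be dropped
        return codegen(ir)

    print(compile_src("push 1\npop\npush 2"))   # -> "push 2"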

--

" The binary ended up a bit bigger than I’d like. Adding the GTK and OBus libraries in particular added a lot to the size (though they are optional). The main problem with GTK is that it has to be compiled as a plugin, because we don’t know if the target system will have libgtk. If we used a single binary and the library wasn’t present, it would refuse to start. By having it in a plugin we can try to load it, but fall back to console mode if that fails. However, compiling for plugins prevents OCaml from optimising the binary by removing unused library functions, since it doesn’t know which ones might be needed by the plugins.

The binary is compiled with debug symbols, but compiling without has almost no effect (binary becomes 1.5% smaller). "

" The ocamlbuild command automatically discovers dependencies and rebuilds only what is needed, so incremental builds are usually fast and are generally reliable (the exception is that it doesn’t notice if you remove or rename a file, but you always get an error message in that case rather than an incorrect build).

Most errors are picked up by the type checker immediately at the start of the build, rather than by the unit-tests at the end. That saves a lot of time. "

" Two things did speed it up slightly: building the tests and the main binary with a single invocation (saves having to run the dependency checker twice) and turning on parallel builds. Parallel builds didn’t help as much as I’d hoped however.

Update: edwintorok profiled the build and noticed that 25.5% of the time is spent running a bytecode version of the camlp4 pre-processor (which we use for the Lwt syntax extension and for conditional compilation) and 10.5% is spent on a bytecode version of ocamlfind (looks like an ocamlbuild bug). Why ocamlbuild’s parallelization is often disappointing today looks interesting too. "

"

There are some changes (module aliases) coming in OCaml 4.02 which should help. Currently, if I change one of the files in the Support module (e.g. Support.Sat) then it first rebuilds Sat, then rebuilds Support with the new Sat module, then rebuilds everything that uses Support (which is everything). In reality, it only needs to rebuild Zeroinstall.Solver when Sat changes. "

" If you do need to modify one of the early modules and run the unit tests quickly, a good trick is to compile to byte-code rather than to native. The byte-code compiler doesn’t do cross-module inlining optimisations, which means that as long as a module’s interface doesn’t change, it doesn’t need to recompile the things that depend on it. "

--

remy says: November 16, 2008 at 7:19 pm

Hi,

I had the same idea, you should have a look at Phil Bagwell's “Fast Functional Lists”. From its abstract: "Since J. McCarthy first introduced Functional Programming, the Linked List has almost universally been used as the underpinning data structure. This paper introduces a new data structure, the VList, that is compact, thread safe and significantly faster to use than Linked Lists for nearly all list operations. Space usage can be reduced by 50% to 90% and in typical list operations speed improved by factors ranging from 4 to 20 or more. Some important operations such as indexing and length are typically changed from O(N) to O(1) and O(lg N) respectively. In the current form the VList structure can provide an alternative heap architecture for functional languages using eager evaluation. To prove the viability of the new structure a language interpreter Visp, a dialect of Common Lisp, has been implemented using VList and a simple benchmark comparison with OCAML reported."

--

don't make these mistakes with the implementation of generics:

http://nomothetis.svbtle.com/smashing-swift

three issues:

--

the nimrod project thinks that bootstrapping with an LLVM backend is much harder than with a direct-to-C backend

http://forum.nimrod-lang.org/t/480

--

bkeroack 1 day ago


Depends on how you want to program: imperative vs functional.

Personally I think list comprehensions are the most beautiful part of Python, though sometimes I use map() when I'm trying to be explicitly functional (I realize it's allegedly slower, etc).

Generally I think list comprehensions are cleaner and allow you to write purer functions with fewer mutable variables. I disagree that deeply nested for loops are necessarily more readable.

reply

SEJeff 1 day ago


Map is not allegedly slower, it is demonstrably slower.

$ python -mtimeit -s'nums=range(10)' 'map(lambda i: i + 3, nums)'
1000000 loops, best of 3: 1.61 usec per loop

$ python -mtimeit -s'nums=range(10)' '[i + 3 for i in nums]'
1000000 loops, best of 3: 0.722 usec per loop

Function calls have overhead in python, list comprehensions are implemented knowing this fact and avoiding it so the heavy lifting ultimately happens in C code.

reply

---

if the Oot version is an integer in a fixed-length field in some binary version (as it is in Lua, which uses one byte), how long should it be? One byte seems way too few to me..

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Cloning into 'linux'...
remote: Counting objects: 3662322, done.
remote: Compressing objects: 100% (548398/548398), done.
remote: Total 3662322 (delta 3085574), reused 3660773 (delta 3084236)
Receiving objects: 100% (3662322/3662322), 770.73 MiB | 178 KiB/s, done.
Resolving deltas: 100% (3085574/3085574), done.

$ cd linux/
$ git rev-list --all | wc -l
456474    # all commits
$ git log --pretty=oneline | wc -l
456474    # all commits on current branch

so even 16 bits would not be enough to contain each commit. But how many releases are there?

$ git tag -l | wc -l
380

So 16 bits would be more than enough to contain every release. Which makes sense; even if there were one release a day (!), 65536 releases would take 179.55 years.

So 16 bits.

---

Lua has an 'endianness flag' in its binary headers

it also has:

Size of int (in bytes) (default 4)
Size of size_t (in bytes) (default 4)
Size of Instruction (in bytes) (default 4)
Size of lua_Number (in bytes) (default 8)
Integral flag (default 0): 0=floating-point, 1=integral number type

" A Lua 5.1 binary chunk header is always 12 bytes in size. Since the characteristics of a Lua virtual machine is hard-coded, the Lua undump code checks all 12 of the header bytes to determine whether the binary chunk is fit for consumption or not. All 12 header bytes of the binary chunk must exactly match the header bytes of the platform, otherwise Lua 5.1 will refuse to load the chunk. The header is also not affecte d by endianness; the same code can be used to load the main header of little-endian or big-endia n binary chunks. The data type of lua_Number is determined by the size of lua_Number byte and the integral flag together. In theory, a Lua binary chunk is portable; in real life, th ere is no need for the undump code to support such a feature. ...

Anyway, most of the time there is little need to ser iously scrutinize the header, because since Lua source code is usually available, a chunk can be readily c ompiled into the native binary chunk format "
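
a minimal sketch (Python; field layout as described in the quote above; illustrative only, not a real loader) of decoding that 12-byte Lua 5.1 header:

    import struct

    def check_lua51_header(chunk):
        if len(chunk) < 12 or chunk[:4] != b"\x1bLua":
            raise ValueError("not a Lua binary chunk")
        (version, fmt, endianness, size_int, size_size_t,
         size_instr, size_number, integral) = struct.unpack("8B", chunk[4:12])
        if version != 0x51:
            raise ValueError("not a Lua 5.1 chunk")
        # Lua itself just requires all 12 bytes to match the host exactly;
        # here we only decode them
        return {
            "little_endian": bool(endianness),
            "sizeof_int": size_int,
            "sizeof_size_t": size_size_t,
            "sizeof_Instruction": size_instr,
            "sizeof_lua_Number": size_number,
            "integral_numbers": bool(integral),
        }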

--

in Lua, the binary function header chunks are of variable size only due to the 'source name' field, which, for the 'top-level' function of each file, contains the file name, i guess for debugging. maybe in oot it would be better to have these be of fixed size, and to standardize the length of 'source name', and to remove it for stripped files. otoh if you do that then it becomes difficult to have that scripting language magic function which tells a bit of code which filepath it is in.

in binary chunks, Lua strings have a length preceding them (of size size_t, which on x86 is 32 bits), AND a null byte at the end. i like that.

in binary chunks, Lua lists have a length preceding them, of size size_t, which on x86 is 32 bits. The instructions are themselves in such a list. So can't have more than 4gi (4 gigs) of instructions in one function.
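
a rough sketch of that string encoding (assuming a 32-bit little-endian size_t as on x86; the length prefix counts the trailing NUL, which is how Lua 5.1's dump code does it):

    import struct

    def dump_string(s):
        # size_t length (including the NUL), then the bytes, then a NUL byte
        return struct.pack("<I", len(s) + 1) + s + b"\x00"

    def load_string(buf, offset=0):
        (n,) = struct.unpack_from("<I", buf, offset)
        start = offset + 4
        return buf[start:start + n - 1], start + n   # drop the trailing NUL

    blob = dump_string(b"print")
    print(load_string(blob))    # (b'print', 10)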

--

https://github.com/akeep/nanopass-framework

--

http://www.erlang-factory.com/upload/presentations/708/HitchhikersTouroftheBEAM.pdf

---

hlvm and vmkit and llvm:

http://www.archivum.info/comp.lang.functional/2009-01/00041/Re-HLVM.html

---

http://blog.omega-prime.co.uk/?p=135

---

http://akkartik.name/post/wart-layers has the idea of a program built out of (an ordered sequence of) layers which each include their own tests; (the sum of a prefix of) layers are like versions, and the result of stacking any prefix of layers is a program which passes its own tests.

(i guess these layers are not quite like chronological versions because you can make changes to an underlying layer after you have added a new layer on top of it)

---

llvm tail calls literally just reuse the stack frame, and so require that caller and callee have identical signatures (prototypes)

llvm cannot do tail calls if 'invoke' is used (which is 'call' but with exception handling) b/c it can't bubble up exceptions thru discarded stack frames

this suggests that llvm-style 'call' without delimited continuations etc. is a useful special case that could be identified in annotations of the Oot AST, because it would allow lower levels of the compiler to optimize it to platform primitives (such as LLVM's call, and many assembly languages' CALL/RET, which GHC Haskell apparently doesn't use because it needs to return to a point after an 'info table' rather than to the instruction immediately after the CALL)


there is or used to be some sort of problem in Haskell with exceptions and the type system that can cause a compiler crash:

http://existentialtype.wordpress.com/2012/08/14/haskell-is-exceptionally-unsafe/

---

consider using SEAM:

https://synrc.com/publications/cat/Functional%20Languages/AliceML/christian_mueller_diplom.pdf chapter 2

---

"

For implementing Python, you have at least five options:

If you're willing to restrict the language, it's much easier. RPython was written only to help build PyPy, but the concept could be extended to allow most of Python. Both Shed Skin and RPython insist that type inference succeed at disambiguating types. If you're willing to accept using an "any" type when type inference fails, you can handle more of the language.

The big boat-anchor feature of Python is "setattr", combined with the ability to examine and change most of the program and its objects. This isn't just reflection; it's insinuation; you can get at things which should be private to other objects or threads and mess with them. By string name, no less. This invalidates almost all compiler optimizations. It's not a particularly useful feature. It just happens to be easy to implement given CPython's internal dictionary-based model. If "setattr" were limited to a class of objects derived from a "dynamic object" class, most of the real use cases (like HTML/XML parsers where the tags appear as Python attributes) would still work, while the rest of the code could be compiled with hard offsets for object fields.

The other big problem with Python is that its threading model is no better than C's. Everything is implicitly shared. To make this work, the infamous Global Interpreter Lock is needed. Some implementations split this into many locks, but because the language has no idea of which thread owns what, there's lots of unnecessary locking.

Python is a very pleasant language in which to program. If it ever gets rid of the less useful parts of the "extremely dynamic semantics", sometimes called the Guido von Rossum Memorial Boat Anchor, it could be much more widely useful.

reply

halflings 20 hours ago


> The big boat-anchor feature of Python is "setattr", combined with the ability to examine and change most of the program and its objects. [...] If "setattr" were limited to a class of objects derived from a "dynamic object" class, most of the real use cases [...] would still work, while the rest of the code could be compiled with hard offsets for object fields.

Isn't __slots__ made for that specific usecase? (when you want to optimize your code by specifying specific attribute names)

I think the "dynamic" behavior is a sane default (since most people don't need that optimization).

reply

comex 19 hours ago


setattr and similar dynamic features do make Python harder to optimize, but they're not that different from what you see in JavaScript. JavaScript has had an incredible amount of work spent optimizing it, but the result is a bunch of pretty damn fast language JITs that implement the full language, usually only slowing down in cases where actual use of those features requires it. Is it really that hard to come up with something like that for Python?

(Threading is a separate issue, though.)

reply
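
(a small Python illustration of the __slots__ point above: slotted instances get fixed storage for the declared attributes and arbitrary setattr fails, which is roughly the "hard offsets by default, dynamic objects opt-in" split the first commenter asks for)

    class Point:
        __slots__ = ("x", "y")        # fixed per-instance storage, no __dict__
        def __init__(self, x, y):
            self.x, self.y = x, y

    p = Point(1, 2)
    p.x = 10                          # fine: a declared slot
    try:
        setattr(p, "color", "red")    # no __dict__, so this raises AttributeError
    except AttributeError as e:
        print(e)

    class DynamicPoint(Point):
        pass                          # no __slots__ here, so instances get a __dict__ again

    q = DynamicPoint(1, 2)
    q.color = "red"                   # works: this subclass is dynamic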

---

http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup

(i wrote that question)

as of this writing, the only answer is:

"Python has a large standard library but only 40 modules (CPython 3.4 REPL on Windows) of it are loaded upon startup (the minimum to set up everything and start running the user program), and a number of those are built-in, i.e. C code compiled into the main binary, which saves on file I/O. In addition, startup time has long been a concern of several core developers, and consequently has been optimized quite a bit."

that makes sense. Another hypothesis i have is:

The parts of Java's stdlib that it needs upon startup are in the rt.jar file. There might be special optimizations applied to this, e.g. class data sharing, which might speed up the Java 'hello world' benchmarks (on Windows and OSX, where it is apparently default: https://github.com/jruby/jruby/wiki/Improving-startup-time ). However, when loading e.g. Clojure, JRuby, Kawa, Jython or Scala's stdlib, these optimizations are not available. So, perhaps class file loading and initialization is slow, but this slowness is not seen in the Java rt.jar stdlib. Why is class file loading slow? Well, as http://stackoverflow.com/questions/844143/why-is-the-jvm-slow-to-start#comment652767_844174 and https://en.wikipedia.org/wiki/Java_performance#cite_note-jrecache-27 say, first, having to load a lot of different classfiles, from archives no less, is probably suboptimal. Second, as http://stackoverflow.com/questions/15914094/can-any-clojure-implementation-start-fast#comment27116634_15915993 says " "One limitation of the JVM is that objects must be copied on initialization, "You can't embed any composite constants in byte code. Not even arrays.""; initialization is slow if you have to copy objects upon initialization. Even without special treatment, presumably dynamic languages like Clojure have even more initialization to do than Java, anyways.

So, i guess the design choices here are:

(1) neglecting to provide a way to embed composite constants in bytecode (a choice which afaict has no benefit except implementation simplicity)

(2) having an entire dynamic language's initialization and core library code written purely in bytecode, as opposed to having crucial parts of it closer-to-the-metal, e.g. the C parts of Python

my suggested remedies are:

1) design bytecode so that composite constants can be embedded -- note that if the bytecode is a standardized interface, this implies making a serialization or 'pickle' format part of the standard

2) provide a way for the constant parts of external libraries to be loaded into memory via loading them once 'the hard way', saving an image of that part of memory, and then mmapping that image. I'm not sure if this should be done at installation time (on the target machine), or at link time (so that these images would be distributed as binaries); I'm guessing installation time. Note that an external library might also want to do some sort of varying initialization procedure each time it starts up (check the machine's IP address, or the current working directory, or the time, or something like that), so it should still get a crack at that after loading and mmapping the stored memory image. I'm not sure how easy it is to image out and mmap in just part of a process's memory on various target architectures; would the language implementation have to malloc a separate block of memory for each library? If so, is this a problem?

3) make sure it is easy for external libraries to use native code, so that something like Clojure written on top of your language can, if it chooses, provide a fast native implementation

(i just added paraphrases of these to whyOot)
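
(a small illustration of remedy (1), for reference: CPython already embeds some composite constants -- a folded tuple lands in co_consts as a single prebuilt object, and the marshal serialization format used for .pyc files is how it gets shipped)

    import marshal

    code = compile("POINTS = ((1, 2), (3, 4), (5, 6))", "<demo>", "exec")
    print(code.co_consts)        # the nested tuple shows up as one prebuilt constant
    blob = marshal.dumps(code)   # roughly what gets written into a .pyc
    ns = {}
    exec(marshal.loads(blob), ns)
    print(ns["POINTS"])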

---

here's my current guess as to the answer to my question:

http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup/28379292

---

" User interface

Swing has been perceived as slower than native widget toolkits, because it delegates the rendering of widgets to the pure Java 2D API. However, benchmarks comparing the performance of Swing versus the Standard Widget Toolkit, which delegates the rendering to the native GUI libraries of the operating system, show no clear winner, and the results greatly depend on the context and the environments.[70]

In most cases Java suffers greatly from its need to copy image data from one place in memory to another before rendering it to the screen. C++ can usually avoid this large overhead by accessing memory directly. The developers of Java have attempted to overcome this limitation with certain so-called "unsafe" direct memory access classes. However, those attempts fall far short of what C++ natively offers. For example, two major Java OpenGL implementations suffer tremendously from this data duplication problem which is difficult, if not impossible, to avoid with Java.[citation needed] " -- https://en.wikipedia.org/wiki/Java_performance

---

some stuff i took out of http://stackoverflow.com/questions/28118218/rationale-for-design-choices-causing-jvm-clr-languages-to-have-long-startup before posting it:

If this were the case, why don't JVM implementors simply create a serialization/snapshot of the in-memory representation of the read-in standard library files, and distribute this as part of the JVM executable, rather than loading separate classfiles? ([17][17], [18][18], [19][19]). In fact, maybe they did; this sounds like the JVM's Class Data Sharing ([20][20], [21][21]). But apparently this mechanism only applies to the classes in the Java standard library, and cannot be used to speed up, for example, the Clojure standard library ([27][27]).

[17]: http://stackoverflow.com/questions/844143/why-is-the-jvm-slow-to-start#comment3973994_844174
[18]: http://nicholaskariniemi.github.io/2014/02/25/clojure-bootstrapping.html
[19]: http://nicholaskariniemi.github.io/2014/03/19/solving-clojure-boot-time.html#comment-1715285215
[20]: https://en.wikipedia.org/wiki/Java_performance#Class_data_sharing
[27]: http://nicholaskariniemi.github.io/2014/03/19/solving-clojure-boot-time.html#comment-1715285215

more on that soln: https://news.ycombinator.com/item?id=4222977 http://stackoverflow.com/questions/4692076/speed-up-application-start-by-adding-own-application-classes-to-classes-jsa

more startup time benchmarks: https://news.ycombinator.com/item?id=4222900 https://news.ycombinator.com/item?id=2698579

also, benchmarks from my system:

$ time python -c 'print "hello world"'
hello world

real    0m0.030s
user    0m0.020s
sys     0m0.011s

$ time bash -c 'echo hello world'
hello world

real    0m0.007s
user    0m0.001s
sys     0m0.005s

$ time perl -e 'print "hello world"'
hello world

real    0m0.007s
user    0m0.000s
sys     0m0.009s

---

http://www.joelonsoftware.com/articles/fog0000000319.html suggests always mallocing in roundup-to-the-neareest-pwr-of-2 amounts, and always reallocing to 2x what the previous size was.
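
a toy sketch of that policy (names invented): keep capacity a power of two and double it on overflow, so a run of n appends does O(n) total copying instead of O(n^2):

    def round_up_pow2(n):
        p = 1
        while p < n:
            p *= 2
        return p

    class GrowableBuffer:
        def __init__(self):
            self.capacity = 0
            self.length = 0
            self.data = []                     # stands in for a malloc'd block

        def append(self, item):
            if self.length == self.capacity:
                # "realloc" to twice the previous size (at least 1)
                self.capacity = round_up_pow2(max(1, self.capacity * 2))
                self.data = self.data + [None] * (self.capacity - len(self.data))
            self.data[self.length] = item
            self.length += 1

    buf = GrowableBuffer()
    for i in range(10):
        buf.append(i)
    print(buf.capacity, buf.data[:buf.length])   # 16 [0, 1, ..., 9]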

---

static linking and Go: https://news.ycombinator.com/item?id=4195176

---

sometimes if an associative array is small, it may be more efficient to implement it as a list:

https://news.ycombinator.com/item?id=9162458
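
a tiny sketch of the idea: for a handful of keys, a linear scan over a list of pairs avoids hashing entirely and keeps the data in one small contiguous chunk (where the crossover to a real hash table sits is small and machine-dependent):

    class SmallMap:
        """Association-list map; plausible only while len(self.pairs) is small."""
        def __init__(self):
            self.pairs = []                  # list of (key, value) tuples

        def __setitem__(self, key, value):
            for i, (k, _) in enumerate(self.pairs):
                if k == key:
                    self.pairs[i] = (key, value)
                    return
            self.pairs.append((key, value))

        def __getitem__(self, key):
            for k, v in self.pairs:
                if k == key:
                    return v
            raise KeyError(key)

    m = SmallMap()
    m["x"] = 1
    m["y"] = 2
    print(m["x"], m["y"])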

---

kentonv 3 days ago

Ruby (or maybe specifically Gems) startup is still quadratic in another -- IMO worse -- way: Every require() (that misses cache) searches for the requested file in _every_ dependency package until found. It seems that instead of having the file name itself include the precise package name, Ruby instead treats the packages as a search path that unions a bunch of separate trees into a single logical file tree.

The result is that Ruby does O(n^2) stat() calls at startup, where n is the number of dependency packages. Most of these operations are for files that don't exist, which means they'll miss the filesystem cache and perform absolutely atrociously on top of FUSE, NFS, etc. A typical Ruby app sitting on a loopback FUSE device (i.e. one just mirroring the local filesystem, no network involved) will take minutes to start up, even with aggressive caching enabled. :(

reply
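
a toy model of that lookup pattern (paths invented): each require probes every package directory in turn, so with n packages and roughly n requires you get about n^2 filesystem probes, almost all of them misses:

    import os

    search_path = ["/gems/pkg%d/lib" % i for i in range(100)]   # 100 dependency packages
    stat_calls = 0

    def require(name):
        global stat_calls
        for d in search_path:
            stat_calls += 1                   # one stat()/open() attempt per directory
            if os.path.exists(os.path.join(d, name + ".rb")):
                return

    for i in range(100):                      # roughly one require per package at startup
        require("pkg%d/entry" % i)
    print(stat_calls)                         # ~10000 probes before the app even runs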

---

didip 8 hours ago

Question for Rust peeps out there:

Is there a documentation on how to build single binary executable for Rust?

reply

steveklabnik 8 hours ago

I'm assuming you're referring to static linking.

Right now, we depend on glibc, and it cannot be statically linked. By default, Rust applications statically link everything but glibc already.

At least one core team member is strongly advocating for musl support for truly standalone binaries in the 1.1 timeframe, but nothing is likely to change for 1.0, and it's too early to decide on 1.1 features yet :)

reply

xyzzy_plugh 8 hours ago

  > Right now, we depend on glibc, and it cannot be statically linked.

Could you expand on this? Why can it not be statically linked?

reply

spystath 8 hours ago

Name resolution libraries are loaded dynamically [1,2], so glibc can't be linked statically if you want NSS to work. I gather there might be a way to explicitly link to a specific resolution library but this is practically unsupported/poorly documented.

[1]: https://sourceware.org/glibc/wiki/FAQ#Even_statically_linked...

[2]: http://stackoverflow.com/questions/3430400/linux-static-link...

reply

xyzzy_plugh 8 hours ago

I understand this limitation. I can currently get around this by using --enable-static-nss.

Does rust currently expose any way to do this? Could I perform linking myself as a workaround?

reply

steveklabnik 8 hours ago

> Could I perform linking myself as a workaround?

You can but it requires lots of black magic, and certainly a nightly rather than beta Rust ;)

http://mainisusuallyafunction.blogspot.com/2015/01/151-byte-... is an old blog post that can get you started.

reply

---

good read:

Python without an operating system http://lwn.net/Articles/641244/

---

here's various options that Nim has that let you override various things to make your binary size super small (possibly useful for embedded systems):

http://hookrace.net/blog/nim-binary-size/

(possibly related: http://www.muppetlabs.com/~breadbox/software/tiny/teensy.html )

oot should probably support many of these options

---

this comment suggests that the Erlang VM is better at latency than even Go: https://news.ycombinator.com/item?id=9670133

---

smegel 1 day ago

I think this would have been much better titled "Boosting NGINX Performance 9x with Asynchronous wrappers around blocking system calls".

Most people when hearing about "thread pools" in the context of a web-server think about using multiple threads for handling separate requests, which is NOT what this is about. It is using threads to enable some blocking syscalls (read and sendfile) to run asynchronously from the main event loop.

There's already a library for that! http://software.schmorp.de/pkg/libeio.html

reply

misframer 1 day ago

Similarly, there's libuv [0], which is used for Node.js and others.

[0] https://github.com/libuv/libuv

reply

rgbrenner 1 day ago

libuv is for C projects that miss the joy of javascript callback hell.

reply

anacrolix 1 day ago

Or don't want to have 10000 threads. I have a Go server that regularly bumps 140k goroutines. Try that shit with native threads.

reply

rgbrenner 1 day ago

libuv isn't unique.. It's equivalent to libev + libeio.. in fact, that's what nodejs used before writing libuv. Whether or not it's faster than those is really case-by-case.. but what you'll definitely get with libuv is callbacks everywhere.

reply

jimjag 20 hours ago

As well as an Apache httpd MPM: Event

reply

---

the Oot equivalent of this should work in the REPL:

raise Exception(u'Error when printing ü')

(works in Python 3, not Python 2)

---

https://news.ycombinator.com/item?id=9983282

bitL 7 hours ago

parent flag

Imagine immutable dict. How many programs in Python you know would still work?

tikhonj 1 hour ago

parent flag

Modern radix trees[1] can be as fast as the hash maps you'd find in a language's standard library while providing additional capabilities (ie fast ordered traversal). Making them immutable and persistent[2] is easy and only adds a bit of performance overhead.

Obviously programs that rely on communicating by mutating parts of a dictionary would have to be rewritten to use an immutable, persistent map—but they could still maintain similar performance.

[1]: Namely "adaptive radix trees" as described in “The adaptive radix tree: Artful indexing for main-memory databases” by V. Leis, A. Kemper, and T. Neumann <http://www3.informatik.tu-muenchen.de/~leis/papers/ART.pdf>

[2]: A persistent adaptive radix tree in Java: https://github.com/ankurdave/part

bitL 1 hour ago

Thanks for the radix tree reference!

On the other hand, stating "they could still maintain similar performance" for immutable dictionaries is misleading at best (you are right for dictionary sizes approaching 0). Anyone can look up benchmarks on the Internet showing real-world performance of immutable data structures and make their mind up if they actually can deliver in their particular use cases.

Mutability could be always used to boost performance on Von Neumann architecture if you know what you are doing. It's just becoming impractical due to increased complexity, pervasive security paranoia and the fact that there are very few people that can get things done right. If you want to contribute to advance your case, please create a competing computer architecture as fast as Von Neumann's, physically possible, but based on immutability.

reply

semi-extrinsic 2 minutes ago

> Mutability could be always used to boost performance (...)

For the extreme case, see the EQUIVALENCE statement in old Fortran.

reply

---

this Redis blog post describes how they make the memory-freeing command async (spawn a background thread to actually free the stuff asynchronously; he calls this 'lazy'):

http://antirez.com/news/93

summary relevant to here:

my takehome:

---

would Oot rather be frugal with memory than fast?

(summary answer: i dunno)

arguments for memory at the expense of speed:

---

may or may not be useful:

http://stackoverflow.com/questions/18459633/how-erlang-processes-access-mailbox-concurrently

---

another regex library:

https://github.com/google/re2

an interesting data structure library:

http://blog.burntsushi.net/transducers/

this data structure is good for efficiently storing and querying (including regexps and finding Levenshtein distance and doing set operations) an ordered set or ordered map when many of the keys share prefixes or suffixes (such as is the case for natural-language words; Lucene uses this for indexing)

---

firefox memory profiling:

" The Memory tool works by taking snapshots of everything in memory, and presenting them as a tree/table with various grouping settings. By default, the contents are grouped by “coarse type,” where each thing in memory falls into one of four classifications:

    Objects: JavaScript objects. Further grouped by each object’s internal [[Class]] name.
    Scripts: The JavaScript source text loaded by the web application and its resulting executable machine code produced by SpiderMonkey’s JIT compiler, IonMonkey.
    Strings: JavaScript strings used by the web application.
    Other: Miscellaneous structures that do not fit in the above categories.

You can also group the snapshot by “object class,” which groups by their JavaScript [[Object]] class, or by “internal type,” which groups things by their C++ type names. This latter view is mostly useful for Firefox platform developers.

Perhaps most interesting is the fourth and final grouping option: “allocation stack.” You have to turn this option on manually by ticking the “record allocation stacks” checkbox at the top of the Memory panel, since tracking allocations can degrade the application’s performance while the box is checked. The payoff, however, is worth it: this view groups the things in the heap by the source location in your JavaScript code. "

---

BenoitP 2 days ago

The Go runtime is really starting to look sexy. 20 ms GC pauses on 200+ GB heaps!

I remember a thread discussing a pauseless GC on the dev mailing-list; where Gil Tene, of Azul C4's fame, commented on having a clear policy on safepoints from the beginning was paramount to having a good GC. It looked like the community is very biased towards doing the fundamental things well, and attracting the right people.

And on top of that we're get BLAS and LAPACK bindings from https://github.com/gonum

reply

---

 mwcampbell 3 hours ago

A large number of dependencies is only a problem in environments that aren't amenable to per-function static linking or tree-shaking. These include dynamically typed languages like Python, Ruby, and JavaScript (except when using the Google Closure Compiler in advanced mode), but also platforms like the JVM and .NET when reflection is allowed. Where static linking or tree-shaking is feasible, the run-time impact of bringing in a large library but only using a small part of it is no more than the impact of rewriting the small part you actually use.

Edit: Dart is an interesting case. It has dynamic typing, but it's static enough that tree-shaking is feasible. Seth Ladd's blog post about why tree-shaking is necessary [1] makes the same point that I'm making here.

[1]: http://blog.sethladd.com/2013/01/minification-is-not-enough-you-need.html

reply

nostrademons 2 hours ago

Not always. Dependencies were a huge problem at Google, even in C++ (perhaps especially in C++), because they mean that the linker has to do a lot of extra work. And unlike compiling, linking can't be parallelized, since it has to produce one binary. At one point the webserver for Google Search grew big enough that the linker started running out of RAM on build machines, and then we had a big problem.

There's still no substitute for good code hygiene and knowing exactly what you're using and what additional bloat you're bringing in when you add a library.

reply

---

post on Perl6 implementation noting that Parrot has a problem b/c:

http://www.modernperlbooks.com/mt/2011/05/when-you-lack-cheap-and-easy-polymorphism.html

---

wow look at everywhere Lua JIT works:

http://luajit.org/luajit.html

" Compatibility Windows Linux BSD OSX POSIX Embedded Android iOS PS3 PS4 PS Vita Xbox 360 GCC CLANG LLVM MSVC x86 x64 ARM PPC e500 MIPS Lua 5.1 API+ABI + JIT + BitOp? + FFI Drop-in DLL/.so Overview 3x

---

"

    Fast iteration: “Save my file, hit reload in the browser, see what went wrong, fix it.”

With Hack, this short cycle time had to be preserved.

Implications are huge:

    Type checking algorithms must be fast and mostly incremental
    A persistent daemon maintains type system state
    Filesystem change notifications lead to “instant” type checking
    Typical typechecking time is 200ms (outliers up to 1000ms) for millions of lines of code

" -- http://www.scs.stanford.edu/16wi-cs240h/slides/fb-slides.html#%2843%29

---

" firebones 1 hour ago

Someone posted a great question here that they later deleted. My full response also wasn't submitted (and lost), so I'm trying to recreate part of it here. It was actually a good question about whether Rust's efficiency has been tested.

I've done some testing; Java is faster in terms of the sip test for a lot of stuff, especially for cases where you micro-benchmark single processes without regard to overall CPU or memory usage. However, when you scale up (to the point where JVM tuning comes into play), the raw performance in terms of response time is comparable, while the overall system utilization (e.g., "time" measurements of user/sys) is much better for Rust--Java uses a lot more system time due to JVM housekeeping threads. So more efficient is not a win for Java, even though wall-clock benchmark time might look better for a single process (as long as you don't look at overall system utilization.)

Java's performance comes at the price of more memory usage and higher variance in response time. You can attempt to constrain one or the other, but you can't constrain both at the same time. Microbenchmarks typically put a high enough ceiling on memory that they're relevant for very constrained loads. To be robust beyond the microbenchmark, incremental GC is required, which levels the playing field between Java GC and Rust's jemalloc. Still, even when you attain parity, the Java solution is using far more memory due to the JVM's OOP overhead.

Another example of a hidden "cost" of Rust comes from things like Unicode support. Recently I tested a regex-heavy algorithm that used burntsushi's regex library for Rust. In initial testing, Java blew it away. What I later realized was that the default Java implementation I used did not support Unicode characters. When I enabled that support, and enabled incremental GC (to support the scale of testing I was performing), the performance was similar.

This is another example of Rust being mischaracterized up front. Older languages took short cuts, and fare well for the low end of testing (the microbenchmark). Rust tends to look forward and to the bigger picture. "

---

 [https://github.com/mame/quine-relay One Hundred Language Quine Relay]

" Assuming that runs, my next questions are all common apps/programs that we knew to be especially thorny for one reason or another: Go, strace, tcpdump, systemd, etc.

...I've never gotten a straight answer as to why Go makes system calls directly instead of doing what every other program on the planet does and calling into the system libraries. Certainly, it made the ports to non-Linux systems absolutely brutal: systems are required to make themselves look like Linux for Go to function correctly.

---

tonyarkles 3 days ago

Some experiments I did during my coursework for my M.Sc. indicated that it was actually worse-than-useless at the time (2009?) Here's what would happen:

Maybe it's gotten better since then? I haven't checked recently.

Edit: I wish I had the results handy. The basic conclusion was that you got something like a 1.5x slowdown per additional CPU core. That's not how it's supposed to work! Using taskset to limit a multi-threaded Python process to a single core resulted in significant speedups in the use cases I tried.

reply

nitely 3 days ago

This is a known issue in py2. On py2 when running in a multi-core machine it'll run ~1.8x slower (depending on what you are doing) than it'll run in a single-core machine. Python 3.2 ships a new GIL[0] fixing the problem.

[0] http://www.dabeaz.com/python/NewGIL.pdf

reply

---

saboot 3 days ago

I never quite grasped the actual machinations of Python until I watched Philip Guo's lectures on Python internals.

https://www.youtube.com/watch?v=LhadeL7_EIU

It's a bit long, and definitely over several sittings, but I feel like I really understand Python better and relevant to the post, the complexities and tracking (frames, exceptions, objects, types, stacks, references) that occur behind the curtain which drive Python's slow native performance.

reply

---

... I think a lot of people would be shocked if they sat down and really worked out just how little Python/Perl/etc. code they can run before the webserver noticeably starts slowing down at even small scales.

reply

stickfigure 3 days ago

Agreed.

I spent a little time working with Guido on cache design for the App Engine data access api (ndb). In Java-land, it's a big win to cache datastore entities in memcache because fetch latency is about an order of magnitude faster. In Python-land, caching is a mixed bag - real-world profiling found marginal benefit even with 100% hit rate, and a significant performance disadvantage if the hit rate drops off. This is primarily attributed to pickling overhead.

---

nostrademons 11 hours ago [-]

 There's actually a fairly long history of cross-language VMs, with various degrees of success. What usually happens is that they work fine for languages that look, semantically, basically like the native language on the VM. So LLVM works well as long as your language is mostly like C (C, C++, Objective-C, Rust, Swift). Parrot works if you language is mostly like Perl 6 (Perl, Python, PHP, Ruby). .NET works if your language is mostly like C# (C#, F#, Boo, IronPython). The JVM works if your language is mostly like Java (Java, Clojure, Scala, Kotlin). Even PyPy has frontends for Ruby, PHP, Smalltalk, and Prolog.

"Mostly" in this case means semantically, not syntactically. It's things like concurrency model; floating point behavior; memory layout; FFI; semantics of basic libraries; performance characteristics; and level of dynamism. You can write a Python frontend for both the JVM and CLR, and it will look like Python but act like Java and .NET, respectively, with no guarantee that native CPython libraries will work on it.

---

"Regarding the implementation ((of Goroutines)), I think they're quite similar to the (unfortunately not too well-known) "State Threads" library, just quite lower-level (as Go doesn't rely on libc or things like this and talks directly to the OS kernel) — you could read the introductory paper for the ST library where the concept is quite well explained."

http://state-threads.sourceforge.net/

"Libmill is a library that introduces Go-style concurrency to C." http://libmill.org/

---

Olin Shivers:

"ANF has been the fashionable representation for about ten years now; CPS is out of favor. ...However, just to tell you where I am on this issue, I think the whole movement from CPS to ANF is a bad idea (though Sabry & Felleisen's technical observations and math are as rock solid as one would expect from people of their caliber)."

---

David "Kelsey subsequently spent time as a prof at Northeastern, then left for NEC's prestige lab in Princeton, where he worked on the Kali distributed system. He also got set down precisely on paper something all the CPS people knew at some intuitive level: that the whole "SSA" revolution in the classical compiler community was essentially a rediscovery of CPS. (Harrumph.) " (Olin Shivers, [2])

---

some stack alternatives that i don't understand:

" In what I'll call 'traditional TCO', activation records are allocated on a stack and under certain conditions (tail-calls) one will reuse the activation record. Thus the stack doesn't actually grow (except perhaps if the new activation record is larger than the prior one). As an alternative to the stack the activation records are created on the heap. For performance, this is usually in a clean nursery-region of the heap where one can increment a pointer to perform an allocation. One typically compiles into continuation-passing style, and each continuation therefore serves as a thread. A copying garbage collector ensures there is always clean space for a new nursery. For lack of a better name, I'll call this the 'generic no-stack' design. From the generic no-stack design, one might express a 'generic TCO' which simply reuses the same activation-record - no matter where it is allocated (i.e. on the stack or the heap). However, with CPS this isn't very useful (excepting to avoid a simple allocation) since the continuation is no longer implicit to the stack. In Chicken Scheme, the 'no-stack' approach is adapted to work with C as a target language. The data uses automatic allocation on the C stack, which will grow with each function call... but at a certain size (default 64 kB, IIRC, checked via a pointer comparison) a small copying-GC moves the important items from the C stack to the heap, then a longjump resets the stack with the new continuation. The C stack serves as a nursery, and the C language also implements the various compiled functions. Another variation on the no-stack design is Stallman's 'Phantom Stacks'. Unlike the generic (CPS) variation, one will occasionally return to an older activation record before continuing the computation, thus old activation records represent both history and futures - as per the traditional 'stack'. If a dangerous reference is created (a pointer from heap to stack, or a pointer from earlier in stack to later in stack as per upfunarg) a copying collection is performed immediately and a new stack is created. ... IIRC, Baker also produced something similar, using particular arrangements of pointers - critically, with the stack growing away from the heap - to reduce to a simple comparison the detection of dangerous references. " -- [3]

---

Making TCO explicit

If you are a tenant of the "TCO is just an optimisation" camp, then there is no incentive to make it explicit.

On the other hand, if you believe that TCO is necessary for the semantics of the language (as argued in the original post), then it would make a lot of sense to make it explicit. If nothing else, it would allow the compiler/runtime to tell you "that call is marked as tail-call, but there is no way it could be so go and rewrite this code properly".

The recursive call syntax is so convenient though. I would not trade it for goto. By Denis Bredelet -jido at Fri, 2009-12-04 13:07

login or register to post comments

---

so, in Oot, we want to be passing around tables (graphs) where other languages would be passing class instances. But, we want a structural type system to guarantee that the shape of the table matches this or that pattern. This means that we can represent or almost represent structs by a table that is typed by a pattern that exactly defines the shape of the table.

But now what if we have a table with a type that says it has fields x, y, and z, but doesn't prohibit it from having other fields also? Eg most of the time this is some simple thing that only has the given fields, but sometimes some other code extends it with some other data. We'd like the compiler to still be able to represent this in an efficient way for the common case of accessing and storing only the given fields, rather than just going back to a hashtable.

One way would be to represent it as an ordinary struct, with an extra field (call it 'extra' for now). 'Extra' can then point to another struct with another 'extra', or just to a hashtable. Now the object has a statically known fixed size in memory, like a struct, and the common case of accessing the initially defined fields can be accomplished without indirection, like a struct.

This is sort of the implementation equivalent of 'row polymorphism' in type systems:

https://brianmckenna.org/blog/row_polymorphism_isnt_subtyping
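
a rough sketch of that representation (Python, names invented): fixed slots for the statically known fields, plus one 'extra' overflow table that is only allocated when something outside the declared shape is added:

    class XYZRecord:
        __slots__ = ("x", "y", "z", "extra")

        def __init__(self, x, y, z):
            self.x, self.y, self.z = x, y, z
            self.extra = None                 # overflow table, usually absent

        def get(self, name):
            if name in ("x", "y", "z"):
                return getattr(self, name)    # fast path: fixed offset, no hashing
            if self.extra is not None and name in self.extra:
                return self.extra[name]       # slow path: hash lookup
            raise KeyError(name)

        def set(self, name, value):
            if name in ("x", "y", "z"):
                setattr(self, name, value)
            else:
                if self.extra is None:
                    self.extra = {}
                self.extra[name] = value

    r = XYZRecord(1, 2, 3)
    r.set("label", "hello")                   # extension field spills into 'extra'
    print(r.get("x"), r.get("label"))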

---

kxyvr 16 days ago

parent [-]on: Higher-order abstract syntax

If anyone want another concrete example of a tagless interpreter in OCaml, I wrote one on SO awhile back:

https://stackoverflow.com/questions/22676975/simple-lambda-c...

Candidly, I find them a pain to write, but it is faster and integrates with the host language's type checker.

---

"as compiler optimizations exploit increasingly recondite properties of the programming language definition, we find ourselves having to program as if the compiler were our ex-wife’s or ex-husband’s divorce lawyer, lest it introduce security bugs into our kernels, as happened with FreeBSD? a couple of years back with a function erroneously annotated as noreturn, and as is happening now with bounds checks depending on signed overflow behavior."

---

hey, a Clojure implementation with low startup latency!

https://anmonteiro.com/2016/11/the-fastest-clojure-repl-in-the-world/

---

lobster_johnson 2 days ago [-]

While I agree wholeheartedly that Erlang's design is better:

To be fair, you can't get Erlang's semantics without also inventing processes and supervisor trees. And those things would be largely meaningless without immutability.

Erlang's magic comes from how all the puzzle pieces fit together into a whole: For example, with immutability comes the ability to always be able to restart from a known state, and probably greatly simplifies the internal implementation of the per-process heap.

Those things fundamentally change the language; Go couldn't adopt Erlang's semantics without eschewing almost everything that makes it "Go".

Error handling is really the least of it. Attempting to replicate some of Erlang's high-level design principles at a quickly lead to roadblocks. For example, goroutines cannot be killed; this means it's impossible to fully implement supervisor trees in Go. It's possible that the runtime could support this at some point, but it's a big can of worms.

reply

solidsnack9000 2 days ago [-]

Immutability probably does simplify a lot of things; but it isn't actually required to get Erlang's model, since Erlang isolates the state of processes from each other. Each process could have its own isolated mutable state as opposed to an immutable one and it wouldn't change the "let it crash" philosophy.

reply

dragonwriter 2 days ago [-]

>Each process could have its own isolated mutable state

Erlang processes do have their own isolated mutable state.

But it's walled off and accessed through calls rather than variable references; in the non-SMP implementationthis enabled copy-free local (same node) message sends (and it still does, but only for large binaries -- the more general use turned out to be less efficient than copying in SMP because of locking issues related to garbage collection and some other implementation details.)

reply

lobster_johnson 2 days ago [-]

You can't guarantee that isolation without immutabilty, or at least forced copying.

In Go, it's extremely easy to accidentally send a pointer (or a data structure that contains a deeply nested pointer somewhere; this includes embedded maps, slices and channels, which are all reference types) on a channel, which is bound to break mutability guarantees at some point.

reply

solidsnack9000 1 day ago [dupe] [-]

Mutable locals do not necessarily imply pointers...

reply

lobster_johnson 1 day ago [-]

No, but how do you pass "mutable locals" to other processes? That's the whole question.

If you pass a "local" on a channel to a different goroutine, Go can't guarantee to the receiver (which now has it as a "local" variable) that the sender (which still has it as a "local") won't change it:

    d := map[string]int{}
    ch <- d
    d["foo"] = 42  // Oops!

The only way to do this safely is to to write a channel abstraction which uses gob or similar to marshal and unmarshal all data, thus copying it and preventing concurrent mutation, and then always use the channel abstraction. But always copying tends to be terrible for performance. Erlang is able to do zero-copy sends to process mailboxes because the data is guaranteed to be immutable.

Sure, you could instead mandate that senders never hold on to data sent on channels, but can you ensure this 100% across a codebase contributed to by an entire team?

reply

klibertp 1 day ago [-]

> Sure, you could instead mandate that senders never hold on to data sent on channels, but can you ensure this 100% across a codebase contributed to by an entire team?

Isn't that precisely what Rust is all about?

reply

lobster_johnson 1 day ago [-]

Pretty much, yes.

reply

---

https://m.phys.org/news/2017-01-middle-popular-yields-more-efficient-parallel.html

" Cilk adds just two commands to C: "spawn," which initiates a fork, and "sync," which initiates a join. That makes things easy for programmers writing in Cilk but a lot harder for Cilk's developers. ... All previous compilers for fork-join languages are adaptations of serial compilers and add the runtime invocations in the front end, before translating a program into an intermediate representation, and thus before optimization. In their paper, the researchers give an example of what that entails. Seven concise lines of Cilk code, which compute a specified term in the Fibonacci series, require the compiler to add another 17 lines of runtime invocations. The middle end, designed for serial code, has no idea what to make of those extra 17 lines and throws up its hands.

The only alternative to adding the runtime invocations in the front end, however, seemed to be rewriting all the middle-end optimization algorithms to accommodate the fork-join model. And to many—including Leiserson, when his group was designing its first Cilk compilers—that seemed too daunting.

Schardl and Moses's chief insight was that injecting just a little bit of serialism into the fork-join model would make it much more intelligible to existing compilers' optimization algorithms. Where Cilk adds two basic commands to C, the MIT researchers' intermediate representation adds three to a compiler's middle end: detach, reattach, and sync.

The detach command is essentially the equivalent of Cilk's spawn command. But reattach commands specify the order in which the results of parallel tasks must be recombined. That simple adjustment makes fork-join code look enough like serial code that many of a serial compiler's optimization algorithms will work on it without modification, while the rest need only minor alterations. "

https://cpc2016.infor.uva.es/wp-content/uploads/2016/06/CPC2016_paper_12.pdf

---

jimrandomh 14 hours ago [-]

Non-Rust-user here. I tried Rust about a year ago and gave up on it as not ready. Unfortunately, I don't think this roadmap addresses the problems that made me reject it the first time around.

The main problem I have with Rust is that it's not up to the task of dealing with C APIs, particularly POSIX. The issue I got stuck on was https://www.reddit.com/r/rust/comments/47a0s3/dealing_with_v... . There are two possible things Rust could do that would make me give it a second chance: either commit to maintaining bindings for all of POSIX and libc as part of Rust core, or fold rust-bindgen into core and get it into shape.

Creating higher-level libraries like Tokio is nice for some use cases, but right now Rust doesn't have a working safety valve for things C can do that Rust can't. This greatly magnifies problems like lack of bindings for select() (see http://esr.ibiblio.org/?p=7294&cpage=1 ; I too ran into that problem, and lost a few days to it.)

reply

steveklabnik 14 hours ago [-]

Did you try the nix crate? It looks like maybe not. It should have all of that stuff already wrapped for you. For example: https://docs.rs/nix/0.7.0/nix/sys/termios/fn.tcgetattr.html (select is in there too https://docs.rs/nix/0.7.0/nix/sys/select/fn.select.html )

rust-bindgen has been improving a ton, and it is in fact on that roadmap, under

> Integration with other languages, running the gamut from C to JavaScript.

reply

jimrandomh 14 hours ago [-]

Happy to hear bindgen is getting attention; I didn't actually notice that when I looked through the roadmap.

I didn't try the nix crate at the time, but looking at it just now - it doesn't solve the portability issue. It defines struct Termios in https://github.com/nix-rust/nix/blob/master/src/sys/termios.... , with something #ifdef-ish branching on operating system, but not on CPU architecture. On quick inspection, I think it's probably incorrect on x86-32, and this crate is definitely a major liability for portability.

reply

---

more on hash tables:

https://probablydance.com/2017/02/26/i-wrote-the-fastest-hashtable/ https://news.ycombinator.com/item?id=13742865

" One approach I've read about is to keep a metric for the number of probes as it relates to the overall capacity of the table. If a single bucket starts to fill up while the table itself is at relatively low capacity it is assumed that the data structure is under attack and it is rehashed with a DoS? resistant hash function. This allows a fast, but non-resistant, hash function to be used most of the time without incurring a huge penalty in the pathological case." bradleyjg [4] on rehashing: https://news.ycombinator.com/item?id=13743211

---

tmyklebu 7 hours ago [-]

In modern gcc, std::variant doesn't look like a competent replacement for old-fashioned tagged unions.

On x86_64 Linux, it looks like a function with signature `void f(std::variant<int, char>)` expects its (8-byte) argument to be passed by reference, whereas `void f2(tagged_union_of_int_and_char)` passes its argument in rdi.

gcc (-O3) also generates miserable code for calling f(42):

  4004f0:       48 83 ec 18             sub    $0x18,%rsp
  4004f4:       48 89 e7                mov    %rsp,%rdi
  4004f7:       c7 04 24 2a 00 00 00    movl   $0x2a,(%rsp)
  4004fe:       c6 44 24 04 00          movb   $0x0,0x4(%rsp)
  400503:       e8 c8 ff ff ff          callq  4004d0 <f>
  400508:       48 83 c4 18             add    $0x18,%rsp
  40050c:       c3                      retq

as compared with calling f2(42):

  400510:       48 bf 00 00 00 00 2a    movabs $0x2a00000000,%rdi
  400517:       00 00 00
  40051a:       eb c4                   jmp    4004e0 <f2>

std::optional seems to have the same disease; the ABI is different from a plain struct containing a bool and the value and you get worse initialisation code. There's also a build time penalty to using std::optional; on my machine, including the code that uses optionals makes my trivial test program take 320ms to compile and link rather than 50ms.

These look like yet more new C++ features that are worse than what they're trying to replace.

reply

gpderetta 3 hours ago [-]

Unfortunately the abi requires non-trivially copyable/destructible types to be passed by (hidden) reference.

It should be possible to implement std::variant in such a way that it is trivially constructible and destructible as long as all elements​ are, but apparently stdlibc++ doesn't. This is GCC choice and other compilers apparently do differently. Unless GCC changes it before finalising their c++17 implementation, the choice will be unfortunately set in stone for them as they guarantee abi stability of the library.

P0602R0 is a proposal to require trivial copy/move (but strangely not destruct), don't know whether it has been accepted though.

reply

ambrop7 3 hours ago [-]

Right, this is basically what I said in my comment :) Thanks for the proposal reference.

reply

---

[5]

---

" Your best bet is to avoid trying to build your own system out of primitives, and instead use locking unless it really shows up as a hot spot when profiling. (If you think you can be clever and avoid locks, don't. You aren't. That's the general "you" which includes me and everybody else.) You should at minimum use a spin lock, see spinlock(3). And whatever you do, don't try to implement "your own" locks. You will get it wrong.

Ultimately, you need to use whatever locking or atomic operations your operating system provides. Getting these sorts of things exactly right in all cases is extremely difficult. Often it can involve knowledge of things like the errata for specific versions of specific processor. ("Oh, version 2.0 of that processor didn't do the cache-coherency snooping at the right time, it's fixed in version 2.0.1 but on 2.0 you need to insert a NOP.") Just slapping a volatile keyword on a variable in C is almost always insufficient.

On Mac OS X, that means you need to use the functions listed in atomic(3) to perform truly atomic-across-all-CPUs operations on 32-bit, 64-bit, and pointer-sized quantities. (Use the latter for any atomic operations on pointers so you're 32/64-bit compatible automatically.) That goes whether you want to do things like atomic compare-and-swap, increment/decrement, spin locking, or stack/queue management. Fortunately the spinlock(3), atomic(3), and barrier(3) functions should all work correctly on all CPUs that are supported by Mac OS X. " -- [6]

---

zzalpha 18 hours ago [-]

Clojure startup times are brutal, and simply not workable in the mobile environment where Android lives.

reply

retrogradeorbit 16 hours ago [-]

https://github.com/drapanjanas/re-natal

reply

---

" As another example of the clash of paradigms, although Rust supports systems-and-arrays programming, it has renounced C-style for loops (initialize, test, increment) in favor of The Iterator; the documentation goes even further, renouncing indexed accesses altogether in favor of index-less Iteration. Iterators are a great and reusable way to encapsulate your blah-blah-blah-blah-blah, but of course they’re impossible to optimize. The value of .next() remains a kind of holy mystery shrouded behind a non-pure function call, so anytime you have an iterator-style loop, the compiler is going to be left twiddling its thumbs.

As a simple demonstration, here’s an idiomatic Rust program that uses an iterator:

    fn main() {
        let mut sum = 0.0;
        for i in 0..10000000 {
            sum += i as f64;
        }
        println!("Sum: {}", sum);
    }

It runs about five times slower than the equivalent program that uses a while loop:

    fn main() {
        let mut sum = 0.0;
        let mut i = 0;
        while i < 10000000 {
            sum += i as f64;
            i += 1;
        }
        println!("Sum: {}", sum);
    }

The emphasis on iterators causes me extra sadness as Rust has contiguous-memory, natively-typed arrays, which would otherwise be perfect candidates for optimization, and Rust uses LLVM on the backend, which can optimize loops eight ways to next Sunday; but all that is thrown out the window in favor of The Noble Iterator. On the other hand, there’s a SIMD module in the standard library, so maybe they care about CPU design after all? I’m not sure. Rust feels like it’s being pulled between the demands of efficiency (exemplified by its native-everything and choice of LLVM) and the modern definition of beautiful code (which I take to be, “lots of chained function calls”). "

note: this is addressed in [7] and other comments in that thread; it seems that actually, iterators optimize fine in Rust, and the original poster simply didn't try compiling at a high optimization.

---

" Rust has lazy evaluation of lists, so you can do things like create an infinite list, as in the following:

let y = (1..)

…then chain together “iterator adaptors,” like this example from the documentation:

    for i in (1..).step_by(5).take(5) {
        println!("{}", i);
    }

I’ve personally never needed a list larger than the size of the known universe, but to each his own. Sipping from the chalice of infinity may feel like you’re dining beneath the golden shields of Valhalla, but if you’re coming from not-a-Haskell, you might be surprised at what code actually gets executed, and when. But then Haskell programmers won’t necessarily feel at home, either; Rust does not have tail-call optimization, or any facilities for marking functions as pure, so the compiler can’t do the sort of functional optimization that Haskell programmers have come to expect out of Scotland. "

---

more on I/O lists for strings:

http://www.evanmiller.org/elixir-ram-and-the-template-of-doom.html

---

some notes on which system clock to use:

http://pzemtsov.github.io/2017/07/23/the-slow-currenttimemillis.html https://news.ycombinator.com/item?id=14928003 https://news.ycombinator.com/item?id=14929619

---

DiThi 3 hours ago [-]

> implicit returns

In my experience it has only been a problem when porting existing code bases. We only had to be a bit careful putting a return after a for loop at the end of a function so it doesn't build a list unnecessarily (otherwise implicit returns of unused random values don't hurt). For clarity, one of the few rules we have is to use return explicitly when a function is more than a single expression.

---