notes-computer-jasper-jasperCompiler

jasper compiler design principles

also for the interpreter and runtime


details

requirements for the compiler, interpreter, and IDE, and related

If there is a type inference error, the compiler tells you all the places that might be wrong, not just the last point in the deduction of the inconsistency. The IDE can further narrow down the likely culprit by seeing which of these locations have been recently modified.

The compiler should provide everything an IDE needs to do code-completion (e.g. incremental parsing, what sorts of things/what types would be syntactically allowed at this location, what are the identifiers in scope at this location of that type, type inference), and should also provide primitive code-completion (e.g. 'this partial word was typed here, what are valid completions?').
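A minimal sketch of the 'primitive' completion query, assuming the compiler can already hand over the list of identifiers in scope at the cursor. The function name `complete` and the sample identifiers are made up for illustration; nothing here is a decided Jasper API.

```python
from bisect import bisect_left

def complete(prefix, identifiers_in_scope):
    """Return the identifiers that start with `prefix`, in sorted order."""
    names = sorted(set(identifiers_in_scope))
    # All names sharing the prefix sit in one contiguous run of the sorted list.
    start = bisect_left(names, prefix)
    matches = []
    for name in names[start:]:
        if not name.startswith(prefix):
            break
        matches.append(name)
    return matches

# Hypothetical identifiers in scope at the cursor.
scope = ["print", "printf", "parse", "private_key", "map"]
print(complete("pri", scope))  # ['print', 'printf', 'private_key']
```

A real implementation would presumably also filter by the type expected at that location, which is the richer query described above.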

The compiler should be able to partially compile the more advanced language constructs into simpler ones for beginners; like Hypercard, you should be able to use the language with training wheels, specifying your 'difficulty level'. E.g. the compiler can compile down to 'core Jasper' for you. Or, it can compile out whatever weird metaprogramming constructs your coworker added, and then compile them back in when you're done with the code. These operations are bidirectional when possible; when they are not, a built-in standard annotation system makes them so, and also helps IDEs etc. maintain generated code.

the compiler can make a diagram of the call stack or a class hierarchy or what have you

the compiler should handle 'indexing' e.g. 'show me all the callers of Foo'
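To make the 'callers of Foo' query concrete, here is a rough sketch that uses Python's own `ast` module as a stand-in for a Jasper front end. The function `index_callers` and the sample source are hypothetical; it only sees direct calls by plain name, not method calls or calls through aliases.

```python
import ast
from collections import defaultdict

def index_callers(source):
    """Map each called name to the set of functions whose bodies call it."""
    callers = defaultdict(set)
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            # Only direct calls by name, e.g. foo(), are indexed.
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                callers[node.func.id].add(func.name)
    return callers

SOURCE = '''
def foo():
    return 1

def bar():
    return foo() + 1

def baz():
    return bar() + foo()
'''

index = index_callers(SOURCE)
print(sorted(index["foo"]))  # ['bar', 'baz']
```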

the compiler should have a framework to allow modules to check for certain common forms of infinite loops and memory leaks

The intermediate representations in the compiler should be visible and an API provided to hook in new processing stages in between. When possible, the compiler can tell you which location in an upstream representation corresponds to a given location in a downstream one (and vice versa).
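One way the upstream/downstream correspondence could work, as a sketch only: each stage records, for every location it emits, the input location it came from, and the maps are chained. The stage names (`desugar`, `lower`) and the `(representation, line)` location format are assumptions made up for this example.

```python
def trace_upstream(location, stage_maps):
    """Follow a downstream location back through each stage's map.

    `stage_maps` is ordered from the first stage to the last; each map sends
    a location in that stage's output to a location in its input.
    Returns None if some stage has no record for the location.
    """
    for mapping in reversed(stage_maps):
        location = mapping.get(location)
        if location is None:
            return None
    return location

def invert(mapping):
    """Upstream-to-downstream view: one input location may yield many outputs."""
    inverse = {}
    for downstream, upstream in mapping.items():
        inverse.setdefault(upstream, []).append(downstream)
    return inverse

# Hypothetical stage maps, keyed by (representation name, line number).
desugar = {("core", 1): ("source", 1), ("core", 2): ("source", 1), ("core", 3): ("source", 2)}
lower = {("ir", 10): ("core", 2), ("ir", 11): ("core", 3)}

print(trace_upstream(("ir", 10), [desugar, lower]))  # ('source', 1)
print(invert(desugar)[("source", 1)])                # [('core', 1), ('core', 2)]
```

The `None` result covers the 'when possible' caveat: a stage that synthesizes code with no upstream origin simply has no entry.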

the interpreter should be able to tell you at any time where a symbol came from, including the path to the file if it came from a file.

the compiler, interpreter, and IDE should be written in Jasper.

profiling both time and memory (including an explanation of exactly which constructs are in memory, where they were created, and why each one has not yet been deallocated) should be included. debugging should be included, including inspection of thunks.

the system should interoperate with emacs, vi, and Eclipse.

there shall be no Global Interpreter Lock.

pre- and post-tokenization and lexing macros

the interpreter shall have ipython-like features, such as querying to read the docstrings and file source of any function or object

the compiler can switch between optional whitespace layout mode and parenthetical (or curly braces) mode and will reformat your source code for you. it will reformat between various other styles, obviating the need for style wars.

compiler must be fast (unlike Scala's, for which i've seen many complaints)

low latency startup time

jumping from one library to another, to or from C, should not be much more expensive than a C++ virtual method call

bounds checks should be disableable per-block

The compiler should consider adhering to Yegge's Grok API if it is ever published:

" ...static types yield better toolchain support. This is undeniably true today, and I have made it my life's work to ensure that it is not true tomorrow.

I have spent the last four years championing an initiative within Google called the "Grok Project", one that will at some point burst beyond our big walled garden and into your world. The project's sole purpose in life is to bring toolchain feature parity to all languages, all clients, all build systems, and all platforms.

(Some technical details follow; feel free to skip to the next section heading...)

My project is accomplishing this lofty and almost insanely ambitious goal through the (A) normative, language-neutral, cross-language definitions of, and (B) subsequent standardization of, several distinct parts of the toolchain: (I) compiler and interpreter Intermediate Representations and metadata, (II) editor-client-to-server protocols, (III) source code indexing, analysis and query languages, and (IV) fine-grained dependency specifications at the level of build systems, source files, and code symbols.

OK, that's not the whole picture. But it's well over half of it.

Grok is not what you would call a "small" project. I will be working on it for quite some time to come. The project has gone through several distinct lifecycle phases in its four years, from "VC funding" to "acceptance" to "cautious enthusiasm" to "OMG all these internal and even external projects now depend critically on us." Our team has recently doubled in size, from six engineers to twelve. Every year -- every quarter -- we gain momentum, and our code index grows richer.

Grok is not a confidential project. But we have not yet talked openly about it, not much, not yet, because we don't want people to get over-excited prematurely. There is a lot of work and a lot of dogfooding left to do before we can start thinking about the process for opening it up.

For purposes of this essay, I'll assert that at some point in the next decade or so, static types will not be a prerequisite for world-class toolchain support. "

http://bsumm.net/2012/08/11/steve-yegge-and-grok.html

it should be able to tell you in which package any given symbol was defined (so that we don't need Pythonesque 'from p import s'; you can just say 'imp p', because it'll be easy for the reader to find out that s came from p)

should exit with a nonzero code if exiting because of an exception. this should be documented. Consider using the BSD sysexits.h exit codes (a BSD convention rather than part of POSIX): http://stackoverflow.com/questions/6413831/exit-code-standards-in-python

SYSEXITS.H -- Exit status codes for system programs.

This include file attempts to categorize possible error exit statuses for system programs, notably delivermail and the Berkeley network.

Error numbers begin at EX__BASE to reduce the possibility of clashing with other exit statuses that random programs may already return. The meaning of the codes is approximately as follows:

EX_USAGE -- The command was used incorrectly, e.g., with the wrong number of arguments, a bad flag, a bad syntax in a parameter, or whatever.

EX_DATAERR -- The input data was incorrect in some way. This should only be used for user's data & not system files.

EX_NOINPUT -- An input file (not a system file) did not exist or was not readable. This could also include errors like "No message" to a mailer (if it cared to catch it).

EX_NOUSER -- The user specified did not exist. This might be used for mail addresses or remote logins.

EX_NOHOST -- The host specified did not exist. This is used in mail addresses or network requests.

EX_UNAVAILABLE -- A service is unavailable. This can occur if a support program or file does not exist. This can also be used as a catchall message when something you wanted to do doesn't work, but you don't know why.

EX_SOFTWARE -- An internal software error has been detected. This should be limited to non-operating system related errors as possible.

EX_OSERR -- An operating system error has been detected. This is intended to be used for such things as "cannot fork", "cannot create pipe", or the like. It includes things like getuid returning a user that does not exist in the passwd file.

EX_OSFILE -- Some system file (e.g., /etc/passwd, /etc/utmp, etc.) does not exist, cannot be opened, or has some sort of error (e.g., syntax error).

EX_CANTCREAT -- A (user specified) output file cannot be created.

EX_IOERR -- An error occurred while doing I/O on some file.

EX_TEMPFAIL -- temporary failure, indicating something that is not really an error. In sendmail, this means that a mailer (e.g.) could not create a connection, and the request should be reattempted later.

EX_PROTOCOL -- the remote system returned something that was "not possible" during a protocol exchange.

EX_NOPERM -- You did not have sufficient permission to perform the operation. This is not intended for file system problems, which should use NOINPUT or CANTCREAT, but rather for higher level permissions.

http://stackoverflow.com/a/1101969/171761
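A sketch of what 'nonzero exit on exception' could look like using the sysexits.h numbering (EX__BASE is 64 in that header). The `Exit` enum, `exit_code_for`, and `run` are names invented here, and the mapping from exception types to codes is only one plausible choice, not a decided policy.

```python
import sys
from enum import IntEnum

class Exit(IntEnum):
    """Exit statuses with the values sysexits.h assigns them."""
    OK = 0
    USAGE = 64
    DATAERR = 65
    NOINPUT = 66
    NOUSER = 67
    NOHOST = 68
    UNAVAILABLE = 69
    SOFTWARE = 70
    OSERR = 71
    OSFILE = 72
    CANTCREAT = 73
    IOERR = 74
    TEMPFAIL = 75
    PROTOCOL = 76
    NOPERM = 77

def exit_code_for(exc):
    """Pick an exit status for an uncaught exception (hypothetical mapping)."""
    if isinstance(exc, FileNotFoundError):
        return Exit.NOINPUT
    if isinstance(exc, ValueError):
        return Exit.DATAERR
    return Exit.SOFTWARE  # catch-all: treat anything else as an internal error

def run(main):
    """Run `main` and return the status the process should exit with."""
    try:
        main()
    except Exception as exc:
        print(f"error: {exc}", file=sys.stderr)
        return exit_code_for(exc)
    return Exit.OK

# A real entry point would end with: sys.exit(run(main))
```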

option to create a single statically linked binary

search-and-replace with confirmation through the AST, e.g. 'find every instance of a direct reference to function f and ask me if i'd like to replace it with one that gives the new keyword param kp=3', i.e. change f(1,2) to f(1,2,kp=3)
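The f(1,2) to f(1,2,kp=3) example can be sketched with Python's `ast` module standing in for a Jasper AST (needs Python 3.9+ for `ast.unparse`). The class `AddKeyword`, the helper `rewrite`, and the `confirm` callback are illustrative names; in a real tool `confirm` would prompt the user.

```python
import ast

class AddKeyword(ast.NodeTransformer):
    """Add `keyword=value` to direct calls of `func_name`, asking `confirm` first."""

    def __init__(self, func_name, keyword, value, confirm):
        self.func_name = func_name
        self.keyword = keyword
        self.value = value
        self.confirm = confirm

    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first
        is_target = isinstance(node.func, ast.Name) and node.func.id == self.func_name
        already_set = any(kw.arg == self.keyword for kw in node.keywords)
        # `confirm` receives the call as source text and returns True to apply.
        if is_target and not already_set and self.confirm(ast.unparse(node)):
            node.keywords.append(
                ast.keyword(arg=self.keyword, value=ast.Constant(value=self.value)))
        return node

def rewrite(source, func_name, keyword, value, confirm):
    """Return `source` with the keyword added wherever `confirm` agreed."""
    tree = AddKeyword(func_name, keyword, value, confirm).visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))

print(rewrite("y = f(1, 2) + g(f(3, 4))", "f", "kp", 3, confirm=lambda call: True))
# y = f(1, 2, kp=3) + g(f(3, 4, kp=3))
```

Note that `ast.unparse` discards comments and original formatting, so a production version would need a syntax tree that preserves them.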

there is a notion of an 'optimizing program transformation' (optimizing macro), which is one that promises not to change the semantics of the program. They are given to the compiler in a directory. Each one tells the compiler which other optimizing macros MUST come before it (its dependencies) and which ones SHOULD come before it, and likewise which ones MUST come after it and which ones SHOULD come after it.
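A rough sketch of how those ordering declarations could be resolved, using a topological sort from Python's standard `graphlib` (3.9+). The key names (`must_follow`, `should_precede`, etc.) and the pass names are invented for the example. The fallback here is deliberately crude: if the SHOULD requests cannot all be honored, it drops all of them and keeps only the MUST constraints.

```python
from graphlib import CycleError, TopologicalSorter

def _graph(passes, kinds):
    """Build {pass: set of passes that run before it} from the chosen kinds."""
    graph = {name: set() for name in passes}
    for name, spec in passes.items():
        for kind in kinds:
            for other in spec.get(kind + "_follow", []):
                graph[name].add(other)                    # other runs before name
            for other in spec.get(kind + "_precede", []):
                graph.setdefault(other, set()).add(name)  # name runs before other
    return graph

def order_passes(passes):
    """Order passes honoring MUST constraints, and SHOULD ones when consistent."""
    try:
        return list(TopologicalSorter(_graph(passes, ["must", "should"])).static_order())
    except CycleError:
        # SHOULD requests conflict; fall back to MUST constraints only.
        # A cycle among MUST constraints still raises CycleError.
        return list(TopologicalSorter(_graph(passes, ["must"])).static_order())

# Hypothetical optimizing macros and their declared ordering.
passes = {
    "inline": {},
    "const_fold": {"must_follow": ["inline"]},
    "dead_code": {"must_follow": ["const_fold"], "should_precede": ["inline"]},
}
print(order_passes(passes))  # ['inline', 'const_fold', 'dead_code']
```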

writing to a global variable is a type of taint. the compiler can tell you if anything in a file or a namespace possesses this taint.
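As an illustration of the taint query, here is a sketch over Python source (again standing in for Jasper). The function name `functions_writing_globals` is made up. It only detects rebinding of names declared `global`; it does not catch mutation of a global object, nor does it propagate the taint to callers.

```python
import ast

def functions_writing_globals(source):
    """Return {function name: sorted list of global names it assigns}."""
    tainted = {}
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Names the function declares with a `global` statement.
        declared = set()
        for node in ast.walk(func):
            if isinstance(node, ast.Global):
                declared.update(node.names)
        # Of those, the names it actually writes to.
        written = set()
        for node in ast.walk(func):
            is_store = isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
            if is_store and node.id in declared:
                written.add(node.id)
        if written:
            tainted[func.name] = sorted(written)
    return tainted

SOURCE = '''
counter = 0

def bump():
    global counter
    counter += 1

def read():
    return counter
'''

print(functions_writing_globals(SOURCE))  # {'bump': ['counter']}
```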

---

List of syntax / autocomplete libraries to potentially integrate with

---

something like gofmt

---

the compiler should be able, if pointed to any instance of a symbol in a file, to return where that symbol was defined, whether in that file or in another file imported from that one (and in the latter case, to give both the file immediately imported which provided the symbol, and also, by recursively applying this procedure, to give the original definition of the symbol)
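The recursive lookup described above might be sketched like this, assuming the compiler keeps a per-module table of defined names and of where each imported name came from. The table layout and the module names (`app`, `p`, `p_impl`) are hypothetical.

```python
def resolve(modules, module, symbol):
    """Return the chain of modules from `module` to where `symbol` is defined.

    `modules` maps a module name to {'defines': set of names,
    'imports': {name: module it was imported from}}.
    chain[1] (if present) is the immediate provider; chain[-1] is the
    module holding the original definition.
    """
    chain = [module]
    seen = {module}
    while symbol not in modules[module]["defines"]:
        origin = modules[module]["imports"].get(symbol)
        if origin is None or origin in seen:  # unknown symbol or import cycle
            raise LookupError(f"cannot resolve {symbol!r} starting from {chain[0]!r}")
        chain.append(origin)
        seen.add(origin)
        module = origin
    return chain

# Hypothetical modules: app imports s from p, which re-exports it from p_impl.
modules = {
    "app": {"defines": {"main"}, "imports": {"s": "p"}},
    "p": {"defines": set(), "imports": {"s": "p_impl"}},
    "p_impl": {"defines": {"s"}, "imports": {}},
}
print(resolve(modules, "app", "s"))  # ['app', 'p', 'p_impl']
```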

--

compiler should be able to say, at any point, which exceptions might be thrown beneath that point
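A sketch of the 'which exceptions might be thrown beneath this point' query as a walk over the call graph. It assumes the compiler already knows, per function, which exceptions it raises directly and which functions it calls; the tables and names below are made up. It ignores handlers, so it over-approximates: an exception caught partway down is still reported.

```python
def may_raise(function, raises, calls):
    """Return every exception that `function`, or anything it calls, may raise."""
    result = set()
    visited = set()
    stack = [function]
    while stack:
        current = stack.pop()
        if current in visited:  # guards against recursion in the call graph
            continue
        visited.add(current)
        result |= raises.get(current, set())
        stack.extend(calls.get(current, ()))
    return result

# Hypothetical per-function facts.
raises = {"parse": {"SyntaxError"}, "read_file": {"IOError"}}
calls = {"main": ["load"], "load": ["read_file", "parse"], "parse": ["parse"]}

print(sorted(may_raise("main", raises, calls)))  # ['IOError', 'SyntaxError']
```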

--

list of useful tools from one of the erlang dialyzer guys:

http://www.slideshare.net/konstantinvsorokin/kostis-sagonas-cool-tools-for-modern-erlang-program-developmen

--

semantic versioning, EXCEPT:

since any change in Jasper program behavior, even if it is correcting a compiler bug, might be a breaking change to someone, breaking changes are allowed if they are changing 'unintended' program behavior, or program behavior that was intended to be left undefined.

So:

--

i like how in irb, if you say a = b, it prints out the new value of a

---

UTF-8 source code files. all language-defined syntax and standard libraries are solely US-ASCII however.

http://blog.backblaze.com/2008/12/15/10-rules-for-how-to-write-cross-platform-code/

---

for versioning, maybe by default ship a compiler (and maybe even distro, with libs and all) that can do all old versions of the language. So e.g. with python, there would be no problem having some programs that use python 2 and others that use python 3 on the same system, or some that use 2.4 and others that use 2.6, etc, with each program indicating its version. Alternately, force the version to be in the script name, e.g. there is no 'jasper', only a 'jasper0.00001', etc. But maybe for shell one-liners include a 'jasp' for jasper latest.

--

it's critical to design the language, the tools, and the community process so as to (a) not have releases with too many major bugs (people say that D had some trouble with this early on), (b) not break everything too often with dependency hell (people say that the Ruby on Rails ecosystem has trouble with this), but also (c) allow the language developers to make major, backwards-incompatible changes relatively quickly (e.g. the transition from Python 2 -> Python 3 has gone on for so many years that people are starting to look down on Python a little bit because it seems to lack forward momentum, even though the language itself is still very popular and growing), while (d) supporting a single canonical implementation, rather than only a canonical spec (to prevent fragmentation such as seen in Lisp, and in web browsers), and (e) providing a management process for that implementation that ensures that new major versions are actually implemented relatively quickly (as opposed to the Perl6 fiasco) and (f) allowing users to install multiple versions of the language on their systems concurrently with little pain (as opposed to the way Python used to be on some distros)

--