proj-plbook-plChSyntax

Table of Contents for Programming Languages: a survey

Chapter : syntax

character sets

hard-to-type characters

basic arithmetic

on a traditional American keyboard, many easily typed punctuation characters have obvious arithmetic meanings (+, -, *, /, ^, =). should a language stick with these, or use them for something more common and use other characters for arithmetic?

identifier lexical syntax

matching pairs of symbols

quite useful for nested grouping

see also http://en.wikipedia.org/wiki/Bracket#Computing

/\, <>, [], {}, () any others in ASCII?

symmetric vs asymmetric symbols

significant whitespace

EOL as statement separator

e.g. instead of ';' in C

one-liner support

one-liners need a way to replace every function served by EOLs; usually semicolons stand in for EOLs as statement separators, and braces group blocks.

some languages, e.g. Python, don't fully support one-liners. other languages, e.g. Haskell, support both a significant-whitespace mode (Haskell calls this 'layout') and one-liners
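A minimal sketch of the limits of one-liner support in Python: semicolons can separate simple statements, but compound statements cannot be nested on one line, so not every multi-line program can be collapsed.

```python
# Semicolons replace EOLs for simple statements:
total = 0; n = 5

# A simple statement may follow a block header's colon on the same line:
if n > 3: total = n

# But a compound statement may not, so this is a SyntaxError:
#   if n > 3: for i in range(n): total += i
print(total)
```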

function definition

..and lambdas

need to separate arguments from each other, and all arguments from function body

some languages (e.g. MATLAB) also define return values

some languages (e.g. MATLAB) have a syntax that conveniently allows you to copy-and-paste the function definition into your function call site to get started, and then change the name of variables from there

multiple return values

tuples

keyword args (named args)

keyword return args (named return args)

note that sometimes you don't have to re-list all of the named return args in a statement like 'return'; they are implicit because you can assign to variables with the given return arg names. e.g. Go: https://tour.golang.org/basics/7

default args
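The argument features above can be sketched together in Python, which has keyword args and default args, and models multiple return values as a tuple (Python has no named return args):

```python
# Default arg, keyword arg, and multiple return values via a tuple.
def divide(numerator, denominator=1):
    q, r = divmod(numerator, denominator)  # divmod returns a 2-tuple
    return q, r                            # "multiple" return values

q, r = divide(7, denominator=2)  # keyword arg; tuple destructured at call site
assert (q, r) == (3, 1)
assert divide(7) == (7, 0)       # default denominator=1 used
```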

fn application

f(x) syntax, or prefix syntax (e.g. Lisp's "(f x)"), or juxtaposition (e.g. Haskell's "f x")

infix ops

fixed or custom

infixify a prefix op

section an infix operator to get a prefix function

associativity

When operators are of equal precedence, how do we parse them?

If the operators are "non-associative", that means that this situation is a syntax error; the programmer must explicitly group non-associative operators of equal precedence.

If the operators are "associative", this means that they have the same meaning no matter how we parenthesize them, so it doesn't matter. This is a mathematical (algebraic) property of the operators, not a syntax choice (although a programming language may or may not explicitly recognize this property).

If the operators are "left associative" then "a b c" is parsed as "((a b) c)". If the operators are "right associative" then "a b c" is parsed as "(a (b c))".

Left associativity is useful for curried-style argument passing to functions: e.g. "f a b" is parsed as "(f a) b", which is the curried version of f(a,b) (see currying, todo). A programming language example is Haskell; Haskell function application is left-associative.

Right associativity is useful for function composition or one-liner 'pipelines': eg "g f x" is parsed as "g (f x)". A programming language example is the APL-derivative "J"; J is right-associative.
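Both parse choices can be observed within one language; in Python, binary "-" is left-associative while "**" is right-associative:

```python
# Left-associative: 10 - 4 - 3 parses as ((10 - 4) - 3), not (10 - (4 - 3)).
assert 10 - 4 - 3 == (10 - 4) - 3 == 3

# Right-associative: 2 ** 3 ** 2 parses as (2 ** (3 ** 2)) == 2 ** 9.
assert 2 ** 3 ** 2 == 2 ** (3 ** 2) == 512
```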

Examples of mathematical associativity

examples of commutative, associative functions: addition, multiplication, max, min, boolean AND/OR

examples of non-commutative, non-associative functions: subtraction, division, exponentiation

examples of non-commutative, associative functions: matrix multiplication, string concatenation, function composition

examples of commutative, non-associative functions: midpoint (average of two numbers), NAND, NOR

(infix) precedence

traditional precedence table

todo; but *,/ > +,- > comparisons > boolean ops
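Python's precedence follows this traditional table, which can be checked directly:

```python
# * binds tighter than +:
assert 2 + 3 * 4 == 14          # not 20

# + binds tighter than comparisons: this compares 4 with 5.
assert not (2 + 2 == 5)

# "and" binds tighter than "or": True or (False and False), not
# (True or False) and False (which would be False).
assert (True or False and False) is True
```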

if too many levels, programmers can't make use of them anyway, because they can't remember the precedence and it's easier to use parens than to look it up; but they will still pay a price when they see someone else's code without parens and must look up the table to parse it. custom precedence is perhaps an extreme version of this

if too few levels, programmers must use lots of parens. extreme version of this: no infix at all, e.g. Lisp

examples and comparisons of precedence in various languages:

promotion

partial fn application

currying
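Partial application and currying can both be sketched in Python, using the standard library's functools.partial for the former and nested one-argument functions for the latter:

```python
from functools import partial

def add(x, y):
    return x + y

# Partial application: fix the first argument, get a 1-arg function back.
add3 = partial(add, 3)
assert add3(4) == 7

# Currying: rewrite the function to take its arguments one at a time.
curried_add = lambda x: lambda y: x + y
assert curried_add(3)(4) == 7
```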

variadic

and variadic keywords
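Python has both forms: *args collects extra positional arguments into a tuple, and **kwargs collects extra keyword arguments into a dict:

```python
def report(*args, **kwargs):
    # args is a tuple of positional arguments; kwargs a dict of keyword ones.
    return args, kwargs

a, k = report(1, 2, color="red")
assert a == (1, 2)
assert k == {"color": "red"}
```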

glob expansion

homoiconicity

compile-time conditionals

like C's #ifdef, Nimrod's "when" http://nimrod-code.org/tut1.html#when-statement

destructuring bind
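Python's tuple unpacking is one form of destructuring bind, including nested patterns and a starred "rest" target:

```python
# Nested destructuring: the shape of the target mirrors the value.
(x, y), z = (1, 2), 3
assert (x, y, z) == (1, 2, 3)

# Starred target collects the rest of the sequence.
head, *tail = [10, 20, 30]
assert head == 10 and tail == [20, 30]
```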

sugar for assignments

e.g. 'x = 3' instead of '(let x 3)'

let

" Let/Const

The ES6 ‘let’ feature is similar to ‘var’, but it aims to simplify the mental model for the variable’s scope. With ‘let’, you can scope variables to blocks of code rather than whole functions. For example:

  function f() {
    let total = 0;
    let x = 5;
    for (let x = 1; x < 10; x++) {
      total += x;
    }
    console.log(x);
  }

  f(); // outputs 5

Notice how the two ‘let x’ statements do not conflict. This is because the one used in the loop is in a different scope than the one outside of the loop. If we re-wrote this using vars, all ‘x’ vars effectively combine into one, leading to rather confusing output.

  function f() {
    var total = 0;
    var x = 5;
    for (var x = 1; x < 10; x++) {
      total += x;
    }
    console.log(x);
  }

  f(); // outputs 10

" -- http://blogs.msdn.com/b/typescript/archive/2015/01/16/announcing-typescript-1-4.aspx

complex l-values or similar

an l-value is the thing on the left of the equals sign in an assignment statement using 'a = b' syntax, e.g. in 'x = 3', the l-value is 'x'.

A complex l-value is when a location must be resolved instead of being immediately given, e.g. "x[2] = 3"

a similar thing is how in Common Lisp you can do (assuming 'behavior' is a property list, that is, a dict/map/hash):

" (setf (get 'dog 'behavior)
        '(lambda () (wag-tail) (bark))) "

here, instead of resolving (get 'dog 'behavior) to a value immediately, setf takes the LOCATION being read by (get 'dog 'behavior), and then assigns to that location.
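The same location-resolving idea can be sketched in Python, where a complex l-value like "x[2] = 3" desugars to a __setitem__ call on the container rather than evaluating x[2] to a value:

```python
x = [0, 1, 2, 4]
x[2] = 3             # complex l-value: resolves a location inside the list
x.__setitem__(3, 9)  # the same operation, spelled out explicitly
assert x == [0, 1, 3, 9]
```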

expression evaluation order

in C, mostly unspecified, except for short-circuit operators

e.g. in C, code like "a = b() + c()" can call b() and c() in either order. If they have side effects then this might matter, yet no compiler error is given. However, the evaluation order of a() && b() IS specified: a() first, then b() only if a() was nonzero.
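A sketch of the short-circuit case in Python (whose "and" behaves like C's && here, though Python additionally specifies left-to-right evaluation in general):

```python
calls = []  # record which functions actually ran, and in what order

def b():
    calls.append("b")
    return False

def c():
    calls.append("c")
    return True

result = b() and c()
assert result is False
assert calls == ["b"]  # c() was never called: "and" short-circuited
```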

implicit parameter special variable names

e.g. in bash, $0 is the name of the current script, $1 is the first positional parameter that was passed in, $2 is the second positional parameter, etc

in some languages a similar scheme for special variables for positional parameters is at the function level

blocks as first-class functions

e.g. ruby: ruby blocks

e.g. Apple Swift: blocks e.g. "let sortedCities = sort(cities) { $0 < $1 }"

e.g. Objective C: http://arstechnica.com/apple/2009/08/mac-os-x-10-6/10/#blocks

(note: goes well with implicit parameter special variable names)
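Python's closest analogue to these blocks is passing a lambda to a higher-order function; note Python has no implicit $0/$1-style parameters, so they must be named explicitly:

```python
cities = ["Oslo", "Berlin", "Lima"]

# Roughly the Python counterpart of Swift's sort(cities) { $0 < $1 }:
# a function literal passed at the call site.
sorted_cities = sorted(cities, key=lambda c: c.lower())
assert sorted_cities == ["Berlin", "Lima", "Oslo"]
```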

conditionals and assignments

e.g. in C

http://stackoverflow.com/questions/151850/why-would-you-use-an-assignment-in-a-condition

however this can cause confusion between = and ==:

https://www.securecoding.cert.org/confluence/display/cplusplus/EXP19-CPP.+Do+not+perform+assignments+in+conditional+expressions

e.g. in Apple Swift:

" if let indexOfLondon = find(sortedCities, "London") { println("London is city number \(indexOfLondon + 1) in the list") } "
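Python 3.8's "walrus" operator := allows an assignment inside a condition, similar in spirit to Swift's "if let" above (though without Swift's optional-unwrapping):

```python
import re

# Bind the match object and test it in one expression.
if (m := re.search(r"\d+", "city number 7")):
    assert m.group() == "7"
else:
    raise AssertionError("expected a match")
```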

trailing conditionals

e.g. "print 'big' if x > 3"

explicit type conversion

some choices:

"constructor syntax": e.g. String(2) == "2"

"conversion function syntax": conv(2, Int, String)

"generic conversion-to function syntax": conv(2, String)

no syntax, and conversion functions: intToString(2)

no syntax, and generic conversion-to functions: toString(2)

no syntax, and generic conversion-from functions: fromInt(2) :: String (note: in languages with type inference the destination type may be inferred instead of annotated, making this skirt the line between implicit and explicit)

note: the following is actually implicit, not explicit: "annotation syntax": 2 :: String
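Python mostly takes the "constructor syntax" choice from the list above: applying the type name as a function performs explicit conversion:

```python
assert str(2) == "2"      # constructor syntax, int -> str
assert int("2") == 2      # str -> int
assert float(2) == 2.0    # int -> float
```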

the dangling else issue

http://en.wikipedia.org/wiki/Dangling_else

* for infinity

In Perl6, '*' is sometimes used for the constant 'infinity', for example:

  my $new-password = ('a'..'z','A'..'Z',0..9).roll(12).join; # 'ygHIHbi4XgUV'
  my @dice-rolls := ('⚀'..'⚅').roll(*); # infinite list of dice rolls
  my @deck = 2..10,<J Q K A> X~ <♡ ♢ ♣ ♠>;
  my @shuffled = @deck.pick(*);

(thanks [4])

pronouns/'current' variable

In Perl, "$_" is the 'current' variable, functioning similarly to the pronoun 'it' in English. E.g.:

  ( .say ) is short for ( $_.say )
  for 0..5 { .say }  # prints the first 6 numbers (0..5) each on its own line

Common functions

Logical operations

Links:

misc thoughts

" Nowadays we have a principle in Perl, and we stole the phrase Huffman coding for it, from the bit encoding system where you have different sizes for characters. Common characters are encoded in a fewer number of bits, and rarer characters are encoded in more bits.

We stole that idea as a general principle for Perl, for things that are commonly used, or when you have to type them very often – the common things need to be shorter or more succinct. Another bit of that, however, is that they’re allowed to be more irregular. In natural language, it’s actually the most commonly used verbs that tend to be the most irregular.

And there’s a reason for that, because you need more differentiation of them. One of my favourite books is called The Search for the Perfect Language by Umberto Eco, and it’s not about computer languages; it’s about philosophical languages, and the whole idea that maybe some ancient language was the perfect language and we should get back to it.

All of those languages make the mistake of thinking that similar things should always be encoded similarly. But that’s not how you communicate. If you have a bunch of barnyard animals, and they all have related names, and you say “Go out and kill the Blerfoo”, but you really wanted them to kill the Blerfee, you might get a cow killed when you want a chicken killed.

So in realms like that it’s actually better to differentiate the words, for more redundancy in the communication channel. The common words need to have more of that differentiation. It’s all about communicating efficiently, and then there’s also this idea of self-clocking codes. If you look at a UPC label on a product – a barcode – that’s actually a self-clocking code where each pair of bars and spaces is always in a unit of seven columns wide. You rely on that – you know the width of the bars will always add up to that. So it’s self-clocking.

There are other self-clocking codes used in electronics. In the old transmission serial protocols there were stop and start bits so you could keep things synced up. Natural languages also do this. For instance, in the writing of Japanese, they don't use spaces. Because the way they write it, they will have a Kanji character from Chinese at the head of each phrase, and then the endings are written in a syllabary. " -- http://www.linuxvoice.com/interview-larry-wall/

" The syntax needs to be something your target audience will like.

Trying to go with something they've not seen before will make language adoption a much tougher sell.

I like to go with a mix of familiar syntax and aesthetic beauty. It's got to look good on the screen. After all, you're going to spend plenty of time looking at it. If it looks awkward, clumsy, or ugly, it will taint the language.

There are a few things I (perhaps surprisingly) suggest should not be considerations. These are false gods:

    Minimizing keystrokes. Maybe this mattered when programmers used paper tape, and it matters for small languages like bash or awk. For larger applications, much more programming time is spent reading than writing, so reducing keystrokes shouldn't be a goal in itself. Of course, I'm not suggesting that large amounts of boilerplate is a good idea.
    Easy parsing. It isn't hard to write parsers with arbitrary lookahead. The looks of the language shouldn't be compromised to save a few lines of code in the parser. Remember, you'll spend a lot of time staring at the code. That comes first. As mentioned below, it still should be a context-free grammar.
    Minimizing the number of keywords. This metric is just silly, but I see it cropping up repeatedly. There are a million words in the English language, I don't think there is any looming shortage. Just use your good judgment.

Things that are true gods:

    Context-free grammars. What this really means is the code should be parsable without having to look things up in a symbol table. C++ is famously not a context-free grammar. A context-free grammar, besides making things a lot simpler, means that IDEs can do syntax highlighting without integrating most of a compiler front end. As a result, third-party tools become much more likely to exist.
    Redundancy. Yes, the grammar should be redundant. You've all heard people say that statement terminating ; are not necessary because the compiler can figure it out. That's true — but such non-redundancy makes for incomprehensible error messages. Consider a syntax with no redundancy: Any random sequence of characters would then be a valid program. No error messages are even possible. A good syntax needs redundancy in order to diagnose errors and give good error messages.
    Tried and true. Absent a very strong reason, it's best to stick with tried and true grammatical forms for familiar constructs. It really cuts the learning curve for the language and will increase adoption rates. Think of how people will hate the language if it swaps the operator precedence of + and *. Save the divergence for features not generally seen before, which also signals the user that this is new.

As always, these principles should not be taken as dicta. Use good judgment. Any language design principle blindly followed leads to disaster. The principles are rarely orthogonal and frequently conflict. It's a lot like designing a house — making the master closet bigger means the master bedroom gets smaller. It's all about finding the right balance. " -- So You Want To Write Your Own Language? By Walter Bright


Very useful stats:

http://xahlee.info/comp/computer_language_char_distribution.html

---