
proj-plbook-plChParsingImpl

Table of Contents for Programming Languages: a survey

Chapter : parsing (and lexing)

The goal of these stages is to start with a text file and to end up either with a parse tree, or an informative syntax error.

todo intro to parse trees

The syntax of a programming language is usually represented by a grammar. A grammar specifies not only which textual strings are syntactically valid, but also how they should be parsed into a parse tree. todo intro grammars

The usual notation for a grammar is EBNF. todo intro BNF, EBNF. Todo explain the two interpretations of EBNF, see section 2.3.2.3, page 35 of http://dickgrune.com/Books/PTAPG_1st_Edition/

The general pipeline goes: source text -> lexer (tokenization) -> stream of tokens -> parser -> parse tree.

In some languages, this sort of thing is repeated in multiple passes. For example, languages that have macros that are expanded at parse time need to first go thru and identify and expand the macros, and then to lex and parse the transformed source code.

A token is usually taken to be a pair consisting of a token type and a token value. For example, the integer literal 3 might be processed to a token containing (INTEGER_LITERAL, 3). In practice tokens may also have other information such as a pointer to say where in the source code they occur, to aid later debugging. The part of the source code corresponding to a certain token is called a 'lexeme'.

Tokenization usually proceeds by scanning the input to delimit and classify lexemes, and also running some lexemes through a lexeme evaluator to produce the token values.

The scanner is usually specified in terms of regular expressions (regexps), or some sort of extended regular expressions.
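For example, here is a minimal regex-based scanner sketch in Python; the token names and token classes are made up for illustration and not taken from any particular language. It delimits and classifies lexemes, and runs a small evaluator on integer lexemes to produce token values, as described above:

import re

TOKEN_SPEC = [
  ('INTEGER_LITERAL', r'[0-9]+'),
  ('IDENTIFIER',      r'[A-Za-z_][A-Za-z_0-9]*'),
  ('OP',              r'[+\-*/=()]'),
  ('SKIP',            r'[ \t]+'),
]
MASTER_RE = re.compile('|'.join('(?P<%s>%s)' % pair for pair in TOKEN_SPEC))

def tokenize(source):
  # yield (token type, token value) pairs; a real lexer would also record
  # source positions for debugging, and would report characters that match
  # no pattern (this sketch silently skips them)
  for m in MASTER_RE.finditer(source):
    kind, lexeme = m.lastgroup, m.group()
    if kind == 'SKIP':
      continue
    value = int(lexeme) if kind == 'INTEGER_LITERAL' else lexeme
    yield (kind, value)

print(list(tokenize('x = 3 + 42')))
# [('IDENTIFIER', 'x'), ('OP', '='), ('INTEGER_LITERAL', 3), ('OP', '+'), ('INTEGER_LITERAL', 42)]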

Highly non-local properties, such as matching parentheses, are put off until the parsing stage.

The parsing stage takes a stream of tokens as input, and, using a grammar over tokens, produces a parse tree as output.

Code that does parsing is called a 'parser', and it is usually governed by a specification of valid parse trees (and hence of valid strings) called a 'grammar'. There are different classes of grammars; various classes have various expressive power, and various types of parsers with various efficiencies and properties are available for the different classes.

Most common grammars and parsers can do lexical analysis, too. However, in practice, lexical analysis is usually done as a separate stage for reasons of both efficiency and simplicity.

When designing a programming language, you may want to choose the syntax in such a way as to make it efficient and easy to tokenize and to parse. todo what proportion of compiler/interpreter time is usually spent in lexing and parsing -- i've heard the parse time can be significant? The part of syntax corresponding to lexical analysis is often called 'lexical syntax'. To make lexical analysis efficient and easy, you should choose a lexical syntax amenable to scanning with regular expressions (best) or extended regular expressions (not as good, but still good).

To make it efficient and easy to parse, you should choose a type of grammar over tokens for which a standard and efficient type of parser is available.

Note that many popular languages do not follow these prescriptions, and use ad-hoc algorithms for lexing and parsing that are more complex than simply scanning with regexes, running token evaluators on lexemes, and then using standard parsing algorithms with a common grammar type.

What kind of grammar over tokens should you choose for your programming language? What sorts of types of grammars are there for which standard and efficient parsers are known, and what are their trade-offs and limitations? I've found it surprisingly hard to find a concise overview of answers to these questions on the web (please send me any if you find them).

The clearest authoritative recommendation that I've found is:

"If one has the luxury of being in a position to design the grammar oneself, the choice is simple: design the grammar to be LL(1) and use a predictive recursive descent parser. It can be generated automatically, with good error recovery, and allows semantic routines to be included in-line. This can be summarized as: parsing is a problem only if someone else is in charge of the grammar." -- http://dickgrune.com/Books/PTAPG_2nd_Edition/

I was surprised to find that recommendation since there are so many other grammars which are more expressive than LL(1); when I initially started looking this stuff up, I had gotten the impression that we had moved beyond LL(1), which is one of the most constraining types of grammar, and that it would be a good idea to write LALR(1) languages. But I guess not. Other commentators, although not as authoritative as the authors of PTAPG, often seemed to agree; a common sentiment was that working with extant LALR and LR parser generators is not ideal (and writing them for oneself is a long and difficult task, at least when one includes the necessity of error reporting) and that it's better if you have at least the option of avoiding them.

Another somewhat commonly observed sentiment is that mixing operator precedence parsing with predictive recursive descent LL(1) parsing is okay. So if you want your recursive descent parser to be more efficient, or to represent operator precedence more clearly than a BNF grammar does, you can mix in operator precedence parsing, e.g. Top Down Operator Precedence (also known as Pratt parsing).
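As a rough illustration, here is a small precedence-climbing (Pratt-style) expression parser in Python, mixed into an otherwise ordinary recursive descent structure. The token set, the INFIX_PRECEDENCE numbers, and the 'neg' node name are invented for this sketch; it also shows one way to give unary minus its traditional binding (looser than '^', tighter than '*'):

import re

def tokenize_expr(src):
  # split the input into number and operator tokens, plus an end marker
  return re.findall(r'\d+|[-+*/^()]', src) + ['<end>']

INFIX_PRECEDENCE = {'+': 10, '-': 10, '*': 20, '/': 20, '^': 30}
RIGHT_ASSOC = {'^'}

def parse_expression(tokens, pos=0, min_prec=0):
  # parse an expression whose operators all bind at least as tightly as min_prec
  node, pos = parse_prefix(tokens, pos)
  while tokens[pos] in INFIX_PRECEDENCE and INFIX_PRECEDENCE[tokens[pos]] >= min_prec:
    op = tokens[pos]
    prec = INFIX_PRECEDENCE[op]
    next_min = prec if op in RIGHT_ASSOC else prec + 1
    right, pos = parse_expression(tokens, pos + 1, next_min)
    node = (op, node, right)
  return node, pos

def parse_prefix(tokens, pos):
  # handle atoms, parenthesized subexpressions, and prefix (unary) minus
  tok = tokens[pos]
  if tok == '-':
    # unary minus binds tighter than '*' but looser than '^',
    # so -2^4 parses as -(2^4) and -2*3 as (-2)*3
    operand, pos = parse_expression(tokens, pos + 1, INFIX_PRECEDENCE['*'] + 1)
    return ('neg', operand), pos
  if tok == '(':
    node, pos = parse_expression(tokens, pos + 1, 0)
    assert tokens[pos] == ')', 'expected )'
    return node, pos + 1
  return int(tok), pos + 1

print(parse_expression(tokenize_expr('-2^4 + 3*4'))[0])
# ('+', ('neg', ('^', 2, 4)), ('*', 3, 4))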

If you are writing your compiler or interpreter in a functional language such as Haskell or Scala, you may be able to use parser combinators instead of parser generators (see http://stackoverflow.com/questions/5507665/are-there-any-ll-parser-generators-for-functional-languages-such-as-haskell-or-s ). Whether or not parser combinators are available, you could write your own predictive recursive descent parser.
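To make the idea concrete, here is a tiny hand-rolled parser combinator sketch in Python (not a real library; the helpers lit, seq, and alt are invented for illustration, and real combinator libraries for Haskell, Scala etc. offer much more). A parser is just a function from (text, position) to either (value, new position), or None on failure:

def lit(s):
  # match the literal string s at the current position
  def parse(text, pos):
    if text.startswith(s, pos):
      return s, pos + len(s)
    return None
  return parse

def seq(*parsers):
  # run several parsers one after another, collecting their results
  def parse(text, pos):
    values = []
    for p in parsers:
      result = p(text, pos)
      if result is None:
        return None
      value, pos = result
      values.append(value)
    return values, pos
  return parse

def alt(*parsers):
  # prioritized choice: try the alternatives in order, keep the first success
  def parse(text, pos):
    for p in parsers:
      result = p(text, pos)
      if result is not None:
        return result
    return None
  return parse

# balanced parentheses: Parens -> '(' Parens ')' Parens | epsilon
def parens(text, pos):
  return alt(seq(lit('('), parens, lit(')'), parens), lit(''))(text, pos)

print(parens('(()())', 0))  # consumes the whole string: (..., 6)

Note that alt implements prioritized choice, which is discussed further below.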

However, you should still write down an LL(1) grammar in BNF (todo: will EBNF do?) and run your grammar thru an LL(1) parser generator first to make sure that (a) it is LL(1) (this also tells you that it is unambiguous), and (b) so that you have a platform-agnostic representation of the grammar.

Random note: minus ('-') is often a special case, because in traditional arithmetic notation it can be either unary negation or binary infix subtraction. With some parsing techniques, distinguishing the two uses may be impossible (without extra passes or some other complication), and with some it may be simpler to have all unary operators bind tighter than all infix operators, which does not match traditional arithmetic, in which -2^4 = -(2^4), not (-2)^4. Some languages even go so far as to make the programmer use a different symbol for unary negation and binary subtraction.

Another note: even if you won't constrain yourself to LL(k), you may want to avoid requiring a symbol table at parse time. E.g. in C, you need a symbol table to parse, because you need to know if some identifiers are type names or not in order to parse some lines containing those identifiers. This apparently makes parsing significantly more complex, and the designers of some languages such as Golang which are known for fast compilation proudly declare that they avoid this.

Some tutorials and articles on recursive descent parsing and Pratt parsing:

Links that list programming languages and observations about what types of grammar or parser they use:

Advanced topics in choice of grammar

Some types of common grammars are, in declining order of expressiveness:

Van Wijngaarden grammars, also called W-grammars, which are two-level grammars where a meta-grammar is specified which determines the actual phrase-structure grammar. PTAPG 1st ed. says that, unlike unrestricted one-level grammars, Van Wijngaarden grammars can be written in a clear manner, but that good parsing techniques for them were not yet available (at that time).

The other grammars that we will consider are called phrase structure grammars.

Unrestricted (phrase structure) grammars are simply the class that includes all (phrase structure) grammars. They are equivalent to Chomsky Type 0 grammars (by which we mean, the set of languages that can be recognized by unrestricted grammars is equal to the set of languages that can be recognized by Chomsky Type 0 grammars), which have the restriction that the left side of the production rules includes only nonterminals.

Context-sensitive grammars are grammars with production rules of the form aXb -> ayb, where a and b are any string of terminals and/or non-terminals (and may be empty strings), X is exactly one nonterminal, and y is a non-empty string of terminals and/or nonterminals; or of the form S -> epsilon (where epsilon represents the empty string), where S does not appear on the right-hand side of any production. They are equivalent to noncontracting grammars, grammars whose production rules all obey the constraint that the string length of the left side of the rule is not longer than the string length of the right side of the rule. This noncontraction property is the key difference between context-sensitive grammars and unrestricted grammars, and it allows context-sensitive grammars to be recognized by linear-bounded automata (restricted Turing machines which only have an amount of tape linear in the length of the input).

Context-free grammars are grammars with production rules of the form X -> y, where X is exactly one nonterminal, and y is any string of terminals and/or nonterminals, possibly empty. The lack of any context around the nonterminal being consumed/expanded is the key difference between context-free grammars and context-sensitive grammars.

Right regular grammars are grammars with all production rules of one of the three following forms: X -> a, X -> aY, X -> \epsilon, where X is exactly one nonterminal, a is exactly one terminal, and Y is exactly one nonterminal. Right regular grammars are equivalent to left regular grammars (where the second rule is replaced by X -> Ya), so we just speak of "regular grammars". Regular grammars can be represented by regular expressions.

PTAPG 1st. ed chapter 11, page 239 says "Unrestricted (Type 0) and context-sensitive (Type 1) grammars Unrestricted (Type 0) and context-sensitive (Type 1) grammars are hardly used since, first, they are user-unfriendly in that it is next to impossible to construct a clear and readable Type 0 or Type 1 grammar and, second, all known parsers for them have exponential time requirements."

Regular expressions are not expressive enough to express desired syntactical properties such as matching parentheses.
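For example, the language of balanced parentheses is generated by the following context-free grammar (a standard textbook example), but cannot be described by any regular expression, since matching it requires keeping track of arbitrarily deep nesting:

S -> ( S ) S
S -> epsilon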

Therefore we will restrict further attention to grammars which are less expressive than context-sensitive grammars and more expressive than regular expressions.

A further desirable property is linear-time parsing. Let's further restrict our attention to grammars for which linear-time parsers are known (if you want something more general, you can use an Earley or GLR (Tomita) parser, which are O(n^3); see the free online old edition of PTAPG, section 11.2, page 250; PTAPG says that Tomita (GLR) is the best).

Linear-time-parsable grammars

note: stuff in the above with k's is only a hierarchy when the k is the same, e.g. LL(1) < LALR(1) but LL(k) is incomparable with LALR(1) (neither contains the other); i think todo

the k's represent the amount of lookahead. E.g. LL(1) is LL with 1 lookahead.

Note that all of the above are unambiguous grammars, with two exceptions: some operator-precedence grammars are ambiguous (however, every language recognized by an operator-precedence grammar is a deterministic context-free language, which means that there exists some unambiguous context-free grammar that recognizes it), and some PEG grammars would have been ambiguous as CFGs, except that the PEG formalism resolves all ambiguity by choosing the first alternative.

Parser generators exist that try to transform a BNF or EBNF grammar that you give it into a more restricted, linear-time-parsable form like LALR(1), and generate a linear-time parser if it can do so.

http://en.wikipedia.org/wiki/Comparison_of_parser_generators shows that LR(1), LALR(1), LL(*), LL(k), LL(1), and packrat parser generators are all popular.

The popular ANTLR uses something they made up called LL(*), which is O(n^2) time according to http://www.antlr.org/papers/LL-star-PLDI11.pdf , but which "in practice" is often quite efficient.

on operator precedence

Operator precedence can be mixed with other parsing techniques.

Vaughan Pratt's "Top Down Operator Precedence", sometimes called Pratt parsing, is a parsing technique that combines Recursive Descent and Floyd's Operator Precedence. See also http://javascript.crockford.com/tdop/tdop.html , http://eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing/ , https://news.ycombinator.com/item?id=2344837 , http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/ , http://en.wikipedia.org/wiki/Pratt_parser , http://effbot.org/zone/simple-top-down-parsing.htm . See also http://en.wikipedia.org/wiki/Operator-precedence_parser

" Operator-precedence parsers are very easy to construct (often even by hand) and very efficient to use; operator-precedence is the method of choice for all parsing prob- lems that are simple enough to allow it. That only a skeleton parse tree is obtained, is often not an obstacle, since operator grammars often have the property that the seman- tics is attached to the operators rather than to the right-hand sides; the operators are identified correctly. It is surprising how many grammars are (almost) operator-precedence. Almost all formula-like computer input is operator-precedence. Also, large parts of the grammars of many computer languages are operator-precedence. " -- PTAPG 1st ed. section 9.2.1, page 190

from http://en.wikipedia.org/wiki/Operator-precedence_parser , it seems that if you are only using operator-precedence grammars for unambiguous constructs like ordinary operator precedence, then you can already express operator precedence as part of an LL(1) grammar, and the only advantages of mixing an operator-precedence parser into plain recursive descent are:

so, in the absence of custom precedence levels, this is an optimization.
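For instance, the textbook way to encode two precedence levels directly in an LL(1) grammar (a generic sketch, not tied to any particular language) is to stratify the rules, with tighter-binding operators lower down:

Expr   -> Term Expr'
Expr'  -> + Term Expr' | - Term Expr' | epsilon
Term   -> Factor Term'
Term'  -> * Factor Term' | / Factor Term' | epsilon
Factor -> NUMBER | ( Expr )

Here '*' and '/' bind tighter than '+' and '-' simply because Term sits below Expr in the grammar; an operator-precedence or Pratt parser encodes the same information in a precedence table instead.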

what that page calls the "Precedence climbing method" seems to be the thing that corresponds to Pratt parsing.

prioritized choice (i.e. resolving ambiguity by choosing the first option listed in the grammar specification)

PEGs and some parser combinators use this method of resolving ambiguity.
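For example, in PEG notation (where '/' is ordered choice), a dangling-else rule like the following generic sketch is not ambiguous the way the corresponding CFG is; the ordering silently decides that an 'else' attaches to the nearest 'if':

Stmt <- 'if' Cond 'then' Stmt 'else' Stmt
      / 'if' Cond 'then' Stmt
      / Other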

Some commentators think this is dangerous, because it makes it possible that the author of the grammar may not be made aware of the initial ambiguity:

" >> parser combinators don't warn you about ambiguity, LL/LR do.

> Parser combinators don't normally warn you about "ambiguity" because they normally have unambiguous semantics (different choices are always tried in order, by design). In my opinion, this is not better or worse, just different.

I strongly believe (and make a case for such in the article) that they are not "just different", but that making everything prioritized choice is actively harmful. Prioritized choice hides real language issues like the "dangling else" problem and the C++ "most vexing parse" where two interpretations of a program can look equally valid to a user, but one is declared "correct" arbitrarily. With an abstraction like PEGs or parser combinators, you are flying blind about whether your grammar contains such issues. "

more on parser combinators

http://stackoverflow.com/questions/5507665/are-there-any-ll-parser-generators-for-functional-languages-such-as-haskell-or-s

says that "The major reason for this is that most LL(k) parsers that are written in functional languages are just implemented using parser combinators, because the easiest path to generate a parser combinator library is recursive descent."

in other words if you have an LL(k) language, and your compiler or interpreter is written in a language with sufficiently powerful functional features, then you can (and should) just use a parser combinator library rather than a parser generator.

"One of my biggest problems with parser combinators is that they tightly couple the grammar to the host language. I believe that one of the biggest problems with parsers today is that they are usually not very reusable. By writing a parser combinator in (say) Haskell, it can only be used from Haskell (or, awkwardly, by embedding Haskell). If you have to write a different parser for every host language, then you have to write O(N2) parsers to parse N languages from N languages." -- http://www.reddit.com/r/programming/comments/1lx0oc/ll_and_lr_in_context_why_parsing_tools_are_hard/

" I have no objections to your problems with prioritized choice. I want to add, however, that I don't think prioritized choice is a fundamental part of parser combinators. In particular, context-free grammars specified using applicative parser combinators are still amenable to these desirable static checks, although I don't know of any implementations that actually do it. "

" Also, some real-life ambiguities (like C's type/variable ambiguity) are not resolved the same way every time, but are resolved based on semantic context. How can parser combinators do this?

Monadic parser combinators can be arbitrarily context-sensitive and allow you to backtrack. This is simultaneously the source of their power and their inefficiencies. You only have to suffer the inefficiencies if you use the extra power, though.

Since it's embedded in a host language, you can even patch up holes later without necessarily having to backtrack, but still within the parser, if you can't come up with a cleaner way to do it. It could be through some sort of knot-tying trick or perhaps something in a monad transformer stack that provides the power for you. "

Recommendations

" The initial demands on a CF parsing technique are obvious: it should be general (i.e., able to handle all CF grammars), it should be fast (i.e., have linear time requirements) and preferably it should be easy to program. There are two serious obstacles to this naive approach to choosing a parser. The first is that the automatic generation of a linear-time parser is possible only for a subset of the CF grammars. The second is that, although this subset is often described as “very large” (especially for LR(1) and LALR(1)), experience shows that a grammar that is designed to best describe the language without concern for parsing is virtually never in this set. What is true, though, is that for almost any arbitrary grammar a slightly different grammar can be found that generates the same language and that does allow linear-time parsing; finding such a grammar, however, almost always requires human intervention and cannot be automated. Using a modified grammar has the disadvantage that the resulting parse trees will differ to a certain extent from the ones implied by the original grammar. " -- -- free online old edition of PTAPG , section 11.1, page 252

" If one is in a position to design the grammar along with the parser, there is little doubt that LL(1) is to be preferred: not only will parsing and performing semantic actions be easier, text that conforms to an LL(1) grammar is also clearer to the human reader. A good example is the design of Modula-2 by Wirth (see Programming in Modula-2 (Third, corrected edition) by Niklaus Wirth, Springer-Verlag, Berlin, 1985). " -- free online old edition of PTAPG , section 11.3.2, page 252

" The book (mentioned by P), Parsing Techniques - A Practical Guide by Grune & Jacobs, is very good. The final chapter is titled "Practical Parser Writing and Usage." After 546 pages of surveying practically every known parsing algorithm, the authors give this very simple and useful advice:

    If one has the luxury of being in a position to design the grammar oneself, the choice is simple: design the grammar to be LL(1) and use a predictive recursive descent parser. It can be generated automatically, with good error recovery, and allows semantic routines to be included in-line. This can be summarized as: parsing is a problem only if someone else is in charge of the grammar.

I would go further and say that, given the current state of parser generators, it may be strongly preferable to use a hand-written LL(1) recursive descent parser. "

The following book is often recommended (i haven't read it personally however): Dick Grune, Ceriel J.H. Jacobs. Parsing Techniques: A Practical Guide free online old edition of PTAPG

Case studies

misc

todo: other topics: http://en.wikipedia.org/wiki/Attribute_grammar , http://en.wikipedia.org/wiki/Syntax-directed_translation , http://en.wikipedia.org/wiki/Boolean_grammar , semantic rules in parsing , http://stackoverflow.com/questions/1857022/limitations-of-peg-grammar-parser-generators , http://blogs.ethz.ch/copton/2009/07/08/parsing-expression-grammars/ , http://compilers.iecc.com/comparch/article/92-05-059 , http://www.garshol.priv.no/download/text/bnf.html#id4.3. , http://programmers.stackexchange.com/questions/19541/what-are-the-main-advantages-and-disadvantages-of-ll-and-lr-parsing , http://stackoverflow.com/questions/6480634/examples-of-ll1-lr1-lr0-lalr1-grammars , http://nathansuniversity.com/pegs.html , http://nathansuniversity.com/pegs10.html , http://stackoverflow.com/questions/15673216/why-some-compilers-prefer-hand-crafted-parser-over-parser-generators , http://www.reddit.com/r/programming/comments/1lx0oc/ll_and_lr_in_context_why_parsing_tools_are_hard/ , http://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-tools.html , http://www.lclnet.nl/publications/pure-and-declarative-syntax-definition.pdf , http://www.cs.virginia.edu/~cs415/lectures/weimer-pl-07.pdf , semantic and syntactic predicates

https://groups.google.com/forum/#!topic/comp.compilers/tTUVATHiK2I

Abstract Syntax Trees (AST)

http://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees/

https://en.wikipedia.org/wiki/Abstract_syntax_tree esp. https://en.wikipedia.org/wiki/Abstract_syntax_tree#Design for design and https://en.wikipedia.org/wiki/Abstract_syntax_tree#Motivation for motivation of ASTs and https://en.wikipedia.org/wiki/Abstract_syntax_tree#Design_patterns

http://www.hillside.net/plop/plop2003/Papers/Jones-ImplementingASTs.pdf

Note that the transformation from parse tree (concrete syntax tree) to AST may involve arbitrary computation, and that additional syntax errors may be recognized during this computation. So, even if the grammar recognized by the parser is characterized by some simple class (eg LL(1)), this does not imply that the language's syntax as a whole is equally simple.

Take Python, for example, whose parser recognizes Python's LL(1) grammar (modulo whitespace-sensitive layout in a context-sensitive lexing stage, i believe; see [1]). In Python you have assignments that look like "x = 3". There are some special syntactic constraints on the LHS of Python assignment statements. But if the grammar is LL(1), how can the parser know if it is in an LHS before it encounters the '=' sign? The answer is that it can't (i think, todo); these constraints are enforced only after the initial parsing is done, at the time when the parse tree is transformed to an AST.

For example, the following code shows you the Python parse tree for a string of source code:

import parser, token, symbol, pprint

# thanks http://bytes.com/topic/python/answers/470920-recursive-map-nested-list
def recur_map(fun, data):
  # map fun over a nested list structure; note that in Python 3 strings are
  # iterable too, so multi-character strings get split into lists of characters
  # (which is why names like 'def' show up as [['d'], ['e'], ['f']] below)
  if hasattr(data, "__iter__"):
    return [(elem != data and recur_map(fun, elem)) or fun(elem) for elem in data]
  else:
    return fun(data)

def prettyprint_parse_tree_for_code(code_string):
  # parser.suite gives the concrete parse tree as nested lists of integer codes;
  # replace each code with its symbolic name from the symbol and token modules
  # (note: the 'parser' module was deprecated in Python 3.9 and removed in 3.10)
  parsed = parser.suite(code_string).tolist()
  parsed_mapped = recur_map(lambda x: ((type(x) == int) and x in symbol.sym_name and symbol.sym_name[x]) or ((type(x) == int) and x in token.tok_name and token.tok_name[x]) or x, parsed)
  return pprint.PrettyPrinter().pprint(parsed_mapped)

and this code shows you the AST:

import ast; ast.dump(ast.parse(code_string))

Compare the parse tree for 'a = 3 + 2' to the AST for the same code:

code_string = 'a = 3 + 2'
print(); prettyprint_parse_tree_for_code(code_string)

['file_input',
 ['stmt',
  ['simple_stmt',
   ['small_stmt',
    ['expr_stmt',
     ['testlist_star_expr',
      ['test',
       ['or_test',
        ['and_test',
         ['not_test',
          ['comparison',
           ['expr',
            ['xor_expr',
             ['and_expr',
              ['shift_expr',
               ['arith_expr',
                ['term',
                 ['factor',
                  ['power', ['atom', ['NAME', ['a']]]]]]]]]]]]]]]]],
     ['EQUAL', ['=']],
     ['testlist_star_expr',
      ['test',
       ['or_test',
        ['and_test',
         ['not_test',
          ['comparison',
           ['expr',
            ['xor_expr',
             ['and_expr',
              ['shift_expr',
               ['arith_expr',
                ['term',
                 ['factor',
                  ['power', ['atom', ['NUMBER', ['3']]]]]],
                ['PLUS', ['+']],
                ['term',
                 ['factor',
                  ['power', ['atom', ['NUMBER', ['2']]]]]]]]]]]]]]]]]]],
   ['NEWLINE', '']]],
 ['NEWLINE', ''],
 ['ENDMARKER', '']]


import ast; print(); ast.dump(ast.parse(code_string))

"Module(body=[Assign(targets=[Name(id='a', ctx=Store())], value=Num(n=3))])"

So, at what stage are the special constraints for the LHS of assignments in "a = 3 + 2" applied? They are applied during the transformation to AST. In the initial concrete parsing stage, the LHS and the RHS of this code are both just recognized as 'expr's, as you can see in the parse tree above.

To further investigate this, let's try some Python code that is illegal only on the LHS, for example 'yield 3':

code_string = 'def f():\n  (yield 3) = x'
print(); prettyprint_parse_tree_for_code(code_string)

['file_input',
 ['stmt',
  ['compound_stmt',
   ['funcdef',
    ['NAME', [['d'], ['e'], ['f']]],
    ['NAME', ['f']],
    ['parameters', ['LPAR', ['(']], ['RPAR', [')']]],
    ['COLON', [':']],
    ['suite',
     ['NEWLINE', ''],
     ['INDENT', ''],
     ['stmt',
      ['simple_stmt',
       ['small_stmt',
        ['expr_stmt',
         ['testlist_star_expr',
          ['test',
           ['or_test',
            ['and_test',
             ['not_test',
              ['comparison',
               ['expr',
                ['xor_expr',
                 ['and_expr',
                  ['shift_expr',
                   ['arith_expr',
                    ['term',
                     ['factor',
                      ['power',
                       ['atom',
                        ['LPAR', ['(']],
                        ['yield_expr',
                         ['NAME',
                          [['y'],
                           ['i'],
                           ['e'],
                           ['l'],
                           ['d']]],
                         ['yield_arg',
                          ['testlist',
                           ['test',
                            ['or_test',
                             ['and_test',
                              ['not_test',
                               ['comparison',
                                ['expr',
                                 ['xor_expr',
                                  ['and_expr',
                                   ['shift_expr',
                                    ['arith_expr',
                                     ['term',
                                      ['factor',
                                       ['power',
                                        ['atom',
                                         ['NUMBER',
                                          ['3']]]]]]]]]]]]]]]]]]],
                        ['RPAR', [')']]]]]]]]]]]]]]]]],
         ['EQUAL', ['=']],
         ['testlist_star_expr',
          ['test',
           ['or_test',
            ['and_test',
             ['not_test',
              ['comparison',
               ['expr',
                ['xor_expr',
                 ['and_expr',
                  ['shift_expr',
                   ['arith_expr',
                    ['term',
                     ['factor',
                      ['power',
                       ['atom', ['NAME', ['x']]]]]]]]]]]]]]]]]]],
       ['NEWLINE', '']]],
     ['DEDENT', '']]]]],
 ['NEWLINE', ''],
 ['ENDMARKER', '']]

import ast; print(); ast.dump(ast.parse(code_string))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/ast.py", line 35, in parse
    return compile(source, filename, mode, PyCF_ONLY_AST)
  File "<unknown>", line 2
SyntaxError: can't assign to yield expression

As we can see, this expression parses fine, and the (yield 3) on the LHS is classified as an 'expr'. But when the parse tree is transformed to an AST, there is a syntax error. This syntax error is generated in Python/ast.c, for example, http://svn.python.org/projects/python/tags/r26b3/Python/ast.c , function ast_for_expr_stmt, section ' /* a normal assignment */'. We see that there is arbitrary code being executed here that checks things like, is there an equals sign after the LHS, and is the LHS a yield (not allowed, so if so, emit an error).

So what effect do the limitations of grammar classes such as LL(1) have on a language? What happens during the initial parsing that would be hard to 'fix up' later? One way to think of it is that the task of the initial parsing is to determine GROUPING. Whether some substring of source code is inside a string or a comment or not, which subexpressions are nested within which expressions and larger blocks of code, etc. In other words, we're taking the linear string of source code and transforming it to a hierarchical tree. To get an AST, we'll transform this tree, and we may even non-locally propagate information by attaching various data to various tree nodes and by creating various extra-tree data structures, but any transformation of the topology of the tree itself will typically be LOCAL; we might replace this node and all of its descendants with a more concise AST representation (like the Assign node in the example above), but at this stage we typically won't do non-local topological transformations, for example moving the children of the cousin of a node A to underneath A (except as we apply various metaprogramming constructs such as source filters or macros).

Applying this reasoning to the example above, the parsing of '(yield 3) = x' told us that this expression breaks into '(yield 3)', '=', and 'x'; it would not be advisable to later on say that it should instead have been broken into '(yield', and '3) = x'. If you wanted a result like that, it would be better to modify the grammar, and it is here that the constraints of LL(1) might limit you. However, we are allowed to look at '(yield 3)', '=', and 'x' after initial parsing and during the conversion to AST, to categorize this as an Assignment, and to then enforce arbitrary constraints such as that the LHS may not be a 'yield'; because this does not change the grouping.


About LL(k) grammars

"

LL(k) grammars are subsets of context-free grammars.

LL(k) grammars can be parsed by 'predictive parsers'; a predictive parser is a recursive descent parser that does not require backtracking [3].

Necessary but not sufficient conditions for a grammar to be LL(k) include: the grammar must be unambiguous, and it must not be left-recursive.

A grammar is ambiguous if there is at least one string with two or more parse trees. [4]

(todo: by [5] and the following sentence the non-sufficiency of those is true for LL(1) but is it true for LL(k)?)

Any context-free grammar can be transformed into an equivalent grammar that has no left recursion, but this transformation may affect the grouping in the parse tree [6] [7].

"Ambiguous grammars are not LL(1) but unambiguous grammars are not necessarily LL(1)" -- [8] (todo: is this statement true for LL(k))

Links:

Transforming a grammar for LL(k) parsing

Eliminating left recursion

todo

http://lambda.uta.edu/cse5317/notes/node13.html
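A sketch of the standard transformation: the left-recursive grammar

E -> E + T | T

generates the same language as the non-left-recursive grammar

E  -> T E'
E' -> + T E' | epsilon

but the parse tree for a + b + c now leans to the right around E' nodes instead of to the left, which is the change in grouping mentioned earlier.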

Left factoring

todo http://lambda.uta.edu/cse5317/notes/node13.html
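A sketch of the standard transformation: when two alternatives share a common prefix, as in

S -> if E then S else S | if E then S

an LL(1) parser cannot choose between them by looking only at the next token, so the common prefix is factored out:

S  -> if E then S S'
S' -> else S | epsilon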

Example of a non-LL(k) language

http://stackoverflow.com/questions/6855666/finding-a-language-that-is-not-ll1 gives:

S -> A | B
A -> 'a' A 'b' | ε
B -> 'a' B 'b' 'b' | ε

todo: confirm. (Intuitively: at the first 'a' the parser must commit to either the A production or the B production, but whether each 'a' will eventually be matched by one 'b' or by two cannot be determined with any fixed lookahead k.)

Links for: Transforming a grammar for LL(k) parsing

Concrete grammars vs. abstract grammars

For example, you might have an abstract grammar which is ambiguous and contains left-recursion, and then transform it to a concrete grammar which is LL(1).

For example, if you want to parse a subset of expressions like those used in Haskell for type annotations, for example "Integer -> (Integer -> Bool) -> Bool", you might have in mind the abstract grammar:

Type : Integer
     | Bool
     | Type -> Type

But this is ambiguous, so you might convert it to the following LL(1) grammar:

Type:     Type1 TypeOps?
TypeOps?: epsilon
        | -> Type
Type1:    Integer
        | Bool
        | ( Type )

Note that not only is the abstract grammar ambiguous, but furthermore it does not include the parentheses.

-- [9]

The syntax for the concrete grammar is called the concrete syntax, the tree generated by parsing with the concrete grammar is called a 'concrete parse tree', etc, and similarly for the abstract grammar.

Another example is that the concrete syntax tree might contain interior nodes that are not useful to us after parsing, for example perhaps the concrete parse tree for "x3 = y + 3" might look like:

         assignment statement
        /                    \
   identifier               expression
       |                   /    |     \
       x3           expression  +  expression
                        |              |
                    identifier     identifier
                        |              |
                        y              3

while the abstract syntax tree might look like:

        =
       / \
      x3  +
         / \
        y   3

(example from [10])

misc

remember: a parenthesized expression is a representation of a tree data structure; e.g. the expression (+ a (* b c)) denotes the tree with + at the root, whose children are the leaf a and the subtree (* b c)

---

Links

todos

"

masklinn 126 days ago [-]

> In that respect, I would suggest beginning with a simple recursive-descent parser that can handle the usual maths expressions - then extending it to e.g. all the operators in a C-like language is not a huge leap (table-driven precedence-climbing can be derived from recursive descent and makes this much easier), and it also leads to the opportunity for a compiler that can parse and compile itself - unless you are also writing the compiler in Lisp.

I'd suggest starting with a Top Down Operator Precedence ("Pratt") parser, they're beautiful magic for the working man, highly declarative and slightly mind-bending in that they seem so simple they shouldn't work, they make expressions with infix priority a joy to handle.

userbinator 126 days ago [-]

TDOP, Pratt, precedence-climbing, I think they're all referring to the same, if not extremely similar, method of using a loop in the main recursive function that compares precedences and decides whether to recurse or loop:

https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing

http://javascript.crockford.com/tdop/tdop.html

They're definitely very simple, and can be derived by refactoring a recursive descent parser and observing that there's no need to make a chain of nested recursive calls to the right level when the function can be called directly to "jump" to the desired level. The simplicity is probably why it's been rediscovered several times and has different names.

masklinn 126 days ago [-]

TDOP and Pratt are exactly the same thing, Vaughan Pratt is the inventor, Top Down Operator Precedence is the name he gave to the algorithm. And TDOP never "recurses or loops" (the precedence tells it whether it should stop looping). Part of the interest is how object-oriented it is, the main parsing loop ("expression" in the Crockford article) just tells a token "parse yourself as prefix" or "parse yourself as infix preceded by (AST)"

abecedarius 126 days ago [-]

They're the same algorithm expressed in different styles: structured for precedence climbing, and as you say OO or 'data driven' for Pratt.

https://github.com/darius/sketchbook/blob/master/parsing/precedence_climbing.py#L29 https://github.com/darius/sketchbook/blob/master/parsing/pratt.py#L13 "

-- https://news.ycombinator.com/item?id=11396316

"

CJefferson 4 days ago [-]

One piece of advice I have for people writing a new language.

Consider making your language so you can do one-character lookahead, with something like the 'shunting algorithm' to handle precedence.

I work on a system called gap ( www.gap-system.org ), and when parsing we only ever need to look one character ahead to know what we are parsing. This makes the error messages easy, and amazingly good -- we can say "At this character we expected A,B or C, but we found foo". It also makes the language easy to extend, as long as we fight hard against anyone who wants to introduce ambiguity.

If your language is ambiguous, and you have to try parsing statements multiple times to find out what they are, then your error handling is always going to be incredibly hard, as you won't know where the problem arose.

Of course, if you are parsing a language given to you, you just have to do the best you can.

reply

chrisseaton 4 days ago [-]

It's a balancing act between designing syntax that makes sense for the programmer and syntax that makes sense for the parsing technology.

I'd feel a bit silly explaining to someone that some twisted or ugly syntax construct was like that because it made writing the parser a bit easier, when the parser only has to be written once but the syntax has to be used in every program written using the language.

reply

barrkel 4 days ago [-]

Code that's hard to parse is easy to get wrong in surprising ways. Difficulty in parsing almost always means ambiguous parses that need to be resolved with sufficient lookahead or semantic lookup; if the user makes a mistake, it may be in one of the ambiguous cases, which leads to surprise. In particular, error messages are impossible to keep clearly prescriptive in every case, because they need to take account of the inherent ambiguity.

reply

tyingq 4 days ago [-]

Does anyone have any examples of popular languages that are very easy to parse or ones that have a lot of ambiguity?

I assume the lispy type languages fall into the easily parsed bucket, perhaps tcl as well? And Perl may be a good example of a language that has notable ambiguity?

reply

chubot 4 days ago [-]

Python, Java. Python generates a top-down parser from Grammar/Grammar using its own pgen.c system, and the Java spec has a grammar. This doesn't mean there's no ambiguity, but they are easier.

JavaScript, Go. I think the semi-colon insertion rule kind of breaks the lexer/grammar model, but they are still easy to parse.

C++, Perl, to some extent bash. See the end of my post "Parsing Bash is Undecidable" for links: http://www.oilshell.org/blog/2016/10/20.html

I'm not as familiar with languages like PHP and C#, but they're somewhere between the easy and hard categories.

reply

munificent 4 days ago [-]

Smalltalk is famously easy to parse. They always proudly said you can fit the entire grammar on a single page. It's a really compact, neat syntax.

Pascal is also pretty clean and regular. Unlike Smalltalk, it was designed more to make the compiler's job easier, so not only is the grammar pretty regular, but it's organized such that you can parse and compile to native code in a single pass.

C is a pain to parse, it's context-sensitive in some areas. C++ inherits all of that and adds a lot more on top.

My understanding is that Perl can't be parsed without being able to execute Perl code since there are some features that let you extend the grammar in Perl itself?

reply

rurban 3 days ago [-]

Perl is mostly hard to parse because it uses a dynamic lexer. The parser itself uses a normal static yacc grammar, but the lexer drives the parse dynamically. Eg if a class or method is already defined or not, and what types to expect as arguments. It's crazy, but beautiful. Not so insane, more like a very clever hack.

reply

bandrami 4 days ago [-]

Forth is probably the simplest to parse, and when you write a forth you usually write your parser in assembly. It consists of "start at the first non-whitespace character of the pad and continue until you reach another whitespace character; that's a token" (though "token" has a different meaning in forth; it's a "word").

Lisp seems like it should be nearly as simple: recursively read in a balanced s-expression and attempt to intern the atoms in that s-expression; error out if any of those steps fails. However, Common Lisp's scanner/parser (called the "reader") allows the user to define arbitrary macros; basically the scanner has access to the full compiler if you want it to, so it can get really complicated really quickly.

reply

bandrami 4 days ago [-]

And, just to finish that thought, "read macros" aren't some abstract corner case of interest to theoreticians; they are how a whole lot of the language and libraries (e.g. complex numbers, the regex library, etc.) are implemented.

reply

Drup 4 days ago [-]

Raw lisp is obviously very easy, not sure about the various schemes. OCaml, the whole SML family are easy to parse (just an LR grammar that is easy to copy). I think Rust is the same. IIRC, Java is LL(k), so very easy.

C is not LR(1) for annoying reasons (typedef). Same for Go, IIRC. Otherwise, it's okayish. Javascript has this inane "optional semicolon" thing, which probably makes it annoying.

C++ is probably the most insane thing to parse. And then, there is bash[1].

http://www.oilshell.org/blog/2016/10/20.html

reply

steveklabnik 4 days ago [-]

Rust is 99% easy to parse, but there's one, rarely used, syntactic construct that's context-sensitive.

reply

saosebastiao 4 days ago [-]

Which one is that?

reply

steveklabnik 4 days ago [-]

Raw string literals https://doc.rust-lang.org/stable/reference.html#raw-string-l... and raw byte string literals https://doc.rust-lang.org/stable/reference.html#raw-byte-str...

  r"…", r#"…"#, r##"…"##
  br"…", br#"…"#, br##"…"##

you need to count the #s.

reply

zeven7 4 days ago [-]

JavaScript's not exactly "easy" to parse (though "easy" is subjective), but it is (mostly?) parseable with an LR(1) parser (only requires lookahead of 1 character) -- though it does require reinterpretation.

reply

tgb 4 days ago [-]

I used to use gap! I'd love some examples of explicit choices that were made that allow one-character lookahead.

reply

CJefferson 4 days ago [-]

I can tell you some things which were rejected.

One example of non-one-token lookahead is lambda functions, 'x -> x*2'. This means when you read an identifier, before you check if it is valid in this scope you have to check if the next thing is '->'. If it is, this is a new identifier, if it's not, it's a variable reference and better be in scope.

Now, let's consider multi-argument lambda functions. One possible notation is: (a,b,c) -> a+b+c. However, (a,b,c) is how GAP denotes permutations, and it was decided that parsing a whole permutation as "maybe this is a permutation, maybe it is the beginning of a lambda expression" would be too annoying, and would mean we couldn't print out some errors until we got to the end, and knew what type of object we had.

reply " -- [11]

---

some suggest NOT using regexes for lexing:

"The first tool that beginning compiler writers often reach for is regex. Regex is just the wrong tool for lexing and parsing. Rob Pike explains why reasonably well." [12]

" Regular expressions are hard to write, hard to write well, and can be expensive relative to other technologies. (Even when they are implemented correctly in N*M time, they have significant overheads, especially if they must capture the output.) Lexers, on the other hand, are fairly easy to write correctly (if not as compactly), and very easy to test. Consider finding alphanumeric identifiers. It's not too hard to write the regexp (something like "[a-ZA-Z_][a-ZA-Z_0-9]*"), but really not much harder to write as a simple loop. The performance of the loop, though, will be much higher and will involve much less code under the covers. A regular expression library is a big thing. Using one to parse identifiers is like using a Mack truck to go to the store for milk. And when we want to adjust our lexer to admit other character types, such as Unicode identifiers, and handle normalization, and so on, the hand-written loop can cope easily but the regexp approach will break down.

A similar argument applies to parsing. Using regular expressions to explore the parse state to find the way forward is expensive, overkill, and error-prone. Standard lexing and parsing techniques are so easy to write, so general, and so adaptable there's no reason to use regular expressions. They also result in much faster, safer, and compact implementations.

Another way to look at it is that lexers and parsing are matching statically-defined patterns, but regular expressions' strength is that they provide a way to express patterns dynamically. They're great in text editors and search tools, but when you know at compile time what all the things are you're looking for, regular expressions bring far more generality and flexibility than you need. " [13]
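For comparison with the regexp in the quote above, a hand-written scanning loop for identifiers can look something like this (a Python sketch in the spirit of the quote, not any particular language's lexer):

def scan_identifier(source, pos):
  # if an identifier starts at pos, return (lexeme, end position); else None
  if pos >= len(source) or not (source[pos].isalpha() or source[pos] == '_'):
    return None
  end = pos + 1
  while end < len(source) and (source[end].isalnum() or source[end] == '_'):
    end += 1
  return source[pos:end], end

print(scan_identifier('foo_bar42 + 1', 0))  # ('foo_bar42', 9)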

---

Parsing: a timeline Version 3.0 10 May 2018 Jeffrey Kegler (historical context for Kegler's Marpa)

https://jeffreykegler.github.io/personal/timeline_v3

discussion: https://news.ycombinator.com/item?id=16856694

---

from https://news.ycombinator.com/item?id=16856694 , more discussion on https://jeffreykegler.github.io/personal/timeline_v3

xiaq 39 days ago [-]

I get the quest for more elegant and powerful parsing techniques, but it seems that some very interesting real-world problems - mostly arising in code editors - don't get enough academic love:

These can be very useful for code editors, for syntax highlighting and code completion. Some editors work around this by using a more restricted lexical syntax and avoiding building a complete parse tree (Vim); newer editors - JetBrains and VS Code AFAIK - do have some pretty good techniques, but they don't seem to get a lot of academic treatise.

danharaj 39 days ago [-]

You might like this page:

http://okmij.org/ftp/continuations/differentiating-parsers.html

catpolice 39 days ago [-]

Fast re-parsing is tricky. I'm building a JS IDE for a simple language, and it needs to only re-parse the part of the syntax tree that an edit affects (and only update the minimal number of affected DOM nodes). Writing the (recursive descent) lexer/parser by hand was easy. Being able to stream text diffs to the lexer/parser and have it rebuild only the appropriate part of the AST took... a bit of work. Writing unit tests might kill me.

blake8086 39 days ago [-]

You will probably be interested in this explanation of how xi-editor does it:

https://google.github.io/xi-editor/docs/rope_science_11.html

xiaq 39 days ago [-]

I cannot find the link now, but I remember reading a recent article saying that what VSCode does is basically storing some context information for each line. When input changes, the line is reparsed, and then the context provides enough information on how to modify the parse tree.

catpolice 39 days ago [-]

Yeah, that's pretty similar to what I ended up doing - it comes pretty natural with recursive descent to just store enough of the parser state on relevant nodes that you can restart it at natural breaks.

For my application, I end up having to make expensive async queries about the expressions the user writes. In order to minimize server load/make things responsive/avoid invalidating perfectly good caches, this means that I have to know exactly what to keep and what to throw out with each edit, which means that I never want to replace any node in the AST with an identical one that was needlessly re-parsed if I can help it.

So my lexer does a lot of work trying not to throw out parts of lines that the user hasn't touched, and the parser can figure out what changes are relevant to invalidate my caches.

Oddly the restartable lexer has turned out to be the most frustrating part - selectively diffing parts of two doubly linked lists and snipping them together in the way I ended up having to is one of those things that isn't conceptually very difficult but has the potential for unusually subtle bugs.

chubot 38 days ago [-]

This feels like it's fairly far removed from practice, but this talk addresses reparsing in an algebraic fashion:

Monoidal Parsing - Ed Kmett

https://www.youtube.com/watch?v=Txf7swrcLYs

Promarged 38 days ago [-]

I think these issues were directly tackled on by Roslyn team, that wanted something more than just compiler black box (text goes in, machine code goes out). There are some interesting videos on Channel9 about Roslyn internals (sorry no access now ).

nly 38 days ago [-]

I feel you are right. Most AOT parsing jobs (compiled languages etc.) are well-solved by straightforward, 'inefficient' recursive descent.

Untrusted inputs with complex grammars or realtime parsing like the IDE use-case are vastly more interesting.

---

from https://news.ycombinator.com/item?id=16856694 , more discussion on https://jeffreykegler.github.io/personal/timeline_v3

DonaldPShimoda 39 days ago [-]

This was a good article! It's (reasonably) missing one of my favorite parsing algorithms, though, which is Might's "Parsing with Derivatives" [0, 1]. It's not very efficient compared to some other algorithms, but I think it is very elegant conceptually.

[0] Blog post: http://matt.might.net/articles/parsing-with-derivatives/

[1] Paper: http://matt.might.net/papers/might2011derivatives.pdf

[2] Improvements: https://michaeldadams.org/papers/

fithisux 39 days ago [-]

The author, according to my understanding, says that parsing with derivatives does not improve the state of the art either in terms of complexity or in terms of elegance.

My personal opinion is that Matt Might uses a different avenue to derive the state of the art and even though the original algorithm can be trapped to exponential complexity, the improved analysis and amendments in the 2012 paper save the situation.

I think it is quite an achievement to fight against so many experts who believed that the flaws could not be worked around and re-derive the state of art complexity. At least in the field of pure mathematics this has a very big value.

DonaldPShimoda 39 days ago [-]

You bring up good points! I think the efficiency in real-world applications is still not as good as alternative solutions, but what I really like about PWD is that it's a general parsing algorithm that's incredibly simple to understand (if you have a working knowledge of regular languages). The implementation can be a little more tricky (e.g. if you've never written a fix-point before[0]), but the actual concepts just seem so simple. PWD was actually one of the first parsing algorithms I learned about (odd, I know), and even without much prior knowledge of the field of parsing I was able to grasp the gist of it in a single sitting.

[0] You can also just skip over establishing a fix-point if you guarantee that your language is not left-recursive.

bgorman 39 days ago [-]

The author would be wise to look at clojure.spec, a project which has leveraged parsing with derivatives to revolutionize dynamic language data validation.

cwzwarich 39 days ago [-]

Did any experts actually say that the flaw couldn't be worked around? I thought the argument was just that working around the flaw gives you known algorithms.

---

from https://news.ycombinator.com/item?id=16856694 , more discussion on https://jeffreykegler.github.io/personal/timeline_v3

haberman 39 days ago [-]

The author has assembled an interesting timeline. It leaves a lot out (this is inevitable; the theoretical research around parsing is unbelievably vast). But ultimately the goal of the article appears to be placing their project Marpa in its historical context.

Marpa is one of many attempts to "solve" the problem of parsing. Other notable attempts are PEG (Parsing Expression Grammars), the ALL(*) algorithm from ANTLR (http://www.antlr.org/papers/allstar-techreport.pdf) and GLR. I wrote an article about what makes this such a difficult problem: http://blog.reverberate.org/2013/09/ll-and-lr-in-context-why-parsing-tools.html

---

from https://news.ycombinator.com/item?id=16856694 , more discussion on https://jeffreykegler.github.io/personal/timeline_v3

chrisaycock 39 days ago [-]

I like ANTLR 4's innovations, which are missing from the article.

  This paper introduces the ALL(*) parsing strategy that
  combines the simplicity, efficiency, and predictability of
  conventional top-down LL(k) parsers with the power of a GLR-
  like mechanism to make parsing decisions. The critical
  innovation is to move grammar analysis to parse-time, which
  lets ALL(*) handle any non-left-recursive context-free
  grammar.

https://dl.acm.org/citation.cfm?id=2660202

drfuchs 39 days ago [-]

Non-paywalled: http://www.antlr.org/papers/allstar-techreport.pdf

---

from https://news.ycombinator.com/item?id=16856694 , more discussion on https://jeffreykegler.github.io/personal/timeline_v3

ulrikrasmussen 39 days ago [-]

Also, for a very comprehensive survey on parsing techniques, I can recommend "Parsing Techniques" by Grune and Jacobs: https://dickgrune.com/Books/PTAPG_2nd_Edition/

iovrthoughtthis 39 days ago [-]

How have you totally missed GLL and GLR parsing?!

http://dotat.at/tmp/gll.pdf

undecidabot 39 days ago [-]

One of the most interesting parser libraries I've encountered uses GLL. It's a clojure library called Instaparse [1]. The author had a nice presentation on it [2]. I'm not sure how well it works in practice though. Another great article on the subject [3].

[1] https://github.com/Engelberg/instaparse

[2] https://www.youtube.com/watch?v=b2AUW6psVcE

[3] https://epsil.github.io/gll/

chubot 39 days ago [-]

Who uses them? I know someone who wrote a GLR parser generator, but I've never seen the algorithm used in production.

I've heard of one C++ front end that uses it -- Elkhound -- but the only context I've heard of Elkhound is in GLR parsing! As far as I can tell Clang is now the state of the art. (i.e. Clang probably does everything Elkhound does, but better and faster.)

I have looked at least 30+ parsers for programming languages. I see:

That's about it. I've never encountered notable usages of Earley or GLL or GLR parsing.

housel 39 days ago [-]

The Spoofax Language Workbench http://www.metaborg.org/en/latest/ uses scannerless generalized LR parsing. When language processor development effort is more important than raw parsing speed (e.g. for external domain specific languages), Spoofax's grammar, abstract syntax tree, and semantics tools are exactly the right thing.

miloignis 39 days ago [-]

I don't know if it really counts, but I used RNGLR for the programming language I've been working on, and have been planning on replacing it with a GLL version in a bit. RNGLR was fairly complicated, and I did find the paper a bit hard to read, which might have limited its spread somewhat. When messing with the syntax for my language, it was nice to only have to worry about if my grammar was ambiguous rather than whether it would fit into LL, LR or LALR, and made it easy to change syntax very quickly without changing any actual code.

swift 39 days ago [-]

FWIW, Bison (which I'd expect most people are using rather than the original yacc, though I can't be sure) does support GLR parsing.

I haven't seen GLR used in production either, but my experience with projects I've been involved in is that this is sometimes just due to inertia. Taking full advantage of GLR's support for ambiguity requires reworking both your grammar and your early stages of semantic analysis, and even if people agree that the result might be nicer, nobody has the time or motivation to do all that work.

mncharity 39 days ago [-]

I've used GLR to parse C files in isolation - no lexer hack needed. More generally, at least historically, a lot of wizzy parsing tech has been closely held by companies selling their ability to deal with enterprise legacy code.

eesmith 39 days ago [-]

Tangentially related, the author of this timeline has developed an Earley parsing framework called Marpa. http://savage.net.au/Marpa.html .

It's used in other projects (eg, https://github.com/jddurand/MarpaX-Languages-C-AST ) but the only ones I found which look production ready, in a cursory search, also involved the author.

chubot 38 days ago [-]

Thanks, this is what I'm getting at. A lot of parsing techniques seem to fall into the category of "only people who care about parsing use them."

That's not to say they will never jump the chasm. I'm just a bit conservative in my design choices and I look for algorithms that have been "battle-tested".

eesmith 38 days ago [-]

As a minor commentary, back around 2000 I used Aycock's SPARK parser for Python, which was based on the Earley algorithm. I first learned about the Earley algorithm from Aycock's presentation at the Python conference IPC7. (His use of docstrings to annotate the grammar went on to influence Beazley's PLY parser, which is LALR(1))

Aycock's work with SPARK was one of the influences of Kegler's Marpa. I was surprised to see Aycock mentioned on this timeline as it was the only name where I could say "I met him".

Thus, I can say that for my work, I used the Earley algorithm. However, I don't think it was essential for the parsing I needed, only that it was available.

chubot 38 days ago [-]

Ah that's interesting. I recognize SPARK because it used to be part of the Python distribution, used to parse the DSL that describes Python's AST:

https://eli.thegreenplace.net/2014/06/04/using-asdl-to-descr...

In other words it was a DSL used to implement a DSL used to implement Python -- how meta! But that post describes replacing it with a simple recursive descent parser in Python.

I used ASDL itself (not SPARK) extensively in my shell:

http://www.oilshell.org/blog/tags.html?tag=ASDL#ASDL

But still I would say that counts as a production usage of Earley parsing! Interesting. I don't know of any others.

(On the other hand, the fact that it was replaced with a few hundred lines of Python code means it probably wasn't needed in the first place.)

DonaldPShimoda 39 days ago [-]

Not exactly what you're after, but I've been using Might's "Parsing With Derivatives", which is a general parsing algorithm, for my toy language I'm working on. Links are in my root-level comment above. (I'm on mobile or else I'd just provide the links for you here directly.)

yongjik 39 days ago [-]

I'm no expert, but my (very crude) understanding is that much of modern natural language parsing frameworks (e.g., Google's SyntaxNet?) can be viewed as variations of GLR where we slap a probability onto each possible transition.

nickpsecurity 39 days ago [-]

They do:

http://semanticdesigns.com/Products/DMS/DMSToolkit.html

With all that, what do we need another parsing tech for if it's programming related? ;) GLR was also used for NLP. The Elkhound approach mixes (IIRC!) GLR and LALR to get benefits of both with it easily handling one of the hardest languages out there.

A little weird to leave off the list one of the best techniques around.

Also, silentbicycle countered with it and some others on top of that here:

https://lobste.rs/s/pkfdah/parsing_timeline#c_suirwn

---

from https://news.ycombinator.com/item?id=16856694 , more discussion on https://jeffreykegler.github.io/personal/timeline_v3

danharaj 39 days ago [-]

Misses the subplot of Lambek's categorial grammar and the way it frames parsing as proof search. Pretty cool idea, perhaps underexplored.

haskellandchill 39 days ago [-]

I have some older books and heavy papers on the subject. Any recent lightweight treatments you can link to? Thanks.

danharaj 39 days ago [-]

This paper is fresh in my mind since I recently read it. I liked it: https://www.eecs.harvard.edu/shieber/Biblio/Papers/infer.pdf

We present a system for generating parsers based directly on the metaphor of parsing as deduction. Parsing algorithms can be represented directly as deduction systems, and a single deduction engine can interpret such deduction systems so as to implement the corresponding parser. The method generalizes easily to parsers for augmented phrase structure formalisms, such as definite-clause grammars and other logic grammar formalisms, and has been used for rapid prototyping of parsing algorithms for a variety of formalisms including variants of tree-adjoining grammars, categorial grammars, and lexicalized context-free grammars.

DonbunEf7 39 days ago [-]

I knew Earley parsing before I read this paper. Now I know things about logic and theorem proving. Cool paper, would read again.

---

from https://news.ycombinator.com/item?id=16856694 , more discussion on https://jeffreykegler.github.io/personal/timeline_v3

" 1961: Dijkstra's shunting yard algorithm

In November 1961, Dijkstra publishes the "shunting yard" algorithm[47]. In Norvell's useful classification[48] of operator expression parsers, all earlier parsers have been what he calls "the classic algorithm".

Dijkstra's approach is new. In the classic approach, the number of levels of precedence is hard-coded into the algorithm. Dijkstra's algorithm can handle any number of levels of precedence without a change in its code, and without any effect on its running speed.

The Operator Issue as of 1961

The results of 1961 transformed the Operator Issue. Before ALGOL, parsing operator expressions essentially was parsing. After ALGOL, almost all languages will be block-structured and ad hoc string manipulations are no longer adequate -- the language as a whole requires a serious parsing technique. Parsing operator expressions becomes a side show, or so it seems.

Why not use the algorithms that parse operator expressions for the whole language? Samuelson and Bauer 1959 had suggested exactly that. But, alas, operator expression parsing is not adequate for languages as a whole[49].

Also by 1961, we have BNF. This gives us a useful notation for describing grammars. It allows us to introduce our Basic Operator Grammar (BASIC-OP):

      S ::= E
      E ::= E + T
      E ::= T
      T ::= T * F
      T ::= F
      F ::= number
    

BASIC-OP has two operators, three levels of precedence and left associativity. This was enough to challenge the primitive parsing techniques in use before 1961. Surprisingly, this simple grammar will continue to bedevil mainstream parsing theory for the next half a century.

Recursive descent, it turns out, cannot parse BASIC-OP because it is left recursive. And that is not the end of it. Making addition and multiplication right-associate is unnatural and, as the authors of the IT compiler found out, causes users to revolt. But suppose we try to use this Right-recursive Operator Grammar (RIGHT-OP) anyway:

      S ::= E
      E ::= T + E
      E ::= T
      T ::= F * T
      T ::= F
      F ::= number
    

Recursive descent, without help, cannot parse RIGHT-OP. As of 1961, parsing theory has not developed well enough to state why in precise terms. Suffice it to say for now that RIGHT-OP requires too much lookahead.

But recursive descent does have a huge advantage, one which, despite its severe limitations, will save it from obsolescence time and again. Hand-written recursive descent is essentially calling subroutines. Adding custom modification to recursive descent is very straight-forward.

In addition, while pure recursive descent cannot parse operator expressions, it can recognize them. This means pure recursive descent may not be able to create the parse subtree for an operator expression itself, but it can recognize the expression and hand control over to a specialized operator expression parser. This seems to be what Lucas' 1961 algorithm did, and it is certainly what many other implementations did afterwards. Adding the operator expression subparser makes the implementation only quasi-Chomskyan, but this was a price the profession has been willing to pay.

Alternatively, a recursive descent implementation can parse operator expressions as lists, and add associativity in post-processing. This pushes some of the more important parsing out of the syntactic phase into the semantics but, once again, it seemed that Chomskyan purity had to be thrown overboard if the ship was to stay afloat.

Bottom line: as of 1961 the Operator Issue takes a new form. Because of the Operator Issue, recursive descent is not sufficient for practical grammars -- it must always be part of a hybrid.

In this context, Dijkstra's new 1961 algorithm is a welcome alternative: as an operator expression subparser, it can parse operator expressions faster and in less space. But Dijkstra's algorithm has no more parsing power than the classic operator precedence algorithm -- it does nothing to change the basic tradeoffs.

...

The Operator Issue as of 1965

Knuth 1965 is a significant milestone in the history of operator expression parsing. Knuth specifically addresses the parsing of ALGOL and he zeroes in on its arithmetic expressions as the crucial issue[55]. Knuth suggests a "typical" grammar[56] which is short, but encapsulates the parsing challenges presented by ALGOL's arithmetic expressions:

      S ::= E
      E ::= - T
      E ::= T
      E ::= E - T
      T ::= P
      T ::= T * P
      P ::= a
      P ::= ( E )
    

KNUTH-OP is left-recursive, allows parentheses, has three levels of precedence, and implements both unary and binary minus. Recursive descent cannot parse KNUTH-OP, but it is LR(1). This means that it is well within Knuth's new class of grammars and, by implication, probably within a practical subclass of the LR grammars.

...

The Operator Issue as of 1968

With Knuth 1965 and Lewis and Stearns 1968, we can now restate the problem with recursive descent and operator expressions in precise terms: Recursive descent, in its pure form, is LL(1). Arithmetic operator grammars are not LL(1) -- not even close. In fact, of the grammars BASIC-OP, RIGHT-OP, and KNUTH-OP, none is LL(k) for any k.

Compromises have to be made if we are parsing with recursive descent. As one result of these compromises, truly declarative versions of LL(1) are not used. A pure declarative LL(1) parser generator could be written, but it would not be able to parse arithmetic expressions properly.

As mentioned, a common compromise is to have recursive descent parse arithmetic expressions as lists, and add associativity in post-processing. We are now able to look at this in more detail. An extended BNF grammar to recognize BASIC-OP as a list is as follows:

      S  ::= E
      E  ::= T { TI }*
      TI ::= '+' T
      T  ::= F { FI }*
      FI ::= 'x' F
      F  ::= number
    

In the above {Z}* means "zero or more occurrences of Z". Expanded into pure BNF, and avoiding empty right hand sides, our operator "list" grammar becomes LIST-OP:

      S  ::= E
      E  ::= T TL
      E  ::= T
      TL ::= TI
      TL ::= TI TL
      TI ::= '+' T
      T  ::= F FL
      T  ::= F
      FL ::= FI
      FL ::= FI FL
      FI ::= 'x' F
      F  ::= number
    

LIST-OP is LL(1), and therefore can be parsed directly by recursive descent. Note that in LIST-OP, although associativity must be provided by a post-processor, the grammar enforces precedence.

" -- [14]

---

Link:

---

practicing writing grammars:

http://instaparse-live.matt.is/

docs: https://github.com/Engelberg/instaparse/blob/master/README.md#notation

---

Rob Pike suggests using imperative loops rather than regexps whenever possible, even in lexing:

https://commandcenter.blogspot.com/2011/08/regular-expressions-in-lexing-and.html
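As a rough illustration of that style (my sketch, not Pike's lexer), a hand-written lexer can be a single loop that dispatches on the current character and consumes a whole lexeme with an inner loop; the token kinds and accepted characters here are made up for the example:

    def lex(src):
        tokens = []
        i = 0
        while i < len(src):
            c = src[i]
            if c.isspace():
                i += 1                      # skip whitespace
            elif c.isdigit():
                start = i
                while i < len(src) and src[i].isdigit():
                    i += 1
                tokens.append(('NUMBER', src[start:i]))
            elif c.isalpha() or c == '_':
                start = i
                while i < len(src) and (src[i].isalnum() or src[i] == '_'):
                    i += 1
                tokens.append(('IDENT', src[start:i]))
            elif c in '+-*/()':
                tokens.append(('OP', c))
                i += 1
            else:
                raise SyntaxError(f"unexpected character {c!r} at offset {i}")
        return tokens

    # lex("foo + 42*(bar_1)") ==
    # [('IDENT', 'foo'), ('OP', '+'), ('NUMBER', '42'), ('OP', '*'),
    #  ('OP', '('), ('IDENT', 'bar_1'), ('OP', ')')]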

---

Python's Black formatter, Roslyn, and Swift keep track of comments (more generally, 'trivia') by associating them with nearby syntax nodes:

https://rtpg.co/2021/07/05/comments-ast.html https://lobste.rs/s/5tn4pd/how_black_handles_comments_when_python_s#c_p5m9lx https://github.com/dotnet/roslyn/blob/main/docs/wiki/Roslyn-Overview.md#syntax-trivia https://github.com/apple/swift/tree/main/lib/Syntax#trivia

IntelliJ and rust-analyzer "store whitespace and comments as explicit nodes in the syntax tree" [15]
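A minimal sketch of the trivia idea (the Token/Trivia names here are made up, not Roslyn's or Swift's actual API): each token carries the whitespace and comments around it, so the exact source text can be reproduced from the syntax tree:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Trivia:
        kind: str          # e.g. 'whitespace' or 'comment'
        text: str

    @dataclass
    class Token:
        kind: str
        text: str
        leading: List[Trivia] = field(default_factory=list)
        trailing: List[Trivia] = field(default_factory=list)

        def to_source(self) -> str:
            """Round-trip the token back to the exact source text."""
            return (''.join(t.text for t in self.leading)
                    + self.text
                    + ''.join(t.text for t in self.trailing))

    tok = Token('IDENT', 'x',
                leading=[Trivia('comment', '# counter\n')],
                trailing=[Trivia('whitespace', ' ')])
    assert tok.to_source() == '# counter\nx '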

[16] says that "Swift’s docs are the best source to learn about design of syntax trees for IDE-like tooling."

---

A Regular Expression Matcher -- Code by Rob Pike, Exegesis by Brian Kernighan

https://www.cs.princeton.edu/courses/archive/spr09/cos333/beautiful.html

---

https://siraben.dev/2022/03/01/tree-sitter.html How to write a tree-sitter grammar in an afternoon

---

https://eli.thegreenplace.net/2022/a-faster-lexer-in-go/

performance profiling of Golang's lexer

---

https://www.kmjn.org/notes/operator_precedence_fortran.html

---

coroutine-style lexers appear to be about 4x slower:

https://eli.thegreenplace.net/2022/performance-of-coroutine-style-lexers-in-go/

---

"

duped 4 months ago

Working on tokenization and parsing, there have been two "lights clicking on" moments that I think every dev working on a PL implementation should have:

It's easier to build a parser that doesn't buy into these things, but it's way harder to build tooling and good error messaging if you don't.

jstimpfle 4 months ago

Could you clarify what you mean by "relative" file locations? Relative to containing AST node?

duped 4 months ago

Relative to the preceding token in the stream, in number of lines and columns from the end of the last token (or beginning, but to me that's conceptually confusing). Take a look at the language server protocol specification for semantic tokens with relative positions as a reference.

Using an absolute line/column/index to indicate the location of a token in a file means you need to retokenize a large amount of the file whenever there's an edit. In a streaming parser that responds to live edits, storing the tokens with relative file locations allows you to only update neighbors on insertion or deletion (and also allows for bulk insert/delete). Absolute file locations can be recovered by scanning the token stream - which is why it's important for the tokens to form the leaves of the tree, it guarantees you can always recover the file locations.

Then you can pull some tricks like the Roslyn compiler does, which is to store things like comments, diagnostics, and white space as "trivia" associated with the tokens and very quickly recover the text that created the token stream/AST nodes. That's invaluable for tooling.

jstimpfle 4 months ago

I agree, that's the right way to do it (not required for a batch compiler, though, because there are no edits). Text ropes work that way as well - contents are stored in a tree, and absolute coordinates are computed from relative coordinates using an associative operation (monoid).

duped 4 months ago

I have yet to see a batch compiler that wasn't hacked into some context where it's used as anything but a batch compiler! " -- [17]
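A small sketch of the relative-position scheme described in that thread (field names are mine; the LSP semantic-token encoding is similar in spirit): each token stores only its line/column delta from the end of the previous token, and absolute positions are recovered by a single scan:

    from dataclasses import dataclass

    @dataclass
    class RelToken:
        text: str
        dline: int   # lines skipped since the end of the previous token
        dcol: int    # columns from the previous token's end (same line),
                     # or from column 0 if dline > 0

    def absolute_positions(tokens):
        """Recover (line, col) for each token by scanning the stream."""
        line, col = 0, 0
        out = []
        for t in tokens:
            if t.dline > 0:
                line += t.dline
                col = t.dcol
            else:
                col += t.dcol
            out.append((t.text, line, col))
            col += len(t.text)   # advance past the token itself
        return out

    # Source "let x =\n  1": an edit only invalidates the deltas near it.
    toks = [RelToken('let', 0, 0), RelToken('x', 0, 1),
            RelToken('=', 0, 1), RelToken('1', 1, 2)]
    # absolute_positions(toks) ==
    # [('let', 0, 0), ('x', 0, 4), ('=', 0, 6), ('1', 1, 2)]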

"

---

on relations between grammar classes:

"LL(k) grammars are necessarily LR(k) but they are not necessarily LALR(k). There is an exercise in Appel's Modern Compiler Implementation textbook which provides an example of an LL(1) grammar which is not LALR(1); you can find the grammar transcribed in this answer. That should provide an idea on how to construct examples for k > 1. (Finding LALR(k) grammars which are not LL(k) is trivial: all you need is left recursion.)" -- [18]

---

"Automata theory and language classes are only really useful if you're trying to expand the state of the art (eg inventing treesitter). Even though I write parsers all the time, in practice what I need to remember is a) write recursive descent parsers (like most major language implementations) b) google "pratt parsing" when dealing with operator precedence." -- [19]

---

https://tratt.net/laurie/blog/2023/why_we_need_to_know_lr_and_recursive_descent_parsing_techniques.html https://lobste.rs/s/oq54qt/why_we_need_know_lr_recursive_descent

---

 "
 Rather, you work with what you got and build the best, potentially partial, syntax tree describing the input. The way to do that in an LL-style parser is just a couple of tricks, the first one of which is “if there’s an error, produce a partial node and continue”, and the second one is “when parsing something in a loop, step over errors to find something you can parse”.
 "
 -- https://lobste.rs/s/myyznl/tooling_for_tooling#c_uueu4d
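A toy sketch of those two tricks (mine, not code from the comment or from the tutorial linked below), for a block-of-statements parser: errors become explicit ERROR nodes and parsing continues, and the statement loop steps over junk tokens until it finds something that can start a statement or close the block:

    STMT_START = {'let', 'if', 'return'}

    def parse_block(tokens, pos):
        """block ::= '{' stmt* '}' -- always returns a (possibly partial) node."""
        children = []
        assert tokens[pos] == '{'
        pos += 1
        while pos < len(tokens) and tokens[pos] != '}':
            if tokens[pos] in STMT_START:
                node, pos = parse_stmt(tokens, pos)
                children.append(node)
            else:
                # trick 2: step over errors until something parseable appears
                start = pos
                while pos < len(tokens) and tokens[pos] not in STMT_START | {'}'}:
                    pos += 1
                children.append(('ERROR', tokens[start:pos]))
        if pos < len(tokens):
            pos += 1                 # consume '}'
            return ('BLOCK', children), pos
        # trick 1: missing '}' -- produce a partial node instead of failing
        return ('BLOCK', children + [('ERROR', ['missing }'])]), pos

    def parse_stmt(tokens, pos):
        """Toy statement: a keyword plus everything up to ';' (if present)."""
        start = pos
        while pos < len(tokens) and tokens[pos] not in {';', '}'}:
            pos += 1
        if pos < len(tokens) and tokens[pos] == ';':
            pos += 1
        return ('STMT', tokens[start:pos]), pos

    # parse_block(['{', 'let', 'x', ';', '@@', 'return', 'y', '}'], 0)[0] ==
    # ('BLOCK', [('STMT', ['let', 'x', ';']), ('ERROR', ['@@']),
    #            ('STMT', ['return', 'y'])])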

"I don’t believe error resilient parsing is a solved problem. The only parser generator which does an acceptable job here is tree sitter, because it uses the novel idea of non-deterministically trying various parsers and picking the best one." -- https://lobste.rs/s/myyznl/tooling_for_tooling#c_nryvsf

---

great tutorial for real-world parsing:

https://matklad.github.io/2023/05/21/resilient-ll-parsing-tutorial.html

---

https://rust-analyzer.github.io/blog/2020/10/24/introducing-ungrammar.html

---

" ~ matklad 11 hours ago

link flag
    IntelliJ

So this post essentially describes how IJ’s PsiBuilder? works:

https://github.com/JetBrains/intellij-community/blob/idea/231.8109.175/platform/core-api/src/com/intellij/lang/PsiBuilder.java

matklad 2 hours ago

link flag

I mean that the “Resilient LL Parsing” article describes how PsiBuilder? works. It’s not exactly that, but is very close (eg, mark and precede stuff) "

https://matklad.github.io/2023/05/21/resilient-ll-parsing-tutorial.html

---

[20] recommends designing languages to be LR(1)

https://lobste.rs/s/ri4eeu/which_parsing_approach_2020 https://lobste.rs/s/9pcqys/which_parsing_approach

(note however that most other stuff I've seen recommends a hand-rolled recursive descent parser, combined with Pratt operator precedence/precedence climbing, and LL(1) overall except for the operator-precedence sub-parser; see https://lobste.rs/s/9pcqys/which_parsing_approach#c_1qw5mu ; some people prefer PEG parsing)

https://rust-analyzer.github.io//blog/2020/09/16/challeging-LR-parsing.html

---

" matklad 4 days ago

FWIW, my experience is that slices are not that useful for a hand-written parser.

It usually is most convenient to code the parser against the API of roughly the following shape:

    impl TokenStream {
        fn lookahead(&self, n: usize) -> Token;
        fn advance(&mut self);
    }

Curiously, stdio’s FILE with its ungetc is actually roughly this shape (if your token is char, and if you only need one token of lookahead).

Naturally, you want bounds checking to happen somewhere, and slices can help with that when implementing TokenStream, but, for the actual parsing itself, you don't actually want to interact with slices much, because you don't need to, and because hard-coding slices actually reduces the parser's flexibility, as the underlying data source might as well be a rope or something else non-contiguous.

" -- https://lobste.rs/s/nbcn3g/simply_parse_c#c_neakfx

---

Pratt parsing, and shunting yard parsing:

https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing https://matklad.github.io/2020/04/15/from-pratt-to-dijkstra.html

https://www.reddit.com/r/oilshell/comments/5l70p7/pratt_parsing_and_precedence_climbing_are_the/ https://en.wikipedia.org/wiki/Shunting_yard_algorithm
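A minimal Pratt / precedence-climbing sketch (mine, not the code from the posts above): each binary operator gets a binding power, and a single loop keeps extending the left-hand side while the next operator binds at least as tightly as the caller's minimum:

    BINDING_POWER = {'+': 1, '-': 1, '*': 2, '/': 2}

    def parse_expr(tokens, pos=0, min_bp=0):
        """Parse a left-associative binary expression; returns (tree, new_pos)."""
        lhs = tokens[pos]            # assume a number or identifier token
        pos += 1
        while pos < len(tokens):
            op = tokens[pos]
            bp = BINDING_POWER.get(op)
            if bp is None or bp < min_bp:
                break
            rhs, pos = parse_expr(tokens, pos + 1, bp + 1)  # +1 => left associativity
            lhs = (op, lhs, rhs)
        return lhs, pos

    tree, _ = parse_expr(['1', '+', '2', '*', '3', '+', '4'])
    # tree == ('+', ('+', '1', ('*', '2', '3')), '4')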

---

Building a high performance JSON parser

---