31

As a side-project, I'm writing a language using Python. I started by using a flex/bison clone called Ply, but am coming up against the limits of what I can express with that style of grammar, and I'm not interested in hacking up my language because of an impedance mismatch with the tool. Therefore, I'm not averse to writing my own.

So what's the most powerful type of parser? Citations to papers (as well as more introductory articles) would be welcome.

(I know that 'powerful' is not precisely defined, but let's be a little loose with it and see where the answers go)

Hermann Gruber
  • 6,470
  • 1
  • 35
  • 58
Paul Biggar
  • 420
  • 4
  • 8
  • 1
    Downvoted: not research level. – Warren Schudy Oct 23 '10 at 01:07
  • 3
    @Warren: I checked the FAQ before asking - that doesn't seem to be a requirement. – Paul Biggar Oct 23 '10 at 01:30
  • 1
    there are actually two FAQs, one for the general site and one for CStheory. The CStheory one indicates that questions that can be answered by e.g. reading Wikipedia are off-topic; see "What kind of questions are too basic?" in http://meta.cstheory.stackexchange.com/questions/225/official-faq-for-theoretical-computer-science . – Warren Schudy Oct 23 '10 at 01:45
  • 1
    @Warren: That's the FAQ I read. I had read wikipedia, but I felt this needed actual insight. – Paul Biggar Oct 23 '10 at 01:55
  • 1
    Do you mean parsers in production or theoretical ones, i.e. ones that cover grammar types other than CFG? – Raphael Oct 23 '10 at 08:44
  • @Warren There has been much research on this ... only the last 20 years are missing from the answers. Much of it is in computational linguistics, though it would qualify for TCS. The linguists on Stack Exchange do a different kind of stuff, I believe. – babou Jun 12 '13 at 22:28

4 Answers

35

A grammar is usually taken to be a context-free grammar - a precise definition is given on the Wikipedia page, but it works the same as it does in PLY, which is based on Bison, which is in turn based on yacc.

It says here that PLY uses an LALR parser. This is essentially an LR parser in which the parse tables are condensed, possibly introducing parsing conflicts and thereby losing some of the expressiveness of LR grammars (i.e., the context-free grammars that an LR parser can parse). If you want to know about the limitations of this particular branch of parsers and those of other parsers, an overview of all kinds of parsing techniques (LL, LR and others) is given here.
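
To make that concrete, here is a toy sketch of my own (assuming only that the ply package is installed; the grammar is hypothetical and not taken from any of the links). An ambiguous expression grammar with no precedence declarations makes yacc.yacc() report shift/reduce conflicts when it builds the LALR(1) tables, which is exactly the kind of friction described above:

    # Toy example: an ambiguous expression grammar with no precedence
    # declarations. PLY builds LALR(1) tables, warns about shift/reduce
    # conflicts, and resolves them by shifting by default.
    import ply.lex as lex
    import ply.yacc as yacc

    tokens = ('NUM', 'PLUS', 'TIMES')
    t_PLUS = r'\+'
    t_TIMES = r'\*'
    t_NUM = r'\d+'
    t_ignore = ' '

    def t_error(t):
        t.lexer.skip(1)

    def p_expr(p):
        '''expr : expr PLUS expr
                | expr TIMES expr
                | NUM'''

    def p_error(p):
        pass

    lexer = lex.lex()
    parser = yacc.yacc()   # PLY warns about shift/reduce conflicts here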

To answer your question: there exist parsing algorithms capable of parsing any context-free language, even if the grammar is ambiguous (i.e., there is more than one way to interpret the input):

The first such algorithm was the CYK algorithm, which unfortunately has a running time of $O(n^3 |G|)$, where $n$ is the length of the input string and $|G|$ is the size of the grammar, and is therefore impractical for parsing languages.
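
For illustration, a recognizer-only sketch of CYK in Python (my own toy code, not from a library; it assumes the grammar has already been converted to Chomsky normal form, which the algorithm requires):

    # CYK recognition for a grammar in Chomsky normal form:
    # rules are either A -> B C (binary) or A -> 'a' (terminal).
    def cyk(word, terminal_rules, binary_rules, start):
        """terminal_rules: dict terminal -> set of nonterminals deriving it
           binary_rules:   dict (B, C)   -> set of nonterminals A with A -> B C"""
        n = len(word)
        if n == 0:
            return False  # this sketch ignores the empty word
        # table[(i, j)] = set of nonterminals deriving the substring word[i:j]
        table = {(i, i + 1): set(terminal_rules.get(a, ())) for i, a in enumerate(word)}
        for length in range(2, n + 1):
            for i in range(n - length + 1):
                j = i + length
                table[(i, j)] = {A
                                 for k in range(i + 1, j)
                                 for B in table[(i, k)]
                                 for C in table[(k, j)]
                                 for A in binary_rules.get((B, C), ())}
        return start in table[(0, n)]

    # Example: { a^n b^n | n >= 1 } in CNF (S -> A X | A B, X -> S B, A -> a, B -> b)
    terminal_rules = {'a': {'A'}, 'b': {'B'}}
    binary_rules = {('A', 'X'): {'S'}, ('A', 'B'): {'S'}, ('S', 'B'): {'X'}}
    print(cyk("aabb", terminal_rules, binary_rules, 'S'))   # True
    print(cyk("aab", terminal_rules, binary_rules, 'S'))    # False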

The second algorithm is the Earley algorithm. This algorithm is also capable of parsing any context-free grammar. Although it needs $O(n^3)$ time in the worst case (for ambiguous grammars), it needs only $O(n^2)$ time for unambiguous grammars. In addition, it apparently runs in linear time for most LR grammars and works particularly well on left-recursive grammars.
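
To see how small such a parser can be, here is a recognizer-only sketch of my own (unoptimized, and not the implementation from the paper mentioned below; the grammar encoding is just an assumption of this sketch):

    # Earley recognition. Grammar format assumed here: a dict mapping each
    # nonterminal to a list of right-hand sides, each a tuple of symbols;
    # a symbol is a nonterminal (a key of the dict) or a terminal character.
    def earley(tokens, grammar, start):
        # An item (lhs, rhs, dot, origin) means: we are parsing lhs -> rhs,
        # have matched rhs[:dot], and started at input position 'origin'.
        chart = [set() for _ in range(len(tokens) + 1)]
        chart[0] = {(start, tuple(rhs), 0, 0) for rhs in grammar[start]}
        for i in range(len(tokens) + 1):
            changed = True
            while changed:                        # predict/complete to a fixed point
                changed = False
                for lhs, rhs, dot, origin in list(chart[i]):
                    if dot < len(rhs) and rhs[dot] in grammar:        # predictor
                        new = {(rhs[dot], tuple(p), 0, i) for p in grammar[rhs[dot]]}
                    elif dot == len(rhs):                             # completer
                        new = {(l2, r2, d2 + 1, o2)
                               for l2, r2, d2, o2 in chart[origin]
                               if d2 < len(r2) and r2[d2] == lhs}
                    else:
                        new = set()
                    if not new <= chart[i]:
                        chart[i] |= new
                        changed = True
            if i < len(tokens):                                       # scanner
                chart[i + 1] |= {(lhs, rhs, dot + 1, origin)
                                 for lhs, rhs, dot, origin in chart[i]
                                 if dot < len(rhs) and rhs[dot] not in grammar
                                 and rhs[dot] == tokens[i]}
        return any(lhs == start and dot == len(rhs) and origin == 0
                   for lhs, rhs, dot, origin in chart[len(tokens)])

    # An ambiguous grammar (E -> E + E | n), handled without any conflict trouble:
    grammar = {'E': [('E', '+', 'E'), ('n',)]}
    print(earley("n+n+n", grammar, 'E'))   # True
    print(earley("n+", grammar, 'E'))      # False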

Here you can find a paper discussing a practical implementation of (an adaptation of) the Earley algorithm. They conclude: "Given the generality of Earley parsing compared to LALR(1) parsing [which is roughly what PLY does], and considering that even PEP’s [their implementation of Earley's algorithm] worst time would not be noticeable by a user, this is an excellent result".

The last type of parser is the GLR parser. This is a generalised version of LR parsing, capable of parsing any context-free language.

A mature implementation of GLR is ASF+SDF. Bison can also generate a GLR parser, though its implementation is slightly different from the 'standard' GLR algorithm. The Elkhound algorithm is a GLR/LALR hybrid: it uses LALR when possible and GLR when needed, in order to be both fast and capable of parsing any grammar.

Beyond context-free grammars there are context-sensitive grammars, but these are in general hard to parse and don't add that much expressiveness: you can do more with them, but for most applications the extra power is not relevant, unless you're parsing a natural language.

As the final step there are unrestricted grammars. At this point the grammar is Turing-complete, so there is no bound one can give on how long it will take to parse a given input, which is undesirable for most parsing applications. The extra power is almost never needed. If you do want to use all that power, there is the language machine available.

Lastly, implementing your own parser-generator is not a trivial affair, in particular to get it to be fast. I've personally just finished making my own version of flex (the lexer generator), and while this seemed like an exercise in relatively simple algorithmic problems, it became quite complex to get right, in particular when I tried to support Unicode. Consider using an already existing implementation instead of writing your own.

user2373145
  • 103
  • 2
Alex ten Brink
  • 4,680
  • 25
  • 48
  • 1
    Excellent answer!! Any thoughts on how PEGs fit in? – Paul Biggar Oct 23 '10 at 00:44
  • 2
    PEGs are 'different' than CFGs: there are CFGs that are not PEGs and vice versa. I refer you to here: http://stackoverflow.com/questions/1857022/limitations-of-peg-grammar-parser-generators. – Alex ten Brink Oct 23 '10 at 01:15
  • This might also be of interest: http://blogs.ethz.ch/copton/2009/07/08/parsing-expression-grammars/. – Alex ten Brink Oct 23 '10 at 01:18
  • 1
    Actually, most common parser generators (yacc, ANTLR, Bison) allow non-CF concepts via predicates or arbitrary code that checks whether a rule can be applied or decides precedence. This can be used mainly to implement static semantics, since the basic syntax remains in essence context-free. – Raphael Oct 23 '10 at 08:48
  • What about recursive languages? Is there any real power for languages in them compared to context-sensitive? thanks :) – Lasse Espeholt Oct 23 '10 at 20:28
  • 1
    Recursive languages are precisely the languages decidable by always-halting Turing Machines. Any context sensitive language is therefore also recursive, but since context sensitive languages are decidable in exponential time, there are recursive languages that are not context sensitive. Unrestricted grammars are even more powerful: the halting problem can be described by an unrestricted grammar, but is not a recursive language. – Alex ten Brink Oct 24 '10 at 00:19
  • Yup, I know the range of languages, but I was wondering what recursive languages bring to the table. An example of a recursive language which is not context-sensitive :)? – Lasse Espeholt Oct 24 '10 at 07:43
  • Since deciding membership for a context-sensitive grammar is in PSPACE, any language in EXPSPACE or larger classes is expressible by a recursive grammar but not by a context-sensitive grammar. An example is the language of all true propositions in Presburger arithmetic (http://en.wikipedia.org/wiki/Presburger_arithmetic). Since we suspect that PSPACE $\neq$ EXPTIME, any language in EXPTIME or above would be sufficient. The extra power is almost never needed though - it would take too long to parse anyway. – Alex ten Brink Oct 25 '10 at 00:54
  • What makes you believe that GLR parsing is more effective in time or space than simpler versions of tabular parsing not using the LR construction (there are many)? This is not an idle question. (I am not talking of complexity class, which is more or less the same for all). – babou Jun 12 '13 at 21:30
  • @babou: I assume you are talking about algorithms like CYK and Earley. CYK is always cubic, so it is not very useful on near-deterministic grammars that occur in practice. If you take Earley and optimize it, you get GLL, a relatively new generalized parsing algorithm, that is usually even faster than GLR. The original Earley algorithm is not that fast (talking about constant factors) because it often needs to do a linear search for things that GLL already has a pointer to. – Alex ten Brink Jun 13 '13 at 08:17
  • @Alex No, I am not talking about CYK, obviously, nor Earley. I am talking more abstractly about the effect of predictiveness on the efficiency of this family of parsing algorithms. That would actually make a nice question by itself. Also, parsing is supposed to produce parse trees (or forests). Would you think that the result is not impacted by the chosen algorithm? These are not idle questions when you extend parsing techniques beyond the standard CF. – babou Jun 13 '13 at 09:13
  • @AlextenBrink do you know that GLL is usually faster, and for what grammars? It is usually faster on worst case grammars, but for PL grammars we have yet to find out. Do you have already more information on this? – Jurgen Vinju Nov 25 '13 at 09:42
  • @jurgenv This paper by Billot and Lang may shed some light on this. They built a system that implements various general CF parsers in a uniform way, for systematic comparison rather than for production use. The analysis of their effectiveness on various grammars gives hints about some of the reasons for the relative efficiency of different parsing strategies. – babou Nov 14 '14 at 17:23
15

A paper at ICFP 2010 this year, Total Parser Combinators, describes a provably terminating parser combinator library and also establishes that in this library "parser combinators are as expressive as possible" given that the parser is guaranteed to terminate. Unfortunately I don't remember the explanation the author gave for what "as expressive as possible" means, but it certainly seems relevant to your question about "power."
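
In case the term is unfamiliar: a parser combinator library treats parsers as ordinary values that are glued together with functions such as sequencing and choice. Below is a throwaway Python sketch of my own illustrating only the general idea; it is not the Agda library from the paper and has none of its termination guarantees.

    # A parser here is a function from (text, position) to a list of
    # (result, new_position) pairs; an empty list means failure.

    def char(c):
        return lambda s, i: [(c, i + 1)] if i < len(s) and s[i] == c else []

    def alt(p, q):            # nondeterministic choice: collect both parses
        return lambda s, i: p(s, i) + q(s, i)

    def seq(p, q, combine):   # run p, then run q on what remains
        def parse(s, i):
            return [(combine(a, b), k) for a, j in p(s, i) for b, k in q(s, j)]
        return parse

    # Example: the non-regular language { a^n b^n }: S -> '' | 'a' S 'b'
    def S(s, i):
        body = seq(char('a'), seq(S, char('b'), lambda m, b: m + b),
                   lambda a, rest: a + rest)
        return alt(lambda s2, i2: [('', i2)], body)(s, i)

    print([r for r, j in S("aabb", 0) if j == len("aabb")])   # ['aabb']
    print([r for r, j in S("aab", 0) if j == len("aab")])     # []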

Rob Simmons
  • 2,396
  • 13
  • 24
  • 1
    I have a car that does not pollute. Actually, it does not move either... So the question is: what kind of language is parsed by this library? Not meaning that this work is not interesting, of course. – babou Jun 13 '13 at 09:49
2

If you want to go beyond context-free grammars for parsing programming languages, but still parse in polynomial time, you can resort to parsing expression grammars or Boolean grammars - the latter are also available in LL and LR flavors (see here). In formal language theory, the powerful yet linear-time recognizable Church-Rosser languages are also studied, but I am not aware of any implemented parser generators for them.
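
To give a flavour of how PEGs differ from context-free grammars (a toy sketch of my own, not code from any of the linked tools): choice in a PEG is ordered and committed, so the order of alternatives changes what is recognized.

    # A toy sketch of PEG-style ordered choice (not a full PEG/packrat parser).
    # A parser maps (text, position) to the new position on success, or None.

    def lit(s):
        return lambda text, i: i + len(s) if text.startswith(s, i) else None

    def ordered_choice(*parsers):
        def parse(text, i):
            for p in parsers:
                j = p(text, i)
                if j is not None:
                    return j      # commit: later alternatives are never tried
            return None
        return parse

    # As a CFG, A -> "ab" | "a" and A -> "a" | "ab" describe the same language;
    # as a PEG, the order matters, because the first matching alternative wins.
    A = ordered_choice(lit("a"), lit("ab"))
    print(A("ab", 0))   # 1 -- only "a" is consumed; "ab" is never tried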

In natural language processing, tastes are different: for instance, dealing with ambiguity (including inherent ambiguity) and free word order plays a very prominent role. Here the keywords mildly context-sensitive languages and restart automata might help you start reading.

Hermann Gruber
  • 6,470
  • 1
  • 35
  • 58
  • 1
    Considering the way the question was asked, and the complaint that CF is too constraining, your answer is clearly the best. So it goes ... – babou Jun 12 '13 at 22:42
  • 1
    The link to Alexander Okhotin doesn't work now - the new website seems to be here: https://users.math-cs.spbu.ru/~okhotin/ – Martin Aug 12 '22 at 12:48
0

Parser Generator Tools:

ANTLR is very good. Alternatively, you can have a look at JavaCC.

  • I am not a computer scientist (despite what my degree says ;), so my words might weigh lightly here. I agree with Sazzad - ANTLR is a very powerful tool. It is very complete, and I have yet to find any problems with the parser generator (LL(k) if I recall correctly). On the other hand, I have yet to implement a compiler for a somewhat complex grammar... – Jörgen Sigvardsson Dec 21 '10 at 19:54
  • 5
    I think you're missing the point of the question, and perhaps the entire site. It's about parsing theory, not about implementations and tools. – Paul Biggar Dec 22 '10 at 06:41