I'm playing around since a while with building a little open source project based on the output of facebooks pfff tool, that can parse multiple programming languages and output some kind of unified AST.
For example, given a simple PHP file like this:
<?php
class FooBar {
public function functionA(string $arg1)
{
$localVar = 1;
print $undefinedVar;
print $arg1;
}
...
The output of the tool is something like this:
[ClassDef(
{c_type=ClassRegular(i_1); c_name=Name(("FooBar", i_2)); c_tparams=None;
c_extends=None; c_implements=None; c_attrs=None;
c_body=(i_3,
[Method(
{f_tok=i_4; f_type=MethodRegular; f_attrs=None;
f_modifiers=[(Public, i_5)]; f_ref=None;
f_name=Name(("functionA", i_6)); f_tparams=None;
f_params=(i_7,
[Left3(
{p_variadic=None; p_attrs=None; p_modifier=None;
p_soft_type=None;
p_type=Some(Hint(
XName([QI(Name(("string", i_8)))]),
None));
p_ref=None; p_name=DName(("arg1", i_9));
p_default=None; })],
i_10);
f_return_type=None;
f_body=(i_11,
[ExprStmt(
Assign(
IdVar(DName(("localVar", i_12)), Ref(NoScope)),
i_13, Sc(C(Int(("1", i_14))))), i_15);
ExprStmt(
Print(i_16,
IdVar(DName(("undefinedVar", i_17)),
Ref(NoScope))), i_18);
ExprStmt(
Print(i_19,
IdVar(DName(("arg1", i_20)), Ref(NoScope))),
i_21)],
i_22);
});
...
Having a look at the output I thought it would be quite straightforward to create parser that can give me all class names or all methods or all functions etc.
I wrote a very ugly first parser that iterates over the output as a string, looks for keywords, like ClassDef, then tries to match the next brace and so on. I managed to extract some stuff and it works ok, but the code got convoluted and ugly very fast, to the point where I just don't want to work with it anymore and reconsider that approach. Adding more features to extract more data from the AST is hardly possible anymore.
So I did some more research and found basically two approaches:
- Roll my own recursive descent parser, define a formal grammar for it. I read through the wiki page for it, but had a very hard time to get going as I don't have much experience with creating parsers. I read through some more tutorials, but I'm still not so sure how to start on parsing a complex string like the above.
- Use some kind of parser framework like ANTLR. Apart from the fact that I would like to not use Java im my project, I'm also not so sure if that's the easier approach.
My ideal solution would be to either have an in-memory structure that allows me to query the parsed string like "give me all methods with classes" or so or convert the output from above to XML.
Currently I'm quite stuck at how to approach parsing the above structure. It would be much appreciated if someone with more experience in creating parsers could give me a hint in the right direction, maybe I'm also making it too complicated, but defining a grammar and then maybe writing the parser for it for the above structure seems like a big undertaking for me and I don't really know where to start.