I'm writing a parser for a niche markup language similar to Markdown. To keep things simple, I'll use Markdown as the example throughout.
I've done some research and learned that the conventional way to parse a Markdown document into an AST consists of two steps:
- Parse the document into an array of tokens.
- Transform the tokens into an AST.
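For context, here is a minimal sketch of what I understand the second step to look like; the token shape (a regex-match-like object with an `index`) and the node names (`root`, `text`, `italic`, `code`) are my own invention, not from any tutorial:

```javascript
// Minimal sketch of step 2: turning a flat token list into an AST.
// Assumes each token is a regex-match-like value: token[0] is the
// matched text and token.index is its position in the source.
function tokensToAst(tokens, sourceText) {
  // Sort by position so nodes appear in document order.
  const sorted = [...tokens].sort((a, b) => a.index - b.index);
  const children = [];
  let cursor = 0;
  for (const match of sorted) {
    // Plain text between tokens becomes a text node.
    if (match.index > cursor) {
      children.push({ type: 'text', value: sourceText.slice(cursor, match.index) });
    }
    const raw = match[0];
    const type = raw.startsWith('*') ? 'italic' : 'code';
    // Strip the single-character delimiters to get the inner value.
    children.push({ type, value: raw.slice(1, -1) });
    cursor = match.index + raw.length;
  }
  if (cursor < sourceText.length) {
    children.push({ type: 'text', value: sourceText.slice(cursor) });
  }
  return { type: 'root', children };
}
```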
I'm having a problem with the first step. Most tutorials online use regexes for it. Here is an example written in JavaScript:
```javascript
function getTokensByRule(text, rule) {
  const tokens = [];
  let match;
  // A while loop (instead of do...while) avoids pushing null
  // when a rule has no match at all.
  while ((match = rule.exec(text)) !== null) {
    tokens.push(match);
  }
  return tokens;
}

function getTokens(text) {
  // overly simplified rules, there may be a lot more in reality
  const rules = {
    italic: /\*[^*]*\*/g,
    code: /`[^`]*`/g,
  };
  const tokens = [];
  for (const ruleName in rules) {
    const rule = rules[ruleName];
    tokens.push(...getTokensByRule(text, rule));
  }
  return tokens;
}

const tokens = getTokens("some `markdown` *texts*");
console.log(tokens);
// output:
// [
//   [
//     '*texts*',
//     index: 16,
//     input: 'some `markdown` *texts*',
//     groups: undefined
//   ],
//   [
//     '`markdown`',
//     index: 5,
//     input: 'some `markdown` *texts*',
//     groups: undefined
//   ]
// ]
```
It works, but I see a problem with it: it is likely slow, because the document is scanned once for every rule. The time complexity is roughly O(n * r), where n is the length of the document and r is the number of regex rules.
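One alternative I've experimented with (I'm not sure whether real parsers do this) is combining all the rules into a single alternation with named groups, so the text is scanned only once:

```javascript
// Single-pass variant: one combined regex with named groups,
// so the text is scanned once instead of once per rule.
// Same toy rule set as above.
const combined = /(?<italic>\*[^*]*\*)|(?<code>`[^`]*`)/g;

function getTokensSinglePass(text) {
  const tokens = [];
  for (const match of text.matchAll(combined)) {
    // Exactly one named group is defined per match; its name
    // tells us which rule fired.
    const type = Object.keys(match.groups).find(
      (name) => match.groups[name] !== undefined
    );
    tokens.push({ type, value: match[0], index: match.index });
  }
  return tokens;
}
```

A side benefit is that the tokens come out already in document order, though the engine may still try every alternative at each position.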
So my question is: is this the conventional way to parse Markdown? Is there a faster or better way to do it?
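For reference, here is roughly what I imagine a regex-free, single-pass scanner might look like. This is my own simplified sketch (the names and token shape are invented), not something I found in a tutorial:

```javascript
// Hand-rolled single-pass scanner: walk the string character by
// character, treating '*' and '`' as delimiters when a closing
// delimiter exists, and collecting everything else as plain text.
function scan(text) {
  const tokens = [];
  let i = 0;
  let textStart = 0;
  const flushText = (end) => {
    if (end > textStart) {
      tokens.push({ type: 'text', value: text.slice(textStart, end) });
    }
  };
  while (i < text.length) {
    const ch = text[i];
    if (ch === '*' || ch === '`') {
      const close = text.indexOf(ch, i + 1);
      if (close !== -1) {
        flushText(i);
        tokens.push({
          type: ch === '*' ? 'italic' : 'code',
          value: text.slice(i + 1, close),
          index: i,
        });
        i = close + 1;
        textStart = i;
        continue;
      }
    }
    i++;
  }
  flushText(text.length);
  return tokens;
}
```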