-1

I'm writing a parser for a niche markup language like Markdown. To make things easier to understand, I'll just use Markdown as an example.

I've done some research and learned that the conventional way to parse a markdown document into an AST consists of the following steps:

  1. Parse the document into an array of tokens.
  2. Transform the tokens into an AST.

I'm having a problem with the first step. Most tutorials online use regex to do it. here is an example written in javascript:

function getTokensByRule(text, rule) {
    const tokens = [];
    let match = rule.exec(text);
    do {
        tokens.push(match);
    } while((match = rule.exec(text)) !== null);
    return tokens;
}

function getTokens(text) {

    // overly simplified rules, there may be a lot more in reality
    const rules = {
        italic: /\*[^*]*\*/g,
        code: /`[^`]*`/g,
    };

    const tokens = [];
    for (const ruleName in rules) {
        const rule = rules[ruleName];
        tokens.push(...getTokensByRule(text, rule))
    }
    return tokens;
}

const tokens = getTokens("some `markdown` *texts*");
console.log(tokens);

// output:
// [
//     [
//         '*texts*',
//         index: 16,
//         input: 'some `markdown` *texts*',
//         groups: undefined
//     ],
//     [
//         '`markdown`',
//         index: 5,
//         input: 'some `markdown` *texts*',
//         groups: undefined
//     ]
// ]

It works, but I see a problem in it: this would probably be slow. It parses the document against each regex rule. The time complexity would be close to O(n * r), where n is the length of the document and r is the number of regex rules.

So my question is, is this the conventional way to parse the markdown language? Is there any faster or better way to achieve it?

Searene
  • 22,811
  • 37
  • 122
  • 172
  • Here's your answer (from Software Engineering) - https://softwareengineering.stackexchange.com/questions/149379/is-there-a-more-modern-program-than-lex-or-yacc-which-does-not-require-jvm – ControlAltDel Jun 03 '22 at 02:22

0 Answers0