I'm writing a parser for a niche markup language similar to Markdown. To keep things simple, I'll use Markdown as the example throughout.
I've done some research and learned that the conventional way to parse a Markdown document into an AST consists of two steps:
- Parse the document into an array of tokens.
- Transform the tokens into an AST.
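For context, here is a minimal sketch of what I understand the second step to look like; the token shape (a regex-match-like object with an `index`) and the node names (`root`, `text`, `italic`, `code`) are my own invention, not from any tutorial:

```javascript
// Minimal sketch of step 2: turning a flat token list into an AST.
// Assumes each token is a regex-match-like value: token[0] is the
// matched text and token.index is its position in the source.
function tokensToAst(tokens, sourceText) {
  // Sort by position so nodes appear in document order.
  const sorted = [...tokens].sort((a, b) => a.index - b.index);
  const children = [];
  let cursor = 0;
  for (const match of sorted) {
    // Plain text between tokens becomes a text node.
    if (match.index > cursor) {
      children.push({ type: 'text', value: sourceText.slice(cursor, match.index) });
    }
    const raw = match[0];
    const type = raw.startsWith('*') ? 'italic' : 'code';
    // Strip the single-character delimiters to get the inner value.
    children.push({ type, value: raw.slice(1, -1) });
    cursor = match.index + raw.length;
  }
  if (cursor < sourceText.length) {
    children.push({ type: 'text', value: sourceText.slice(cursor) });
  }
  return { type: 'root', children };
}
```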
I'm having a problem with the first step. Most tutorials online use regexes for it. Here is an example written in JavaScript:
```javascript
function getTokensByRule(text, rule) {
  const tokens = [];
  let match;
  // A while loop (instead of do...while) avoids pushing null
  // when a rule has no match at all.
  while ((match = rule.exec(text)) !== null) {
    tokens.push(match);
  }
  return tokens;
}

function getTokens(text) {
  // overly simplified rules, there may be a lot more in reality
  const rules = {
    italic: /\*[^*]*\*/g,
    code: /`[^`]*`/g,
  };
  const tokens = [];
  for (const ruleName in rules) {
    const rule = rules[ruleName];
    tokens.push(...getTokensByRule(text, rule));
  }
  return tokens;
}

const tokens = getTokens("some `markdown` *texts*");
console.log(tokens);
// output:
// [
//   [
//     '*texts*',
//     index: 16,
//     input: 'some `markdown` *texts*',
//     groups: undefined
//   ],
//   [
//     '`markdown`',
//     index: 5,
//     input: 'some `markdown` *texts*',
//     groups: undefined
//   ]
// ]
```
It works, but I see a problem with it: it is likely slow, because the document is scanned once for every rule. The time complexity is roughly O(n * r), where n is the length of the document and r is the number of regex rules.
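One alternative I've experimented with (I'm not sure whether real parsers do this) is combining all the rules into a single alternation with named groups, so the text is scanned only once:

```javascript
// Single-pass variant: one combined regex with named groups,
// so the text is scanned once instead of once per rule.
// Same toy rule set as above.
const combined = /(?<italic>\*[^*]*\*)|(?<code>`[^`]*`)/g;

function getTokensSinglePass(text) {
  const tokens = [];
  for (const match of text.matchAll(combined)) {
    // Exactly one named group is defined per match; its name
    // tells us which rule fired.
    const type = Object.keys(match.groups).find(
      (name) => match.groups[name] !== undefined
    );
    tokens.push({ type, value: match[0], index: match.index });
  }
  return tokens;
}
```

A side benefit is that the tokens come out already in document order, though the engine may still try every alternative at each position.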
So my question is: is this the conventional way to parse Markdown? Is there a faster or better way to do it?
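For reference, here is roughly what I imagine a regex-free, single-pass scanner might look like. This is my own simplified sketch (the names and token shape are invented), not something I found in a tutorial:

```javascript
// Hand-rolled single-pass scanner: walk the string character by
// character, treating '*' and '`' as delimiters when a closing
// delimiter exists, and collecting everything else as plain text.
function scan(text) {
  const tokens = [];
  let i = 0;
  let textStart = 0;
  const flushText = (end) => {
    if (end > textStart) {
      tokens.push({ type: 'text', value: text.slice(textStart, end) });
    }
  };
  while (i < text.length) {
    const ch = text[i];
    if (ch === '*' || ch === '`') {
      const close = text.indexOf(ch, i + 1);
      if (close !== -1) {
        flushText(i);
        tokens.push({
          type: ch === '*' ? 'italic' : 'code',
          value: text.slice(i + 1, close),
          index: i,
        });
        i = close + 1;
        textStart = i;
        continue;
      }
    }
    i++;
  }
  flushText(text.length);
  return tokens;
}
```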