Can you typeset documents in a universal markup language?

Question

I was recently thinking about preparing a document, and whether I could structure the information in that document in the most universal way possible. Ideally, document elements should be tagged as what they are - header, tagline, chapter, etc. Naturally, there is variation from document to document, so people can make their own tag labels.

Basically, this is just HTML, except I wanted to be sure every piece of information was tagged and structured, so that the formatting of the abstract “text information / data structure” could be switched between many types instantly, with no need to modify the source.

That just means that sentences should also be tagged, not just paragraphs with `<p>` tags, and ideally, any kind of diagram or figure is easy to break into pieces, to modularize, as well. Why? Because the point of this format is that everything, every element, is identified on the smallest possible level, so that for different formats, you can design a different system for displaying that data structure in that medium.

So, imagine we have this abstract document markup - title, author, section, paragraph, sentence, list, table, table header, etc.

Ideally, we can just plug it into some style/formatting package to make a new type of presentation: a webpage, a single sheet of paper such as a large research poster, a paginated book, a PowerPoint presentation, an automatically generated video, or even an interactive reading application where you navigate by pressing “next” or “back”. Each format knows how to handle the “universal markup”. For example, the universal markup may include a citation for a sentence, or a link for a word. The formats would know that in a book, citations can go at the bottom of the page; in an interactive reader, they can be omitted; on a website, they could be expressed as hyperlinks on that text, or as a parenthetical at the end of the sentence (“see here”). Similarly, a hyperlink would be expressed normally in a web document, whereas in a book it could become a footnote, be omitted, or appear as an annotation in the margin. And so on.

So: has anyone considered whether a minor extension to HTML, focused on explicitly tagging every element with its abstract function or role in the document, could serve as a universal document markup language that is easy to format in myriad ways?

score 0 · Answer 1 · answered Dec 17 '22 at 00:43

Technically, this is entirely possible (and is in fact done by a number of publishers, companies, and individuals). One related term is "single-source publishing", which means that a number of output formats are generated from the same "universal" source.

There are a number of difficulties, though. If you truly represent every little detail with its own special markers, which may also be local or specific to the context of the document, then accurately representing the document in a more consumable format and supporting all the specific extra features might require a lot of custom, unique styles, which might not be reusable for any other document or context.

Also, document features might not all be compatible with every output. Links in the source are fine as clickable links in a Web HTML export, and likely fine in an e-ink e-reader too (except there's no guarantee of Internet connectivity), but they may wrap or display poorly on a printout and wouldn't be clickable anyway. There's also a difference in what you would want as the link caption on the Web compared to printed text, where the link is probably better placed in a footnote or endnote. There are many little details like that.
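To make the link problem a bit more concrete, here is a minimal sketch in Python of one source rendered two ways. The document model (a paragraph as a list of strings and `(caption, url)` tuples), the function names `render_web` and `render_print`, and the `example.org` URL are all assumptions made up for illustration, not part of any real publishing tool.

```python
from html import escape

# Hypothetical document model: a paragraph is a list of plain strings
# and (caption, url) tuples that stand for links.
PARAGRAPH = [
    "See the ",
    ("style guide", "https://example.org/guide"),
    " for details.",
]

def render_web(parts):
    """Render links as clickable anchors for an HTML export."""
    out = []
    for part in parts:
        if isinstance(part, tuple):
            caption, url = part
            out.append(f'<a href="{escape(url, quote=True)}">{escape(caption)}</a>')
        else:
            out.append(escape(part))
    return "<p>" + "".join(out) + "</p>"

def render_print(parts):
    """Render links as numbered footnotes for a printed export."""
    body, notes = [], []
    for part in parts:
        if isinstance(part, tuple):
            caption, url = part
            notes.append(url)
            # Keep the caption in the running text, add a footnote marker.
            body.append(f"{caption}[{len(notes)}]")
        else:
            body.append(part)
    footnotes = "\n".join(f"[{i}] {url}" for i, url in enumerate(notes, start=1))
    return "".join(body) + ("\n\n" + footnotes if footnotes else "")

print(render_web(PARAGRAPH))
print(render_print(PARAGRAPH))
```

Even in this toy case the two renderers need different logic for the same element, which is the kind of per-format detail that multiplies as the markup gets more fine-grained.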

If HTML is fine, then it's especially easy, because HTML for Web documents is closely related to the XML family (at least HTML written in the XHTML fashion, which makes it XML-compliant). With XML, you can mix XML formats and introduce your own custom formats, then use XML tooling to extract or react to what's in the file. There's a whole range of XML formats that support various aspects of structuring text; for example, take a look at TEI (Text Encoding Initiative).
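As a rough illustration of mixing a custom format into XHTML and then pulling it back out with XML tooling, here is a small sketch using Python's standard `xml.etree.ElementTree`. The namespace URI `http://example.org/ns/docroles`, the `d:title` and `d:sentence` elements, and the `cite` attribute are invented for this example; they are not TEI and not any existing standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace for custom document roles (an assumption for this sketch).
DOC_NS = "http://example.org/ns/docroles"

# XHTML with custom elements mixed in via a second namespace.
SOURCE = """\
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:d="http://example.org/ns/docroles">
  <body>
    <h1><d:title>On Universal Markup</d:title></h1>
    <p>
      <d:sentence>Structure comes first.</d:sentence>
      <d:sentence cite="ref-1">Formatting can be decided later.</d:sentence>
    </p>
  </body>
</html>
"""

def extract_sentences(xml_text):
    """Return (text, citation-or-None) for every custom sentence element."""
    root = ET.fromstring(xml_text)
    ns = {"d": DOC_NS}
    return [(s.text, s.get("cite")) for s in root.iterfind(".//d:sentence", ns)]

def extract_title(xml_text):
    """Return the text of the first custom title element, or None if absent."""
    root = ET.fromstring(xml_text)
    title = root.find(".//d:title", {"d": DOC_NS})
    return title.text if title is not None else None

print(extract_title(SOURCE))
for text, cite in extract_sentences(SOURCE):
    print(text, "(cites " + cite + ")" if cite else "")
```

A real pipeline would more likely use XSLT or a schema-aware toolchain, but the idea is the same: the custom elements live alongside the XHTML ones and can be selected by namespace.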
