The Only Two Markup Languages
2026-01-19
There are only two families of proper arbitrary markup languages: TeX and SGML. (I would normally link to the official standard as a reference, but it’s behind the “wonderful” ISO paywall: ISO 8879:1986.) By arbitrary, I mean the grammar specifically, and how it can be used to mark arbitrary plain text with information. And by proper, I mean the ability to have standalone nodes, user-definable nodes, nodes with attributes, and the wrapping of plain text. Everything else either lacks one of these capabilities, or is a derivative or syntactic makeover of TeX or SGML.
The Two Families
The TeX family:
\foo
\foo{wrapped text}
\foo[attrib=value]{wrapped text}
\foo[attrib=value]
The SGML family:
<foo />
<foo>wrapped text</foo>
<foo attrib="value">wrapped text</foo>
<foo attrib="value" />
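The four shapes in each family describe the same underlying node: a name, optional attributes, and optional wrapped text. As a rough sketch (the Node type and the to_tex and to_sgml helpers are hypothetical names for illustration, and no escaping is done), the same node can be rendered in either family:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One markup node: a name, optional attributes, optional wrapped text."""
    name: str
    attribs: dict[str, str] = field(default_factory=dict)
    text: Optional[str] = None  # None means a standalone node


def to_tex(node: Node) -> str:
    """Render the node in the TeX-family shape."""
    out = "\\" + node.name
    if node.attribs:
        out += "[" + ",".join(f"{k}={v}" for k, v in node.attribs.items()) + "]"
    if node.text is not None:
        out += "{" + node.text + "}"
    return out


def to_sgml(node: Node) -> str:
    """Render the node in the SGML-family shape."""
    attribs = "".join(f' {k}="{v}"' for k, v in node.attribs.items())
    if node.text is None:
        return f"<{node.name}{attribs} />"
    return f"<{node.name}{attribs}>{node.text}</{node.name}>"


node = Node("foo", {"attrib": "value"}, "wrapped text")
print(to_tex(node))
print(to_sgml(node))
```

Both renderers take the same three pieces of information; only the delimiters differ.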
This does mean I am excluding things like Markdown, troff, IBM’s GML, Wiki, Emacs Org-Mode, etc. (I use a variant of Markdown to write these articles: a custom extension of CommonMark to which I’ve added the ability to write these margin notes and some other things, like trivial YouTube video embedding.) The reason for this exclusion is that they are neither arbitrary nor proper. Their syntax has fixed procedural or semantic meaning, and it cannot be arbitrarily extended.
For example in Markdown, [text](link) has a very specific intrinsic semantic meaning. Whilst in SGML/XML, <foo>blah</foo> has no intrinsic semantic meaning without some extra program to enforce that.
For other similar markup languages, I am classing things like BBCode as part of the SGML family, and Scribe as part of the TeX family. BBCode is effectively a near-subset of XHTML but uses [] instead of <>.
Scribe’s syntax is effectively similar to TeX’s, but there is no equivalent of {}: it either requires paired blocks (e.g. @Begin(Quotation) and @End(Quotation)) or adding the plain text as part of the attributes (e.g. @Foo(tag=bar, title="The Title")). This lack of a {}-like wrapping ability is why I think Scribe’s syntax is fundamentally flawed compared to actual TeX.
Edge Cases of the Syntaxes
Both families have a means of writing, as literal text, the symbols that are otherwise used for the mark-up itself. For SGML you do &lt;, and for TeX you do \\. (For a minimal syntax to prevent escaping issues in an SGML-like language, you need 5 escaped entities: & → &amp;, ' → &apos; or &#39;, < → &lt;, > → &gt;, and " → &quot; or &#34;. However, there are literally thousands of XML/HTML entities out there, and supporting them correctly has been “fun” for packages in Odin.) The SGML approach differs from TeX’s in that escaping uses a different syntax from regular markup, whilst TeX unifies the two.
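The two escaping styles can be sketched side by side. This is only an illustration: escape_sgml covers just the five minimal entities (&amp;, &lt;, &gt;, &quot;, &apos;), and escape_tex_like assumes a hypothetical TeX-like dialect where any special character is escaped by prefixing a backslash. Real LaTeX, for instance, reserves \\ for a line break, so this is not a LaTeX escaper.

```python
# The five characters that have to be escaped in an SGML-like syntax.
# "&" must come first so the entities produced later are not escaped twice.
SGML_ESCAPES = [
    ("&", "&amp;"),
    ("<", "&lt;"),
    (">", "&gt;"),
    ('"', "&quot;"),
    ("'", "&apos;"),
]


def escape_sgml(text: str) -> str:
    """Replace each special character with its named entity."""
    for char, entity in SGML_ESCAPES:
        text = text.replace(char, entity)
    return text


def escape_tex_like(text: str, special: str = "\\{}[]") -> str:
    """Prefix each special character with a backslash (assumed rule set)."""
    return "".join("\\" + ch if ch in special else ch for ch in text)


print(escape_sgml('a < b & "c"'))
print(escape_tex_like("a\\b{c}"))
```

The SGML side needs a lookup table of entity names, whereas the TeX-like side reuses the same escape character that introduces every other piece of markup.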
There is also the second aspect that the TeX family of syntaxes is much easier to parse than the SGML family. I’ve written parsers for both before, and the SGML syntax requires an order of magnitude more code, because of the named blocks for wrapping.
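To give a feel for the TeX side, here is a minimal recursive-descent sketch for the four TeX-family shapes shown earlier. It is not a real TeX parser: the parse function, its tuple output format, and its error handling (there is none, so an unclosed brace is silently accepted) are all assumptions made for illustration.

```python
def parse(src: str, pos: int = 0, closer: str = ""):
    """Parse src from pos until `closer` (or the end of the input).

    Returns (items, new_pos). Items are plain strings or
    (name, attribs, children) tuples; children is None for standalone nodes.
    """
    items, text = [], []
    while pos < len(src) and src[pos] != closer:
        if src[pos] != "\\":
            text.append(src[pos])
            pos += 1
            continue
        # The node name is the run of letters after the backslash.
        end = pos + 1
        while end < len(src) and src[end].isalpha():
            end += 1
        if end == pos + 1:
            # Backslash followed by a non-letter: an escaped character.
            text.append(src[end:end + 1])
            pos = end + 1
            continue
        if text:
            items.append("".join(text))
            text = []
        name, pos = src[pos + 1:end], end
        attribs, children = {}, None
        if pos < len(src) and src[pos] == "[":
            close = src.index("]", pos)
            for pair in src[pos + 1:close].split(","):
                if pair.strip():
                    key, _, value = pair.partition("=")
                    attribs[key.strip()] = value.strip()
            pos = close + 1
        if pos < len(src) and src[pos] == "{":
            # The wrapped body is parsed recursively up to the generic "}".
            children, pos = parse(src, pos + 1, "}")
            pos += 1  # skip the closing brace
        items.append((name, attribs, children))
    if text:
        items.append("".join(text))
    return items, pos


tree, _ = parse(r"a \foo[attrib=value]{wrapped \b{text}} z")
print(tree)
```

Because the closing delimiter is always the same generic brace, the recursion never has to remember or compare node names, which is where an SGML-style parser picks up extra work.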
SGML wrapping syntax also has the flaw of allowing for overlapping hierarchies:
<a><b>text</a></b>
TeX syntax does not have this problem as it only uses generic brackets/braces. I cannot think of a case where overlapping hierarchies are desired; in fact, HTML parsers have to mitigate this possible typo.
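Detecting this kind of overlap is the extra bookkeeping that named closing tags bring with them. A rough sketch, assuming a deliberately simplified tag pattern (attribute values containing > are not handled) and a hypothetical find_overlap helper:

```python
import re

# Simplified tag pattern: optional "/", a name, anything up to ">",
# and an optional trailing "/" for standalone nodes.
TAG = re.compile(r"<(/?)([A-Za-z][\w-]*)[^<>]*?(/?)>")


def find_overlap(src: str):
    """Return (expected, found) for the first mismatched close tag, else None."""
    stack = []
    for match in TAG.finditer(src):
        closing, name, self_closing = match.groups()
        if self_closing:
            continue  # <foo /> opens and closes in one go
        if not closing:
            stack.append(name)
        elif not stack:
            return (None, name)  # close tag with nothing open
        else:
            expected = stack.pop()
            if expected != name:
                return (expected, name)
    return None


print(find_overlap("<a><b>text</a></b>"))
print(find_overlap("<a><b>text</b></a>"))
```

With generic braces there is no name on the closer to compare, so this particular class of error cannot be written in the first place.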
Real-life TeX syntaxes do have their own edge cases that deviate from the general markup syntax, but this is more to do with wanting to express mathematical formulations in real text than with keeping to a syntactic “purity”. (If you want a good example of this, I recommend reading this wonderful Cheatsheet to see the numerous syntactic exceptions which exist for practical, pragmatic purposes, even if that means the syntax parser is now a hell of a lot more complex.)
JSON is not an Alternative
Expanding upon the proper aspect, the need for attributes is very important. It is common to see people replace XML with JSON nowadays, as JSON is easier to read, write, and parse than XML. However, JSON is a strict tree with no attribute system. Attributes are a necessary aspect of a markup language as they allow for adding extra information to a node without requiring any children or wrapped text. Many people emulate attributes in JSON with other sibling nodes, which is not equivalent at all. Attributes (or tagging) are important because they add extra information to tree-like structures, and sadly they are missing from a lot of languages. JSON is also not a markup language in that it cannot be used to mark up arbitrary text; rather, it has a specific hierarchical format it requires, and it cannot be placed anywhere within plain text.
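The loss of information can be sketched with Python’s standard library. The naive_to_dict conversion below is a hypothetical example of the sibling-key emulation, not any standard XML-to-JSON mapping:

```python
import json
import xml.etree.ElementTree as ET


def naive_to_dict(element):
    """Hypothetical lossy conversion: attributes and children share one namespace."""
    body = dict(element.attrib)
    for child in element:
        body[child.tag] = child.text
    return {element.tag: body}


# A standalone node carrying its information purely as an attribute...
with_attribute = ET.fromstring('<foo attrib="value" />')
# ...and a different document where the same information is a child node.
with_child = ET.fromstring("<foo><attrib>value</attrib></foo>")

as_json_a = json.dumps(naive_to_dict(with_attribute))
as_json_b = json.dumps(naive_to_dict(with_child))

print(as_json_a)
print(as_json_a == as_json_b)
```

Two structurally different documents collapse into the same JSON, so the attribute/child distinction can only be recovered by inventing a naming convention on top.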
I’d also argue other languages like YAML or TOML are definitely not forms of markup languages, even if YAML originally stood for “Yet Another Markup Language” (it was later renamed to “YAML Ain’t Markup Language”). These are both forms of configuration languages. And when people have replaced XML with JSON, it’s because they were using XML as a (verbose) configuration language rather than as any kind of markup language. And I’d even argue JSON should not be used as a configuration language either, only as a lightweight text-based data-interchange format, since that is its entire purpose.
n.b. I am still not sure why people think XML is a “human readable” language, and keep repeating this adage. Yes it is “readable” but it is not quickly comprehendible. Please use nearly anything but XML for a configuration language. INI is honestly still good for most people’s needs.
YAML is also a monstrosity and should never be used by anyone for any reason. It’s nigh-impossible to write a parser for it and it has too many syntactical ambiguities. It has led to numerous infamous situations such as The Norway Problem (“Just say Norway to YAML”). Also, fun fact: YAML is actually a superset of JSON, so all valid JSON documents are also valid YAML documents. This is beyond cursed, but as this is not an article on YAML, I will stop my micro-rant here.
Applying this to Odin
Odin (the general-purpose programming language that I have created, which I hope most people who read my articles know already) is not a markup language, but I deliberately designed three distinct extension mechanisms so the language can grow in the future without forcing new foundational syntax. I achieve this with attributes on declarations, struct field tags, and directives. These three constructs have different syntaxes because they reflect different semantic meanings.
Attributes can be applied to any declaration and have the following syntax:
@(key)
@(key=value)
@(a=b, c=d)
@key // as a minor shorthand
The attributes can be applied arbitrarily too for different purposes. This means if some new functionality is needed in Odin, it can be added.
The next syntax is struct field tags, which is just a string literal applied to the end of a struct field.
Foo :: struct { x: T `extra information here` }
This extra information is stored in RTTI and can then be used by the program/user to do whatever they need at runtime.
The last is the directive syntax. Its general syntax looks like this:
#key
#key(args)
And these can be either standalone, or applied to expressions, types, or statements. They are not applied to declarations, to keep a distinct semantic meaning. These are all forms of a kind of semantic mark-up, but they exist as an escape hatch for future (or generally present) needs where the syntax might have been limiting. Attributes and directives are both TeX-like, whilst struct field tags are only a kind of “attribute syntax”.
Conclusion
Most of the papers I have read (Coombs, James H.; Renear, Allen H.; DeRose, Steven J. (November 1987), “Markup Systems and the Future of Scholarly Text Processing”; Bray, Tim (9 April 2003), “On Semantics and Markup, Taxonomy of Markup”) don’t seem to talk about the syntactical distinction made in this article, but rather make only semantic distinctions such as presentational, procedural, or descriptive.
I know this is probably a very nerdy syntax post, but I do find it interesting, and I don’t know if it has been talked about before by other people. If anyone knows of any other family of proper arbitrary markup syntaxes, please tell me as I’d love to know, but most of them seem to fall into one of these two categories. I personally prefer the TeX family because it removes the problems of overlapping hierarchies, minimizes clutter from redundant characters, and is a heck of a lot simpler to parse.